
User Guide

De-identifying data in 4 steps

The four key concepts to understand when de-identifying any dataset using the Privitar Platform are:

Overview of Schemas, Policies, Protected Data Domains and Jobs

  • A Schema is a description of the tables and columns of the input data, typically a Hive database or a set of Avro, Parquet, or CSV files. All data input to Privitar must conform to a Schema.
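Conceptually, a Schema maps each table to a set of typed columns, and incoming records must match those definitions. The sketch below models this with a plain Python dictionary; the type names ("INTEGER", "TEXT", "DATE") and the table layout are illustrative assumptions, not Privitar's actual data types or Schema format.

```python
# A minimal, hypothetical sketch of a Schema: tables mapped to typed columns.
# The type names here are illustrative, not Privitar's actual data types.
schema = {
    "name": "customers_extract",
    "tables": {
        "customers": [
            {"column": "customer_id", "type": "INTEGER"},
            {"column": "full_name", "type": "TEXT"},
            {"column": "date_of_birth", "type": "DATE"},
        ],
        "orders": [
            {"column": "order_id", "type": "INTEGER"},
            {"column": "customer_id", "type": "INTEGER"},
        ],
    },
}

def conforms(record: dict, table: str) -> bool:
    """Check that a record supplies exactly the columns the Schema defines."""
    expected = {c["column"] for c in schema["tables"][table]}
    return set(record) == expected

print(conforms({"order_id": 1, "customer_id": 7}, "orders"))  # True
print(conforms({"order_id": 1}, "orders"))                    # False
```

In this sense "all data input to Privitar must conform to a Schema" means every record presents the columns, with the declared types, that its table definition specifies.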

  • A Policy describes a transformation of input into an output with privacy-preserving changes applied. Policies are the primary way in which privacy transformations are represented, so it is important that the Policy is created in line with an organisation's privacy requirements.

  • A Protected Data Domain (PDD) is a set of de-identified datasets intended for the same use. PDDs also capture metadata information such as restrictions, names of stakeholders, and data location. Datasets published to the same PDD can be de-identified consistently so that they are linkable; datasets published to different PDDs are not linkable. De-identified data can be traced back to its PDD using its embedded watermark, which is added to data during the de-identification process.
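One way to see why datasets in the same PDD are linkable while datasets in different PDDs are not is keyed pseudonymization: if every dataset published to a PDD tokenizes values under that PDD's key, equal inputs yield equal tokens within the domain, but a different key yields unrelated tokens. This is a conceptual sketch only, not a description of Privitar's actual mechanism, and the keys shown are made up.

```python
import hashlib
import hmac

def pseudonymize(value: str, pdd_key: bytes) -> str:
    # Deterministic keyed hash: the same value under the same PDD key
    # always produces the same token.
    return hmac.new(pdd_key, value.encode(), hashlib.sha256).hexdigest()[:16]

key_a = b"pdd-a-secret"  # hypothetical per-PDD key
key_b = b"pdd-b-secret"  # hypothetical per-PDD key

# Linkable within one PDD: two datasets tokenize the same email identically.
assert pseudonymize("alice@example.com", key_a) == pseudonymize("alice@example.com", key_a)

# Not linkable across PDDs: the same email yields unrelated tokens.
assert pseudonymize("alice@example.com", key_a) != pseudonymize("alice@example.com", key_b)
```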

  • A Job represents the execution of a Policy on some input data, for a destination PDD. Batch Jobs are launched from Privitar as a Spark job on a Hadoop cluster, and can also be used to re-run specific processing if new data is received or if there is a change to a Policy. Data Flow Jobs are used with external data flow pipelines, such as Apache NiFi, Kafka Connect/Confluent or StreamSets. Privitar On Demand Jobs can be used to enable de-identification in applications via an HTTP API.
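For the On Demand case, an application would submit records to the Privitar On Demand server over HTTP. The sketch below only builds such a request without sending it; the endpoint URL, path, and payload field names are invented for illustration and are not Privitar On Demand's actual API.

```python
import json
from urllib import request

# Hypothetical endpoint: Privitar On Demand's real API paths, field names,
# and authentication are product-specific and not taken from this guide.
ENDPOINT = "https://pod.example.internal/api/deidentify"

payload = {
    "jobName": "customers_on_demand",  # assumed field name
    "records": [
        {"full_name": "Alice Smith", "date_of_birth": "1990-01-01"},
    ],
}

req = request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# request.urlopen(req) would submit the records for de-identification;
# it is left un-executed here because the endpoint is illustrative.
print(req.get_method(), req.full_url)
```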

Getting started

The procedure to de-identify a dataset using the Privitar platform is:

  1. Define a Schema for the new input data. Each table in the Schema contains column definitions, specified in terms of Privitar data types.

    • If there are multiple tables in the data, for example if the data is an extract from a relational database, the Schema will contain multiple corresponding table definitions.

  2. Create a Policy representing the de-identification operations required. The Schema provides the basis for how the Policy is structured.

  3. Define the Protected Data Domain to be used as the destination for all data related to the use case in question.

  4. Based on the Policy, create and execute a Job. This is done by selecting a Policy and a destination Protected Data Domain.

    • Batch Jobs will be executed from Privitar.

    • Data Flow Jobs will be referenced from external processing pipelines.

    • Privitar On Demand Jobs are available via an API from the Privitar On Demand server.

    For more information about the different job types that are available from Privitar, see What is a Job?
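The four steps above form a simple dependency chain: the Schema shapes the Policy, and a Job combines a Policy with a destination PDD. The stubs below are invented purely to make that ordering explicit; Privitar exposes these operations through its user interface rather than these function names, and the rule name "tokenize" is hypothetical.

```python
# Illustrative stubs only: these function names are not Privitar's API;
# they exist to show the dependency order of the four steps.

def define_schema(tables: dict) -> dict:
    return {"tables": tables}                            # Step 1: Schema

def create_policy(schema: dict, rules: dict) -> dict:
    return {"schema": schema, "rules": rules}            # Step 2: Policy, structured by the Schema

def define_pdd(name: str) -> dict:
    return {"name": name}                                # Step 3: Protected Data Domain

def run_job(policy: dict, pdd: dict, kind: str) -> dict:
    return {"policy": policy, "pdd": pdd, "kind": kind}  # Step 4: Job = Policy + destination PDD

schema = define_schema({"customers": ["customer_id", "full_name"]})
policy = create_policy(schema, {"full_name": "tokenize"})  # hypothetical rule
pdd = define_pdd("marketing_analytics")
job = run_job(policy, pdd, kind="batch")  # or "data-flow" / "on-demand"
print(job["kind"])  # batch
```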