User Guide

What is a PDD?

A Protected Data Domain (PDD) is a logical collection of datasets, brought together for a specific purpose, with strict linkage isolation from datasets in other PDDs.

A PDD should be created whenever datasets are required for a particular use, such as an analysis project, a development/testing activity, or sharing with an external party. The PDD definition remains in Privitar, allowing data to be added incrementally over time.

Processing in Privitar always produces output in a destination PDD. The PDD is selected when a de-identification Job is used.

In Privitar, PDDs contain the following information:

  • Metadata about the specific intended usage of the de-identified data. For example, the recipient and purpose of the data release.

  • The output location for de-identified data, if a Hadoop cluster is configured in the Environment. This may be a location in HDFS and/or a Hive database.

  • Metadata on the published datasets, including the Policies and Jobs used to process them.

  • Information about the Token Vaults used by masking rules.
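As a rough mental model only, the information listed above can be pictured as a single record. The sketch below is illustrative Python, not a Privitar API: the class names, field names, and example values are all hypothetical and were chosen to mirror the four bullet points.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PublishedDataset:
    """Hypothetical record of one dataset published to a PDD."""
    name: str
    policy: str  # Policy used to process the dataset
    job: str     # Job used to process the dataset


@dataclass
class ProtectedDataDomain:
    """Hypothetical model of the information a PDD holds."""
    name: str
    # Metadata about the intended usage of the de-identified data
    recipient: str
    purpose: str
    # Output location, relevant only if a Hadoop cluster is configured
    hdfs_location: Optional[str] = None
    hive_database: Optional[str] = None
    # Metadata on the published datasets
    published_datasets: List[PublishedDataset] = field(default_factory=list)
    # Names of the Token Vaults used by masking rules
    token_vaults: List[str] = field(default_factory=list)

    def record_publication(self, dataset: PublishedDataset) -> None:
        """Add a dataset to the PDD; data can be added incrementally."""
        self.published_datasets.append(dataset)


# Example values are invented for illustration.
pdd = ProtectedDataDomain(
    name="fraud-analysis",
    recipient="Analytics team",
    purpose="Fraud model development",
    hive_database="fraud_deid",
)
pdd.record_publication(
    PublishedDataset("transactions", "Transactions Policy", "Transactions Job")
)
```

Because the definition persists, further calls to `record_publication` would extend the same record over time rather than create a new one.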

Protected Data Domains and Data Consistency

Datasets published to the same PDD remain consistent with one another when they are processed by Rules that have the Preserve Data Consistency option enabled.

Datasets published to a specific PDD will always be inconsistent with datasets published to a different PDD, even if the same Policies and Jobs are used to process them.

This isolation means that data can be linked within a single PDD, but recipients of different PDDs cannot link records across those PDDs.
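The effect of this isolation can be illustrated with a small conceptual sketch. It assumes, purely for illustration, that each PDD has its own secret key and that tokens are derived with a keyed hash (HMAC-SHA256). This is not a description of how Privitar generates tokens; the function names and the key-per-PDD scheme are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_pdd_key() -> bytes:
    """Generate a random secret for one (hypothetical) PDD."""
    return secrets.token_bytes(32)


def tokenize(value: str, pdd_key: bytes) -> str:
    """Derive a deterministic token for a value under a PDD-specific key."""
    digest = hmac.new(pdd_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Two PDDs, each with an independent key (an assumption of this sketch).
pdd_a_key = new_pdd_key()
pdd_b_key = new_pdd_key()

customer_id = "customer-1001"

# Same value in two datasets published to PDD A: tokens match,
# so the datasets can still be joined on this column.
token_in_orders = tokenize(customer_id, pdd_a_key)
token_in_payments = tokenize(customer_id, pdd_a_key)

# Same value published to PDD B: a different key yields an unrelated token.
token_in_other_pdd = tokenize(customer_id, pdd_b_key)

same_pdd_consistent = token_in_orders == token_in_payments
cross_pdd_linkable = token_in_orders == token_in_other_pdd
```

In this sketch `same_pdd_consistent` is true, because the key and input are identical, while `cross_pdd_linkable` is false unless two independent random keys happen to produce the same truncated digest, which is vanishingly unlikely.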

Protected Data Domains and Watermarking

Datasets published to the same PDD may optionally contain a unique signature embedded as a hidden Watermark. This allows a file to be traced back to its original PDD.

This is useful, for example, in situations where data that has been shared externally needs to be attributed back to a PDD following a data breach. The PDD contains useful metadata on the file, such as the name of the authorizer and its original recipient.

For more information about this feature, see Watermarking a Dataset.