What is a PDD?
A Protected Data Domain (PDD) is a logical collection of datasets, brought together for a specific purpose, with strict linkage isolation from datasets in other PDDs.
A PDD should be created whenever datasets are required for a particular use, such as an analysis project, a development/testing activity, or sharing with an external party. The PDD definition remains in the platform, allowing data to be added incrementally over time.
Processing in the platform always produces output in a destination PDD. The destination PDD is selected when a de-identification Job is run.
In the platform, PDDs contain the following information:
Metadata about the specific intended usage of the de-identified data. For example, the recipient and purpose of the data release.
The output location for de-identified data, if a Hadoop cluster is configured in the Environment. This may be a location in HDFS and/or a Hive database.
Metadata on the published datasets, including the Policies and Jobs used to process them.
Information about the Token Vaults used by masking rules.
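As a rough mental model only, the information listed above can be pictured as a simple record. The sketch below is not the platform's data model or API: every class name and field name (ProtectedDataDomain, PublishedDataset, recipient, hdfs_location, and so on) is a hypothetical label chosen to mirror the four items in the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PublishedDataset:
    """Hypothetical record of one dataset published to a PDD."""
    name: str
    policy: str  # name of the Policy used to process the dataset
    job: str     # name of the Job used to process the dataset


@dataclass
class ProtectedDataDomain:
    """Hypothetical model of the information a PDD holds."""
    name: str
    # Metadata about the intended usage of the de-identified data
    recipient: str
    purpose: str
    # Output location; only relevant if a Hadoop cluster is configured
    hdfs_location: Optional[str] = None
    hive_database: Optional[str] = None
    # Metadata on the published datasets
    datasets: List[PublishedDataset] = field(default_factory=list)
    # Token Vaults used by masking rules (names are illustrative)
    token_vaults: List[str] = field(default_factory=list)

    def publish(self, dataset: PublishedDataset) -> None:
        # Data can be added incrementally; the PDD definition persists.
        self.datasets.append(dataset)


pdd = ProtectedDataDomain(
    name="fraud-analysis",
    recipient="Analytics team",
    purpose="Fraud model development",
    hive_database="fraud_deid",
)
pdd.publish(PublishedDataset(name="transactions", policy="txn-policy", job="txn-job"))
```

The record is created once and datasets are appended to it over time, which matches the way a PDD definition stays in the platform while data is added incrementally.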
Protected Data Domains and Data Consistency
Datasets published to the same PDD will have preserved data consistency when processed by Rules with the Preserve Data Consistency option enabled.
Datasets published to one PDD will always be inconsistent with datasets published to a different PDD, even if the same Policies and Jobs are used to process them.
This isolation means that while data can be linked within a single PDD, recipients of data from different PDDs cannot link records across those PDDs.
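The behaviour described above can be illustrated with a keyed hash, where each PDD has its own secret key. This is a minimal sketch of the general idea and makes no claim about how the platform actually implements consistency; the function names make_pdd_key and tokenize, and the use of HMAC-SHA256, are assumptions made purely for illustration.

```python
import hashlib
import hmac
import secrets


def make_pdd_key() -> bytes:
    # Hypothetical: one random secret per PDD.
    return secrets.token_bytes(32)


def tokenize(value: str, pdd_key: bytes) -> str:
    # Same value + same key -> same token (consistent within a PDD).
    # Same value + different key -> different token (isolated across PDDs).
    digest = hmac.new(pdd_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


pdd_a_key = make_pdd_key()
pdd_b_key = make_pdd_key()

# Two datasets published to PDD A: the shared identifier maps to the same token,
# so records can still be joined inside PDD A.
customers_a = tokenize("alice@example.com", pdd_a_key)
orders_a = tokenize("alice@example.com", pdd_a_key)

# The same identifier published to PDD B maps to an unrelated token,
# so a recipient of PDD B cannot join against data from PDD A.
customers_b = tokenize("alice@example.com", pdd_b_key)

print(customers_a == orders_a)
print(customers_a == customers_b)
```

In this sketch the per-PDD key is what provides the isolation: reusing the same tokenization logic for two PDDs does not make their outputs linkable, in the same way that reusing the same Policies and Jobs does not.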
Protected Data Domains and Watermarking
Datasets published to the same PDD may optionally contain a unique signature embedded as a hidden Watermark. This allows a file to be traced back to its original PDD.
This is useful, for example, in situations where data that has been shared externally needs to be attributed back to a PDD following a data breach. The PDD contains useful metadata on the file, such as the name of the authorizer and its original recipient.
Warning
When you run data flow or POD jobs, the PDD remains editable after data has been processed into it. If watermarking was disabled when some data was processed and you enable it afterwards, the PDD will hold a mix of watermarked and unwatermarked data, which can make it harder to investigate a watermark later.
For more information about this feature, see Watermarking a Dataset.