What is a Policy?
A Privitar Policy represents a transformation of input conforming to a specified Schema into an output with privacy-preserving de-identification applied.
The Policy specifies, for each column in the input, the de-identification process (if any) that should be applied to that column. The output is new tables in a new location that consist of the transformed columns.
The techniques used by Privitar to de-identify data are:
Masking
Tokenization
Generalization
Each of these techniques is treated as part of a Privitar Policy.
After a Policy has been defined, its execution against a dataset is controlled by a Job.
Masking
Masking refers to a process where sensitive values are removed/redacted (for example, clipped or substituted) or obscured (for example, perturbed).
Masking can be applied with Batch (Hadoop), Data Flow and Privitar on Demand Jobs, but certain Rule types are only available in specific Job execution modes.
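To make the concept concrete, the following Python sketch shows three generic masking operations: clipping, substitution, and perturbation. The function names, parameters, and masking formats are illustrative assumptions only and do not reflect Privitar's Rule implementations.

```python
import random

# Illustrative masking operations -- an assumption-based sketch,
# not Privitar's implementation.

def clip(value: str, keep: int = 4) -> str:
    """Keep only the first `keep` characters and redact the rest."""
    return value[:keep] + "*" * (len(value) - keep)

def substitute(value: str, replacement: str = "REDACTED") -> str:
    """Replace the sensitive value with a fixed placeholder."""
    return replacement

def perturb(value: float, scale: float = 5.0) -> float:
    """Obscure a numeric value by adding random noise."""
    return value + random.uniform(-scale, scale)

print(clip("4111111111111111"))   # e.g. 4111************
print(substitute("Jane Smith"))   # REDACTED
print(perturb(52.0))              # e.g. 49.3
```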
Tokenization
Tokenization refers to replacing raw values with generated tokens.
Privitar supports two types of Tokenization:
Random Tokenization: raw values are replaced with randomly generated tokens.
Derived Tokenization: raw values are replaced with a token derived from the encrypted value of the input.
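As a rough illustration of the difference between the two types (not Privitar's token-generation algorithm), the sketch below produces a random token from a cryptographically secure random source and a derived token by applying a keyed HMAC to the input. The key handling and token format are assumptions for the example only.

```python
import hmac
import hashlib
import secrets

# Illustrative only -- Privitar's token generation is not shown here.

def random_token(length: int = 16) -> str:
    """Random tokenization: the token carries no information about the input."""
    return secrets.token_hex(length // 2)

def derived_token(value: str, key: bytes) -> str:
    """Derived tokenization: the token is computed from a keyed transformation
    of the input, so the same value and key always yield the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

key = secrets.token_bytes(32)
print(random_token())                     # different on every call
print(derived_token("555-12-3456", key))  # same for the same input and key
```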
The process of tokenization may be consistent, meaning that the same input value always results in the same token whenever it is processed by the same Rule within the same Protected Data Domain. This is known as Preserve Data Consistency and is an option available on Rules that perform tokenization.
This is important when considering relationships in the data. Consistent tokenization is required if a keyed relationship is to be maintained, because the same value (that is, the same random token) must be present in several de-identified tables. For Rules that are configured to mask or tokenize consistently, you can optionally allow unmasking of original values; that is, enable re-identification of data that was de-identified through that Rule.
If tokenization is not consistent, a new token will be created for each occurrence of a value. This means there will be multiple tokens for the same input value if it occurs more than once in the data.
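The difference between consistent and non-consistent tokenization can be sketched with a simple in-memory token mapping. The class below is an illustrative assumption for this example and does not describe how Privitar stores or manages tokens within a Protected Data Domain.

```python
import secrets

class TokenVault:
    """Illustrative in-memory vault contrasting consistent and
    non-consistent tokenization. Not Privitar's implementation."""

    def __init__(self) -> None:
        self._mapping: dict[str, str] = {}

    def tokenize(self, value: str, consistent: bool = True) -> str:
        if consistent:
            # The same input always maps to the same token within this vault,
            # so keyed relationships across tables are preserved.
            if value not in self._mapping:
                self._mapping[value] = secrets.token_hex(8)
            return self._mapping[value]
        # Non-consistent: a fresh token for every occurrence of the value.
        return secrets.token_hex(8)

vault = TokenVault()
print(vault.tokenize("CUST-001"))                    # e.g. 'a1b2c3...'
print(vault.tokenize("CUST-001"))                    # same token again
print(vault.tokenize("CUST-001", consistent=False))  # a new token each time
```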
Tokenization can be applied with Batch (Hadoop), Data Flow and Privitar on Demand Jobs, but certain Rule types are only available in specific Job execution modes.
Generalization
Generalization refers to the process of blurring sensitive values in input data to mitigate re-identification risks and defend against linkage attacks.
Privitar supports two types of generalization processing:
Manual generalization
Automatic generalization
Manual generalization without k-Anonymity constraints can be applied with Batch (Hadoop), Data Flow and Privitar on Demand Jobs. Manual generalization with k-Anonymity enabled and Automatic generalization can only be applied with Batch Jobs, because the mandatory k-Anonymity constraints must be checked across all records.
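As a conceptual sketch only (the bucket boundaries and the value of k are arbitrary assumptions, and this is not Privitar's generalization algorithm), the following code blurs exact ages into bands and then checks whether every combination of quasi-identifiers appears at least k times, which is the property that k-Anonymity requires.

```python
from collections import Counter

# Illustrative sketch of generalization plus a k-anonymity check.
# Bucket widths and k are assumptions, not Privitar defaults.

def generalize_age(age: int, width: int = 10) -> str:
    """Blur an exact age into a band, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def is_k_anonymous(records: list[tuple], k: int) -> bool:
    """Every combination of quasi-identifiers must occur at least k times."""
    counts = Counter(records)
    return all(count >= k for count in counts.values())

raw = [(34, "W1"), (37, "W1"), (36, "W1"), (52, "W2"), (55, "W2"), (58, "W2")]
generalized = [(generalize_age(age), postcode[:1]) for age, postcode in raw]

print(generalized)                       # [('30-39', 'W'), ('30-39', 'W'), ...]
print(is_k_anonymous(generalized, k=3))  # True
```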