User Guide

What is k-anonymity?

To understand the risks of linkage attacks and re-identification, the following definitions are useful:

  • A direct identifier in a dataset is a single column that contains an unambiguous reference to an individual, for example a social security number.

  • A quasi-identifier in a dataset is a set of columns that, taken together, can be used to reference individuals. In a dataset, given a value for each of the quasi-identifier columns, it may be possible to identify a single individual, which is undesirable in a privacy context.

The Privitar Platform can mitigate this risk by ensuring that, after processing, every combination of quasi-identifier values that appears in the data is shared by at least a Minimum Cluster Size (also called k) of records. Even if the quasi-identifier values of a specific individual are known, it is then no longer possible to tell which row in the matching cluster belongs to that individual; the search can be narrowed only to a set of rows of at least the Minimum Cluster Size. This property is known as k-anonymity.
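The definition above can be checked mechanically. The following is a minimal sketch in Python, not part of the Privitar Platform: the function names, column names and sample rows are hypothetical and chosen only for illustration. It counts how many rows share each combination of quasi-identifier values and compares the smallest count with k.

```python
from collections import Counter


def smallest_cluster(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    combination of quasi-identifier values (0 for an empty dataset)."""
    counts = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(counts.values()) if counts else 0


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k rows.
    An empty dataset is treated as trivially k-anonymous."""
    return not rows or smallest_cluster(rows, quasi_identifiers) >= k


# Hypothetical sample data: age_band and postcode_area are the quasi-identifiers.
rows = [
    {"age_band": "30-39", "postcode_area": "SW1", "diagnosis": "A"},
    {"age_band": "30-39", "postcode_area": "SW1", "diagnosis": "B"},
    {"age_band": "40-49", "postcode_area": "N1", "diagnosis": "C"},
]
quasi_identifiers = ["age_band", "postcode_area"]

print(smallest_cluster(rows, quasi_identifiers))
print(is_k_anonymous(rows, quasi_identifiers, k=2))
```

In this sample, the single row with age band 40-49 in area N1 forms a cluster of size one, so the data is not 2-anonymous.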

Privitar provides two approaches to achieving k-anonymity:

  • Manual Generalization

  • Automatic Generalization

Manual Generalization

Manual Generalization is an approach to mitigating re-identification attacks by blurring data. This means that an exact value is replaced by a less precise value, either by binning values in the case of numbers and dates, or truncating values in the case of text. The Privitar Platform allows a binning or truncation strategy to be specified as part of the Policy.
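As an illustration of what binning and truncation do to individual values, here is a small Python sketch. It does not use any Privitar API or Policy syntax; the helper names, the bin width and the choice of month-level date bins are assumptions made for the example.

```python
from datetime import date


def bin_number(value, width):
    """Replace an integer with the fixed-width bin that contains it,
    for example 37 with a width of 10 becomes "30-39"."""
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"


def bin_date_to_month(d):
    """Replace an exact date with its year and month only."""
    return d.strftime("%Y-%m")


def truncate_text(value, keep):
    """Keep only the first `keep` characters of a text value."""
    return value[:keep]


print(bin_number(37, 10))                      # 30-39
print(bin_date_to_month(date(2021, 3, 15)))    # 2021-03
print(truncate_text("SW1A 1AA", 3))            # SW1
```

Each function applies the same fixed rule to every value, which is the defining feature of a manually specified generalization.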

Minimum Cluster Size

When the Minimum Cluster Size option is enabled, Privitar drops any cluster of rows that is smaller than the minimum size. Alternatively, Automatic Generalization can be used to adjust the generalization parameters automatically, which avoids dropping rows.
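The effect of dropping small clusters can be sketched in a few lines of Python. This is an illustrative approximation of the behaviour described above, not Privitar's implementation; the function name and the sample rows are hypothetical.

```python
from collections import Counter


def drop_small_clusters(rows, quasi_identifiers, k):
    """Keep only the rows whose quasi-identifier combination is shared
    by at least k rows; rows in smaller clusters are removed."""

    def key(row):
        return tuple(row[col] for col in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] >= k]


# Hypothetical rows that have already been generalized manually.
generalized = [
    {"age_band": "30-39", "postcode_area": "SW1"},
    {"age_band": "30-39", "postcode_area": "SW1"},
    {"age_band": "40-49", "postcode_area": "N1"},
]

kept = drop_small_clusters(generalized, ["age_band", "postcode_area"], k=2)
print(len(kept))  # 2: the single-row cluster has been dropped
```

The output is k-anonymous by construction, at the cost of losing every record that fell into a cluster smaller than k.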

Automatic Generalization

Automatic Generalization is a second generalization approach supported by the Privitar Platform to mitigate re-identification risks. Like Manual Generalization, it blurs a selected set of attributes in order to defend against linkage attacks and re-identification of individuals in data sets.

In this context, blurring refers to transformations that remove detail from values, such as binning, preserving only the most frequent values for each column, or combining granular records into broader hierarchical categories.

Comparison of Manual and Automatic Generalization

When Automatic Generalization is configured, an algorithm is used to determine a data transformation that achieves k-anonymity and preserves data utility without dropping any records. It dynamically determines an appropriate degree of blurring for each record depending on the input data and the required level of privacy protection.

In contrast, Manual Generalization acts in a strictly uniform way, blurring all input data to the same degree, and can drop rows as required to achieve k-anonymity.

Manual Generalization may be the right choice if, for example, a fixed blurring is required in the output data, such as a specific numeric binning or rounding a decimal value to a given precision.
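To make the contrast concrete, the following Python sketch shows one simple way that a data-dependent generalization could work for a single numeric column. It is not the algorithm used by the Privitar Platform, which is not described here; the function name and the greedy grouping strategy are assumptions for illustration only. Values are sorted and split into consecutive groups of at least k, and each value is replaced by the range of its group, so dense regions receive narrow ranges, sparse regions receive wide ones, and no rows are dropped.

```python
from collections import Counter


def generalize_column(values, k):
    """Replace each value with a (low, high) range shared by at least k values.

    Illustrative greedy strategy (an assumption, not Privitar's algorithm):
    walk through the sorted values, cut a group after k values, keep equal
    values together, and merge a too-small remainder into the last group.
    """
    if not values:
        return []
    if len(values) < k:
        raise ValueError("fewer than k values: cannot reach k-anonymity without dropping rows")

    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    result = [None] * n
    start = 0
    while start < n:
        end = start + k
        # Never split a run of equal values across two groups.
        while end < n and values[order[end]] == values[order[end - 1]]:
            end += 1
        # If fewer than k values would be left over, absorb them here.
        if n - end < k:
            end = n
        group = order[start:end]
        label = (values[group[0]], values[group[-1]])
        for i in group:
            result[i] = label
        start = end
    return result


ages = [21, 22, 23, 24, 25, 60, 85]
ranges = generalize_column(ages, k=3)
print(ranges)
```

With k set to 3, the three youngest ages are replaced by the narrow range (21, 23), while the remaining four, which include the outliers 60 and 85, share the much wider range (24, 85). A uniform binning into fixed ten-year bands would instead leave 60 and 85 in clusters of size one, which would then have to be dropped.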

Job type compatibility

Enforcing k-anonymity constraints requires Privitar to process the dataset as a whole. For this reason, Policies with Automatic Generalization strategies can only be applied with Batch Jobs.