Automatic Generalization Advanced Settings
Automatic Generalization with a minimum cluster size of k ensures that, for any combination of quasi-identifiers that exists in the data, the corresponding cluster contains at least k records. This property makes the output data resistant to re-identification through linkage attacks. Informally, each individual is protected from being re-identified by being hidden in the crowd.
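The minimum cluster size property can be checked on an output dataset with a few lines of code. The following is a minimal sketch rather than the platform's implementation: the function names, the column names and the sample records are all hypothetical, and records are assumed to be plain dictionaries.

```python
from collections import Counter

def smallest_cluster(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values(), default=0)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifiers present in the data has >= k records."""
    return smallest_cluster(records, quasi_identifiers) >= k

# Hypothetical generalized output: two clusters of two records each.
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
]

k2 = is_k_anonymous(records, ["age", "zip"], 2)
k3 = is_k_anonymous(records, ["age", "zip"], 3)
```

With these sample records the check passes for k = 2 and fails for k = 3, because no cluster holds more than two records.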
However, there remains a potential vulnerability in k-anonymous data: columns that contain sensitive information, such as medical diagnoses, salaries or debts, may not be sufficiently diverse. That is, a cluster matching a given combination of quasi-identifiers may contain very few distinct values in the sensitive column, or only one. This may constitute a leak of sensitive information because, even if an adversary is prevented from knowing which exact record corresponds to an individual, the adversary nonetheless learns that individual's sensitive attribute. Hiding in the crowd is insufficient if the whole crowd has the same secret value.
In the platform, specifying an L-diversity value of 2 or more mitigates this issue for the specified sensitive columns. For each of these columns, the algorithm ensures that there are at least L distinct values in every cluster. In this way, no cluster can reveal a single sensitive value shared by all of its records.
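The difference between the two guarantees can be seen by counting distinct sensitive values per cluster. The sketch below is illustrative only: the helper names, column names and records are hypothetical, and it checks the property on finished output rather than reproducing how the platform enforces it.

```python
from collections import defaultdict

def distinct_sensitive_values(records, quasi_identifiers, sensitive):
    """Map each cluster (tuple of quasi-identifier values) to its count of distinct sensitive values."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return {key: len(values) for key, values in groups.items()}

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every cluster contains at least l distinct values in the sensitive column."""
    per_cluster = distinct_sensitive_values(records, quasi_identifiers, sensitive)
    return all(n >= l for n in per_cluster.values())

# Hypothetical 2-anonymous data: the second cluster has only one diagnosis.
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
]

diverse = is_l_diverse(records, ["age", "zip"], "diagnosis", 2)
```

Here every cluster has two records, yet the check for L = 2 fails: anyone known to be in the 40-49 cluster is revealed to have the value "flu".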
When L-diversity is enabled, the process of generalization might introduce an unbalanced distribution of sensitive values. The C-Ratio setting ensures that the sensitive values in each cluster are reasonably balanced. It requires that the combined size of all but the largest L-1 categories be larger than the size of the largest category divided by the C-Ratio. The effect is that, as well as there being at least L categories, the largest category does not dominate. To understand why this balance is desirable, consider a cluster with a sensitive Boolean column in which the data show one "Yes" and 99,999 "No" values. This very unbalanced ratio allows an adversary to deduce with 99.999% probability that an individual in that cluster has the value "No".
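The C-Ratio condition can be written as a short check on the category counts of a single cluster. This sketch follows the wording above literally; the function name is hypothetical, and whether the platform uses a strict inequality or handles ties differently is an assumption.

```python
def satisfies_c_ratio(category_counts, l, c):
    """Check the C-Ratio condition for one cluster.

    category_counts maps each sensitive value to its number of records.
    The combined size of all but the largest l-1 categories must exceed
    the size of the largest category divided by c.
    """
    sizes = sorted(category_counts.values(), reverse=True)
    if not sizes:
        return False
    largest = sizes[0]
    tail = sum(sizes[l - 1:])  # everything except the l-1 largest categories
    # tail > largest / c, rearranged to avoid division
    return tail * c > largest

# The unbalanced Boolean example from the text, with hypothetical settings L=2, C-Ratio=3.
unbalanced = satisfies_c_ratio({"Yes": 1, "No": 99_999}, l=2, c=3)
balanced = satisfies_c_ratio({"Yes": 40, "No": 60}, l=2, c=3)
```

For L = 2 the tail is simply every category except the largest, so the unbalanced cluster fails (1 is not larger than 99,999 / 3) while the 40/60 cluster passes (40 is larger than 60 / 3).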
The Bins setting affects the way numeric variables are processed when selected as sensitive columns. Numeric values are first binned into the specified number of bins, then L-diversity is applied as above, with each bin being treated as a sensitive category.
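As an illustration of binning, the sketch below maps numeric values to bin indices that can then be treated as sensitive categories. Equal-width bins over the observed range are an assumption made for this example; the text does not say which binning scheme the platform uses, and the function name and salary values are hypothetical.

```python
def equal_width_bin(value, lo, hi, bins):
    """Return the index (0 .. bins-1) of the equal-width bin containing value."""
    if hi == lo:
        return 0  # all values identical: a single bin
    index = int((value - lo) / (hi - lo) * bins)
    return min(index, bins - 1)  # the maximum value falls into the last bin

# Hypothetical sensitive numeric column within one cluster.
salaries = [20_000, 35_000, 50_000, 80_000, 100_000]
lo, hi = min(salaries), max(salaries)

labels = [equal_width_bin(s, lo, hi, bins=4) for s in salaries]
distinct_bins = len(set(labels))  # this count is what L-diversity would compare against L
```

Although the five salaries are all different, they occupy only three of the four bins, so it is the number of occupied bins, not the number of distinct raw values, that counts toward L.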