The UK data protection supervisory authority, the ICO, has published a new version of its comprehensive background paper on “Big data, artificial intelligence, machine learning and data protection”. The paper is not conclusive on one of the main questions in this context, namely anonymization (and thus also on the question of when data qualifies as personal data), but its treatment is quite differentiated:
“Some commentators have pointed to examples of where it has apparently been possible to identify individuals in anonymised datasets, and so concluded that anonymization is becoming increasingly ineffective in the world of big data. On the other hand, Cavoukian and Castro have found shortcomings in the main studies on which this view is based. A recent MIT study looked at records of three months of credit card transactions for 1.1 million people and claimed that, using the dates and locations of four purchases, it was possible to identify 90 percent of the people in the dataset. However, Khaled El Emam has pointed out that, while the researchers were able to identify unique patterns of spending, they did not actually identify any individuals. He also suggested that in practice access to a dataset such as this would be controlled and also that the anonymization techniques applied to the dataset were not particularly sophisticated and could have been improved.
It may not be possible to establish with absolute certainty that an individual cannot be identified from a particular dataset, taken together with other data that may exist elsewhere. The issue is not about eliminating the risk of re-identification altogether, but whether it can be mitigated so it is no longer significant. Organizations should focus on mitigating the risks to the point where the chance of re-identification is extremely remote. The range of datasets available and the power of big data analytics make this more difficult, and the risk should not be underestimated. But that does not make anonymization impossible or ineffective.”
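El Emam's critique turns on the difference between a spending pattern being *unique* within a dataset and an individual actually being *identified*. The measure behind the MIT claim is often called "unicity": the fraction of people whose records are singled out by a handful of known transaction points. A minimal sketch on synthetic data can illustrate the idea (the dataset, sizes, and the `unicity` helper are purely illustrative assumptions, not the study's actual data or method):

```python
import random

random.seed(0)

# Synthetic dataset: person -> set of (day, merchant) transaction points.
# 1,000 people, 20 transactions each, drawn from 90 days x 50 merchants.
people = {
    p: {(random.randrange(90), random.randrange(50)) for _ in range(20)}
    for p in range(1000)
}

def unicity(people, k, trials=200):
    """Estimate the fraction of individuals whose k known transaction
    points match no one else in the dataset (i.e. single them out)."""
    hits = 0
    ids = list(people)
    for _ in range(trials):
        target = random.choice(ids)
        # An "attacker" learns k of the target's transaction points...
        known = set(random.sample(sorted(people[target]), k))
        # ...and checks how many records contain all of them.
        matches = [p for p, pts in people.items() if known <= pts]
        if matches == [target]:
            hits += 1
    return hits / trials

print(f"unicity with 4 known points: {unicity(people, k=4):.2f}")
```

Even when this fraction is high, it shows only that a record is distinguishable within the dataset; linking that record to a named person requires outside information, which is exactly the gap El Emam points to and which controlled access to the dataset is meant to keep closed.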