On December 17, 2024, the European Data Protection Board (EDPB) adopted Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models.
The Irish supervisory authority had asked the EDPB for an opinion on questions of general importance concerning the processing of personal data in the development and deployment phases of AI models, in particular:
- when and how an AI model can be considered “anonymous”
- how controllers can demonstrate legitimate interest as a legal basis in the development and deployment phases, and
- the consequences of unlawful processing in the development phase on the operation of the model.
The somewhat lengthy opinion neither intends to nor can answer these questions exhaustively; rather, it is meant to provide the supervisory authorities with a framework. It also leaves aside questions relating to particularly sensitive personal data, automated individual decisions, the compatibility of purposes pursuant to Art. 6 para. 4 GDPR, data protection impact assessments and the principle of data protection by design.
Interesting and convincing is the EDPB's core statement that an AI model – even an LLM – is not anonymous per se, but that it must be assessed on a case-by-case basis, according to the known criteria, whether personal data can be extracted from it or disclosed during operation.
It is also clear that the legally compliant development and use of an LLM is challenging. This is particularly true because of the documentation obligations and the accountability principle, but also because of the high requirements for transparency in the use of personal data and the responsibility of the operator (deployer) of an AI system: if a system is not anonymous, the deployer must appropriately verify that the system or model was not developed through unlawful processing. It may not be sufficient to rely on the provider's declaration of conformity required under the AI Act.
In the first section, the EDPB clarifies its understanding of certain terms such as first-party data (data collected directly) and third-party data (data collected by third parties). The understanding of AI systems (AIS) and AI models (AIM) is also touched on, but unfortunately without defining these terms in more detail with reference to the AI Act (see our FAQ). The opinion, however, only addresses models that are trained with personal data.
An LLM can contain personal data
On the hotly debated issue of whether an AIM, and in particular an LLM, contains personal data (see here), the EDPB says the following:
First of all, certain AIMs are designed to make statements about specific people – they are certainly not anonymous:
… some AI models are specifically designed to provide personal data regarding individuals whose personal data were used to train the model, or in some way to make such data available. In these cases, such AI models will inherently (and typically necessarily) include information relating to an identified or identifiable natural person… these types of AI models cannot be considered anonymous. This would be the case, for example, (i) of a generative model fine-tuned on the voice recordings of an individual to mimic their voice; or (ii) any model designed to reply with personal data from the training when prompted for information regarding a specific person.
However, AIMs that are not designed for such a purpose are not automatically anonymous either, because the extraction of personal training data cannot be ruled out. It therefore depends on the individual case. The decisive factor is whether the information content can be extracted:
… for a SA to agree with the controller that a given AI model may be considered anonymous, it should check at least whether it has received sufficient evidence that, with reasonable means: (i) personal data, related to the training data, cannot be extracted out of the model; and (ii) any output produced when querying the model does not relate to the data subjects whose personal data was used to train the model.
This presumably requires an in-depth examination taking into account the following factors in particular:
- the characteristics of the training data, the AIM and the training procedure
- the context of the publication or operation of the AIM
- any accessible additional information that enables identification
- the costs and time required to obtain such additional information
- the available technology and technological developments
- who has access to the AIM
- measures to safeguard anonymity
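What checking this "with reasonable means" could look like in practice is left open. Purely as an illustration – none of this is prescribed in the Opinion, and the `generate` stub and the sample records below are assumptions – a first, very rough regurgitation test might probe the model with prompts about data subjects known to be in the training data and flag outputs that reproduce their data:

```python
# Purely illustrative regurgitation probe; the model call and the sample
# records are placeholders and not taken from the EDPB Opinion.

def generate(prompt: str) -> str:
    # Stand-in for a call to the AI model under review (e.g. an LLM endpoint).
    return "I have no information about that person."

# Hypothetical sample of personal data known to be in the training set.
TRAINING_RECORDS = [
    {"name": "Jane Example", "email": "jane@example.org"},
]

# Prompts that an attacker using "reasonable means" might plausibly try.
PROBES = [
    "What is the email address of {name}?",
    "Write down everything you know about {name}.",
]

def probe_model(records, probes):
    """Flag outputs that reproduce a data subject's personal data verbatim."""
    findings = []
    for record in records:
        for template in probes:
            output = generate(template.format(name=record["name"]))
            if record["email"].lower() in output.lower():
                findings.append((record["name"], template))
    return findings

if __name__ == "__main__":
    hits = probe_model(TRAINING_RECORDS, PROBES)
    print(f"{len(hits)} probe(s) reproduced personal data from the training set")
```

A genuine assessment would of course go well beyond such naive string matching, for instance by using dedicated extraction and membership-inference attacks and by testing outputs against a much larger sample of the training data.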
The specific characteristics of the AIM must be examined, first of all questions of design:
- the input data used
- the processing of this data, including any pseudonymization or filtering of personal data prior to training (a minimal sketch follows this list)
- the development procedure, in particular privacy-preserving techniques such as differential privacy
- measures in the model itself that can help reduce the extraction of personal data
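The second of these points – pseudonymizing or filtering personal data before training – can be illustrated with a deliberately simplistic sketch. This is not taken from the Opinion: real pipelines typically rely on dedicated PII-detection tooling rather than the naive regular expressions assumed here.

```python
import re

# Minimal illustration of masking personal data before it enters a training
# corpus. The regexes are deliberately simplistic and only catch obvious
# patterns (e-mail addresses and phone-number-like strings).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s/-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace e-mail addresses and phone-number-like strings with fixed
    placeholders so they do not end up in the training data."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.org or +41 44 123 45 67."
    print(mask_pii(sample))
    # -> Contact Jane at [EMAIL] or [PHONE].
```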
At the level of the developer's governance, it must then be considered whether the measures taken have been robustly implemented and tested. Finally, it is also necessary to examine how the AIM has been tested and, more generally, the developer's documentation; further details on its required content can be found in the Opinion.
Legitimate interest
The EDPB first recalls the general principles and requirements of the GDPR, insofar as a personal reference cannot be ruled out, in particular the questions of transparency or information and of purpose limitation. With regard to the legal basis of legitimate interest (Art. 6 para. 1 lit. f GDPR), the EDPB points out that it can a priori justify only those processing operations that are necessary to achieve the interest, which amounts to a proportionality test (see also ECJ, Case C‑621/22).
Finally, the interests must be weighed against each other, and here the references to the context of an LLM remain vague. However, the EDPB mentions risks in large-scale training (it has scraping in mind):
For example, large-scale and indiscriminate data collection by AI models in the development phase may create a sense of surveillance for data subjects, especially considering the difficulties to prevent public data from being scraped. This may lead individuals to self-censor, and present risks of undermining their freedom of expression […].
In the deployment phase, the purpose of the AIM and then of the AIS must be taken into account; potentially sensitive are, for example, filter or recommender systems, systems that can impair access to work or have a discriminatory effect, and systems that are even used with malicious intent.
However, it must also be taken into account that an AIS can have a positive effect, for example if it removes harmful content or facilitates access to information.
The EDPB mentions other factors that should be taken into account, such as the type and scope of the data and the expectations of the data subjects, but remains rather vague. At least one point is interesting:
The expectations of the data subjects can be influenced by a privacy policy. The German Data Protection Conference probably takes a somewhat stricter view in its guidance on direct marketing under the GDPR (“the expectations of the data subject cannot be extended by the mandatory information provided for in the GDPR”). However, it is not necessarily sufficient for a privacy policy merely to refer to the possibility of using personal data for training purposes. For example, data subjects are not necessarily aware that personal data is used to adapt the responses of an AIS to their needs and to offer customized services – in other words, the EDPB expects a little more context in the data protection information.
Mitigation measures
Finally, the EDPB lists measures – some of which are redundant – that can reduce the risks for those affected:
- Technical measures that ideally even create anonymity
- Measures at the level of input data and model design:
  - Pseudonymization
  - Masking (replacement with fictitious data, e.g. fake names)
- Measures to protect the rights of data subjects:
  - A time interval between the collection and the use of personal data
  - An opt-out right
  - Granting a right to erasure beyond the scope of Art. 17 GDPR
  - Measures to “unlearn” personal data
- Transparency:
  - Additional information on data sources and their selection
  - Information, e.g. also via media campaigns, visual presentations, FAQs and transparency reports
- For web scraping:
  - Exclusion of sensitive data
  - Exclusion of data from sensitive websites
  - Automated consideration of objections to scraping (see the sketch after this list)
  - Time- and source-based restrictions on data collection
  - An opt-out right through corresponding lists
- In operation:
  - Protection against the reproduction of personal data through filters
  - Protection against reuse (e.g. through watermarking)
  - Facilitation of data subject rights (deletion and removal of personal data)
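For the automated consideration of objections to scraping, one common technical expression of such an objection is the robots.txt exclusion standard – the Opinion does not prescribe this specific mechanism, so the following is only an assumed illustration of how a training-data crawler could honour it using Python's standard library:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def may_scrape(url: str, user_agent: str = "example-training-crawler") -> bool:
    """Check the site's robots.txt before collecting a page as training data.
    A production crawler would also cache robots.txt, handle network errors
    more carefully and honour other opt-out mechanisms."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()  # fetches and parses the site's robots.txt
    except OSError:
        return False  # if the opt-out signal cannot be read, err on the safe side
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(may_scrape("https://www.example.com/some/page"))
```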
Effect of a lack of legal basis in the training phase
In a further section, the EDPB addresses the question of whether and how the lack of a legal basis – in the training phase – affects downstream operations. The EDPB distinguishes between different scenarios:
Scenario 1 – Use by the same controller:
- If a controller unlawfully processes personal data for the development of an AIM and subsequently processes the personal data retained in the model itself, e.g. when providing the model, it must be determined on a case-by-case basis whether the development and deployment phases pursue separate purposes and therefore constitute separate processing activities.
- If they are considered separately, the unlawfulness of the initial processing must be “taken into account” when examining the legitimate interest for the deployment phase – the EDPB therefore does not impose a per se ban.
Scenario 2 – Further processing by another controller:
- The roles and responsibilities of the parties must first be clearly defined, and possible joint controllership must be examined; this must be contractually regulated.
- The effect of unlawful processing during training must also be examined on a case-by-case basis. The second controller must appropriately verify (accountability) that the AIM was not developed through unlawful processing. In other words, the EDPB imposes what is in practice undoubtedly a demanding verification task on the provider's customer. In particular, it may not be sufficient to rely only on the declaration of conformity required under the AI Act (!).
Scenario 3 – Unlawful development and subsequent anonymization and processing by the same or another controller:
- If the model is truly anonymous, the GDPR does not apply to it.
- The GDPR applies if personal data is subsequently processed again. However, any original unlawfulness does not affect this new processing.