On December 17, 2024, the European Data Protection Board adopted and published Opinion 28/2024 on certain data protection aspects related to the processing of personal data in the context of AI models.

The Irish supervisory authority had asked the EDPB for an opinion on questions of general importance in connection with the processing of personal data in the development and deployment phase of AI models, in particular:

  • when and how an AI model can be considered “anonymous”
  • how controllers can demonstrate legitimate interest as a legal basis in the development and deployment phase, and
  • the consequences of unlawful processing in the development phase on the operation of the model.

The somewhat lengthy opinion neither intends to nor can answer these questions exhaustively; rather, it is meant to provide the supervisory authorities with a framework. It also leaves aside questions relating to particularly sensitive personal data, automated individual decisions, compatibility of purpose pursuant to Art. 6 para. 4 GDPR, data protection impact assessments and the principle of privacy by design.

Interesting and convincing is the EDPB's core statement that an AI model – even an LLM – is not anonymous per se, but that it must be checked on a case-by-case basis, according to the known criteria, whether personal data can be extracted or disclosed during operation.

It is also clear that the legally compliant development and use of an LLM is challenging. This is true in particular because of the documentation obligations and the accountability principle, but also because of the high requirements for transparency in the use of personal data and the responsibility of the operator (deployer) of an AI system: if a system is not anonymous, the deployer must appropriately verify that the system or model was not developed through unlawful processing. It may not be sufficient to rely on the provider's declaration of conformity required under the AI Act.

In the first section, the EDPB clarifies its understanding of certain terms such as first-party data (data collected directly) and third-party data (data collected by third parties). The understanding of AI systems (AIS) and AI models (AIM) is also touched on, but unfortunately without defining these terms in more detail with reference to the AI Act (see our FAQ). In the opinion, however, the EDPB only addresses models that are trained with personal data.

An LLM can contain personal data

On the hotly debated issue of whether an AIM, and in particular an LLM, contains personal data (see here), the EDPB says the following:

First of all, certain AIMs are designed to make statements about specific people – they are certainly not anonymous:

some AI models are specifically designed to provide personal data regarding individuals whose personal data were used to train the model, or in some way to make such data available. In these cases, such AI models will inherently (and typically necessarily) include information relating to an identified or identifiable natural person… these types of AI models cannot be considered anonymous. This would be the case, for example, (i) of a generative model fine-tuned on the voice recordings of an individual to mimic their voice; or (ii) any model designed to reply with personal data from the training when prompted for information regarding a specific person.

However, other AIMs that are not designed for such a purpose are not fundamentally anonymous either, because the extraction of personal training data cannot be ruled out. It therefore depends on the individual case. The decisive factor is whether personal information content can be extracted:

… for a SA to agree with the controller that a given AI model may be considered anonymous, it should check at least whether it has received sufficient evidence that, with reasonable means: (i) personal data, related to the training data, cannot be extracted out of the model; and (ii) any output produced when querying the model does not relate to the data subjects whose personal data was used to train the model.

This presumably requires an in-depth examination, taking into account the following factors in particular:

  • the characteristics of the training data, the AIM and the training procedure
  • the context of the publication or operation of the AIM
  • any accessible additional information that enables identification
  • the costs and time required to obtain such additional information
  • the available technology and technological developments
  • who has access to the AIM
  • measures to safeguard anonymity

The particularities of the AIM must be examined, first of all questions of design:

  • the input data used
  • the processing of this data, including any pseudonymization or filtering of personal data prior to training (see the sketch after this list)
  • the development procedure, in particular privacy-preserving techniques such as differential privacy
  • measures in the model itself that can help to reduce the extraction of personal data
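
To make the pseudonymization and filtering point more concrete, here is a minimal sketch of how directly identifying data could be masked in training text before it reaches the model. The regex patterns, placeholder tokens and the mask_pii helper are our own illustrative assumptions, not anything prescribed by the Opinion; real pipelines would rely on far more robust PII detection (named-entity recognition, dictionaries, human review).

```python
import re

# Illustrative patterns only; they are assumptions for this sketch and will
# miss many identifiers (names, addresses, indirect identifiers).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s/-]{7,}\d"),
    "IBAN":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def mask_pii(text: str) -> str:
    """Replace matched identifiers with placeholder tokens before training."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    sample = "Contact Jane Doe at jane.doe@example.com or +41 79 123 45 67."
    print(mask_pii(sample))
    # prints: Contact Jane Doe at [EMAIL] or [PHONE].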

At the level of governance at the developer, it must then be considered whether the measures taken have been robustly implemented and tested. Finally, it is also necessary to examine how the AIM was tested and, more generally, the developer's documentation; further information on their subject matter can be found in the Opinion.
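
As a rough illustration of what testing against the extraction of personal data might look like, the sketch below probes a model with prompts about individuals known to be in the training data and checks whether their identifiers reappear in the output. The generate callable and the probe records are placeholders invented for this sketch; serious testing (membership inference, extraction attacks, canary strings) is considerably more involved.

```python
from typing import Callable

def regurgitation_probe(
    generate: Callable[[str], str],
    probe_records: list[dict[str, str]],
) -> list[dict[str, str]]:
    """Return the probe records whose known identifiers reappear in model output."""
    leaks = []
    for record in probe_records:
        prompt = f"Tell me everything you know about {record['name']}."
        output = generate(prompt)
        # Naive check: does any known identifier literally resurface?
        if any(value.lower() in output.lower()
               for key, value in record.items() if key != "name"):
            leaks.append(record)
    return leaks

# Hypothetical usage; 'my_model_api' stands in for whatever interface the
# model under test exposes, and the record is fabricated test data:
# leaks = regurgitation_probe(my_model_api, [
#     {"name": "Test Person A", "email": "a@example.org"},
# ])
# A non-empty result is one indication that the model is not anonymous.
```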

Legitimate interest

The EDPB first recalls the general principles and requirements of the GDPR, insofar as a personal reference is not excluded, in particular the questions of transparency or information and purpose limitation. With regard to the legal basis of legitimate interest (Art. 6 para. 1 lit. f GDPR), the EDPB points out that it can a priori justify only those processing operations that are necessary to achieve the interest, which amounts to a proportionality test (see also ECJ, Case C‑621/22).

In the end, the interests must be weighed up, and here the references to the context of an LLM remain vague. However, the EDPB mentions the risks of large-scale training (it is thinking of scraping):

For example, large-scale and indiscriminate data collection by AI models in the development phase may create a sense of surveillance for data subjects, especially considering the difficulties to prevent public data from being scraped. This may lead individuals to self-censor, and present risks of undermining their freedom of expression […].

When it comes to use, the purpose of an AIM and then of an AIS must be taken into account; potentially sensitive are, for example, filter or recommender systems, systems that can impair access to work or have a discriminatory effect, and systems that are even used with malicious intent.

However, it must also be taken into account that an AIS can have a positive effect, for example if it removes harmful content or facilitates access to information.

The EDPB mentions other factors that should be taken into account, such as the type of data or its scope and the expectations of the data subjects, but remains rather vague. At least one point is interesting:

The expectations of those affected can be influenced by a privacy policy. The German Data Protection Conference probably takes a somewhat stricter view in its guidance on direct marketing (“the expectations of the data subject cannot be extended by the mandatory information provided for in the GDPR”). However, it is not necessarily sufficient to refer to the possibility of using personal data for training purposes in a privacy policy. For example, data subjects are not necessarily aware that personal data is used to adapt the responses of an AIS to their needs and to offer customized services – in other words, the EDPB expects a little more context in the data protection information.

Mitigation measures

Finally, the EDPB lists measures – some of which are redundant – that can reduce the risks for those affected:

  • Technical measures that ideally even create anonymity:
    • Measures at the level of input data and model design
    • Pseudonymization
    • Masking (replacement with fictitious data, e.g. fake names)
  • Measures to protect the rights of data subjects:
    • Time interval between collection and use of personal data
    • Opt-out right
    • Granting the right to erasure outside of Art. 17 GDPR
    • Measures to “unlearn” personal data
  • Transparency:
    • Additional information on data sources and selection
    • Information, e.g. also via media campaigns, visual presentations, FAQs and transparency reports
  • For web scraping:
    • Exclusion of sensitive data
    • Exclusion of data from sensitive websites
    • Automated consideration of objections to scraping
    • Time- and source-based restrictions on data collection
    • Opt-out right through corresponding lists
  • In operation:
    • Protection against the reproduction of personal data through filters (a minimal sketch follows after this list)
    • Protection against reuse (e.g. through watermarking)
    • Facilitation of data subject rights (deletion and removal of personal data)
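
To illustrate the filter idea referenced in the last group of measures, here is a minimal sketch of a deployment-side output filter that withholds responses reproducing known identifiers, for instance of data subjects who have objected. The blocklist, the placeholder message and the filter_output helper are purely illustrative assumptions; the Opinion does not prescribe any particular technique, and maintaining such a blocklist is itself processing of personal data.

```python
def filter_output(response: str, blocklist: set[str]) -> str:
    """Withhold model output if it reproduces any blocklisted identifier."""
    lowered = response.lower()
    for identifier in blocklist:
        if identifier.lower() in lowered:
            return "[Response withheld: it would reproduce personal data.]"
    return response

if __name__ == "__main__":
    # Fabricated example identifiers, not real personal data.
    blocked = {"jane.doe@example.com", "Jane Doe"}
    print(filter_output("Her address is jane.doe@example.com.", blocked))
    print(filter_output("The weather tomorrow looks sunny.", blocked))
```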

Effect of a lack of legal basis in the training phase

In a further section, the EDPB addresses the question of whether and how the lack of a legal basis in the training phase affects downstream operations. The EDPB distinguishes between different scenarios:

Scenario 1 – Use by the same controller:

  • If a controller unlawfully uses personal data for the development of an AIM and subsequently uses the data in the model itself, e.g. when providing the model, it must be asked on a case-by-case basis whether the development and operating phases have separate purposes and therefore constitute separate processing activities.
  • If they are considered separately, the unlawfulness of the first processing must be “taken into account” when examining the legitimate interest in the operational phase – the EDPB therefore does not impose a per se ban.

Scenario 2 – Further processing by another controller:

  • The roles and responsibilities of the parties must first be clearly defined and joint controllership must be examined. This must be contractually regulated.
  • The effect of unlawful processing during training must also be examined on a case-by-case basis. The second controller must appropriately verify (accountability) that the AIM was not developed through unlawful processing. In other words, the EDPB in practice imposes what is undoubtedly a demanding verification task on the provider’s customer. In particular, it may not be sufficient to rely only on the declaration of conformity required under the AI Act (!).

Scenario 3 – Unlawful development and subsequent anonymization and processing by the same or another controller:

  • If the model is truly anonymous, the GDPR does not apply to it.
  • The GDPR applies if personal data is subsequently processed again. However, any original unlawfulness does not affect this new processing.