Take-Aways (AI)
  • The HmbBfDI takes the view that Large Language Models (LLMs) do not store personal data; the mere storage of an LLM therefore does not constitute processing within the meaning of Art. 4 No. 2 GDPR.
  • Data subject rights (information, erasure, rectification) can relate to the input and output of AI systems, but not to the model itself.
  • Training and fine-tuning with personal data must comply with data protection law; unlawful training does not automatically affect the use of the model.
  • Other supervisory authorities take a different view; a case-by-case examination remains necessary, and the debate is still ongoing.

On July 15, 2024, the Hamburg Data Protection Authority (HmbBfDI) published a discussion paper entitled “Large Language Models and Personal Data” (Media release and PDF). The paper is intended as a contribution to the discussion, reflecting the current state of knowledge on the question of whether Large Language Models (LLMs) store personal data.

The basic theses are as follows:

1. The mere storage of an LLM does not constitute processing within the meaning of Art. 4 No. 2 GDPR, because no personal data is stored in LLMs. Insofar as personal data is processed in an LLM-supported AI system, the processing operations must comply with the requirements of the GDPR. This applies in particular to the output of such an AI system.

2. Because no personal data is stored in the LLM, the data subject rights under the GDPR cannot apply to the model itself. However, claims to information, erasure or rectification can at least relate to the input and output of an AI system of the responsible provider or operator.

3. The training of LLMs with personal data must be carried out in compliance with data protection law. The rights of the data subjects must also be observed. However, training that may have violated data protection law does not affect the legality of using such a model in an AI system.

The HmbBfDI first presents tokenization as the processing (in the technical sense) of training data, which is broken down into snippets and related to one another, represented by a mathematical function that constitutes both the “knowledge” of the LLM and the basis for its output. Accordingly, an LLM does not contain any personal data as such:

If the training data contains personal data, it undergoes a transformation in the machine learning process in which it is converted into abstract mathematical representations. This process of abstraction results in the concrete characteristics and references to specific individuals being lost and general patterns and correlations resulting from the training data as a whole being recorded instead.

This could also be illustrated as follows:
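
A minimal, self-contained Python sketch may make this concrete (a hypothetical toy vocabulary and a simple co-occurrence table standing in for a learned BPE vocabulary and billions of model weights; the subword splits are invented): after “training”, only numbers derived from token statistics remain, while the text itself is discarded.

```python
from collections import defaultdict

# Hypothetical toy vocabulary (real tokenizers learn tens of
# thousands of such subwords via byte-pair encoding).
VOCAB = {"Adri": 0, "an L": 1, "obs": 2, "iger": 3, " is": 4, " the": 5, " FDPIC": 6}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization into vocabulary IDs."""
    ids = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)), key=len, default=None)
        if match is None:
            text = text[1:]  # skip characters outside the toy vocabulary
            continue
        ids.append(VOCAB[match])
        text = text[len(match):]
    return ids

# "Training": record how often token j follows token i.
counts = defaultdict(int)
for sentence in ["Adrian Lobsiger is the FDPIC"]:
    ids = tokenize(sentence)
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] += 1

# What persists is only this table of numbers; the sentence is gone.
print(dict(counts))  # {(0, 1): 1, (1, 2): 1, (2, 3): 1, ...}
```

A real LLM condenses such statistics into learned weights rather than an explicit count table, but the structural point is the same: what persists is numbers, not text.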

The fact that an LLM relates tokens to one another, and that certain results are therefore more likely depending on the context, does not escape the HmbBfDI’s attention; in its view, however, such output is to a certain extent a new creation and not a reproduction. Furthermore, although there are privacy attacks that can make training data recognizable, it is “doubtful” that the recovered data can be considered personal data. Unlike IP addresses, for example, tokens are not identifiers. The relationship between the tokens is also only a statement about the linguistic function of the individual tokens; individual information cannot be inferred from it.
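
The claim that a token, unlike an IP address, identifies no one can be illustrated with a small, hedged sketch (invented example texts; a real vocabulary would contain tens of thousands of such subwords):

```python
# The same subword occurs in texts about entirely different subjects,
# so the token as such points to no particular individual.
texts = {
    "Adrian Lobsiger is the FDPIC": "a specific person",
    "The Adriatic Sea borders Italy": "a body of water",
    "Adriana is a popular first name": "no one in particular",
}
for text, subject in texts.items():
    assert "Adri" in text  # the subword token "Adri" appears in each
    print(f"'Adri' occurs in {text!r}, which concerns {subject}")
```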

Fine-tuned models are more likely to reproduce training data. However, the output of personal data is not “compelling evidence” that personal data has been stored as such; it could also be a coincidence. In addition, privacy attacks require at best a disproportionate effort and may constitute a prohibited means, so that identifiability is lacking.
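
What such a “privacy attack” can look like may be sketched conceptually as a membership inference test (a toy bigram model trained on invented sentences stands in for an LLM; the scoring and smoothing constants are illustrative assumptions):

```python
import math
from collections import defaultdict

def train_bigram(corpus: list[str]) -> dict:
    """Toy word-level bigram model standing in for an LLM."""
    counts, totals = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return {"counts": counts, "totals": totals}

def avg_log_prob(model: dict, sentence: str) -> float:
    """Average log-probability the model assigns to a candidate sentence."""
    words = sentence.split()
    logps = []
    for a, b in zip(words, words[1:]):
        # add-one smoothing over a nominal vocabulary of 100 words
        p = (model["counts"][(a, b)] + 1) / (model["totals"][a] + 100)
        logps.append(math.log(p))
    return sum(logps) / len(logps)

model = train_bigram(["Adrian Lobsiger is the FDPIC",
                      "the FDPIC supervises data protection"])

# Membership inference: a noticeably higher score suggests the candidate
# was part of the training data (a statistical guess, not an extraction).
print(avg_log_prob(model, "Adrian Lobsiger is the FDPIC"))   # higher (seen)
print(avg_log_prob(model, "Erika Mustermann is the FDPIC"))  # lower (unseen)
```

Even in this toy form, the attack yields only a statistical indication that a sentence was part of the training data; it does not extract the data as such.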

Because LLMs do not store any personal data, a data protection breach during training does not affect the legality of using the LLM. However, data protection must of course be observed when using an LLM and during training or fine-tuning.

The HmbBfDI’s statement is hardly the last word. The DSK, in any case, has left open the possibility (May 2024) that LLMs contain personal data, as did the BayLDA in a checklist from January 2024 and the Baden-Württemberg LfDI in 2023. In each case, however, there is no more detailed analysis, only the indication that LLMs may contain personal data and that a case-by-case examination is required. The HmbBfDI’s position is considerably more resolute.

Overall, however, the explanations appear well-founded. Clarifications of this kind are often sought in vain from the authorities; perhaps the fear that LLMs could otherwise be de facto prohibited also inspired the theses. However, the solution to this problem should, if anything, be found in appropriate exceptions to the data protection requirements, and the outcome of the investigations by the EDPB task force on OpenAI also remains open:

EDSA: Interim report of the task force on the OpenAI investigations

It would also be possible to argue the opposite. If personal data is stored in encrypted form in a secure environment, it cannot be readily accessed, and third-party attacks are not necessarily more likely than with an LLM; yet no one would claim that such data is not personal data. Decryption must of course be possible, and this is where the difference to an LLM lies: unlike with decryption, there is no recoverable 1:1 relationship between content and output.

However, an LLM does contain the statement (albeit in a complex form) that the tokens “Adri”, “an L”, “obs”, “iger” are more closely related to each other than, for example, “Adri”, “a”, “L”, “obs”, “ter”. The corresponding tokens cannot be extracted as such, and certainly not side by side, but they can still be queried, and the result of their statistical relationships can be determined. If you ask ChatGPT who the FDPIC is, the answer is “The current FDPIC is Adrian Lobsiger, who has held this office since 2016”. Such statements are newly generated and not a direct reproduction of training data, but that cannot be the decisive point. It does not change the fact that ChatGPT was evidently trained with the corresponding information and is therefore able to produce a corresponding statement in response to a prompt, i.e. to reproduce that information (whether the resulting statement is factually correct is irrelevant).

In other words, it can hardly matter whether a model stores the statement that Adrian Lobsiger is the FDPIC in simple text form or in a very indirect and complex form that is nevertheless capable of being output.
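
This can be made tangible with a minimal sketch (a toy next-token model over the subword splits quoted above, not ChatGPT’s actual mechanism): the stored successor statistics alone suffice to regenerate the statement when queried.

```python
from collections import defaultdict

# Toy subword tokens echoing the splits in the text above.
TOKENS = ["Adri", "an L", "obs", "iger", " is", " the", " FDPIC"]

# "Training": keep only successor statistics, not the sentence itself.
counts = defaultdict(int)
for a, b in zip(TOKENS, TOKENS[1:]):
    counts[(a, b)] += 1

def next_token(current: str) -> str | None:
    """Return the statistically most likely successor token, if any."""
    followers = {b: n for (a, b), n in counts.items() if a == current}
    return max(followers, key=followers.get) if followers else None

# "Querying": generation from the statistics alone reassembles the
# statement, although no text is stored anywhere in `counts`.
token, output = "Adri", ["Adri"]
while (token := next_token(token)) is not None:
    output.append(token)
print("".join(output))  # -> "Adrian Lobsiger is the FDPIC"
```

Nothing in `counts` is the sentence, yet a query starting at “Adri” deterministically reproduces it; which storage form the model uses makes no difference to the data subject.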