- The HmbBfDI takes the view that Large Language Models (LLMs) do not store personal data, so the mere storage of a model does not constitute processing within the meaning of Art. 4 No. 2 GDPR.
- Data subject rights (information, erasure, rectification) can relate to the input and output of AI systems, but not to the model itself.
- Training and fine-tuning with personal data must comply with data protection law; unlawful training does not automatically render the use of the model unlawful.
- Other supervisory authorities take a different view; a case-by-case examination remains necessary and the debate is still ongoing.
On July 15, 2024, the Hamburg Data Protection Authority (HmbBfDI) published a discussion paper entitled “Large Language Models and Personal Data” (media release and PDF). The paper is intended as a contribution to the debate, reflecting the current state of knowledge on the question of whether Large Language Models (LLMs) store personal data.
The basic theses are as follows:
1. The mere storage of an LLM does not constitute processing within the meaning of Art. 4 No. 2 GDPR, because no personal data is stored in LLMs. Insofar as personal data is processed in an LLM-supported AI system, the processing operations must comply with the requirements of the GDPR. This applies in particular to the output of such an AI system.
2. Since no personal data is stored in the LLM, the data subject rights under the GDPR cannot apply to the model itself. Claims to information, erasure or rectification can, however, at least relate to the input and output of an AI system of the responsible provider or operator.
3. The training of LLMs with personal data must be carried out in compliance with data protection law, and the rights of the data subjects must be observed. However, training that may have violated data protection does not affect the legality of using such a model in an AI system.
The HmbBfDI first presents tokenization as the (technical) processing of training data: the data is broken down into snippets that are related to each other, represented by a mathematical function that constitutes both the “knowledge” of the LLM and the basis for its output. Accordingly, an LLM does not contain any personal data as such:
If the training data contains personal data, it undergoes a transformation in the machine learning process in which it is converted into abstract mathematical representations. As a result of this abstraction, the concrete characteristics and references to specific individuals are lost, and instead general patterns and correlations arising from the training data as a whole are captured.
This could also be illustrated as follows:

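A minimal toy sketch (hypothetical and not taken from the paper; the vocabulary and the split of the name are illustrative assumptions) shows a name in the training data dissolving into mere token IDs:

```python
# Toy illustration: a name in the training data is split into subword
# tokens and replaced by integer IDs; the model never keeps the string.
# The vocabulary and the split of the name are illustrative assumptions.
vocab = {"Adri": 0, "an L": 1, "obs": 2, "iger": 3,
         " is": 4, " the": 5, " FDPIC": 6}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization against the toy vocabulary."""
    ids = []
    while text:
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no token for: {text!r}")
    return ids

print(tokenize("Adrian Lobsiger is the FDPIC"))
# [0, 1, 2, 3, 4, 5, 6]: what the model "sees" is a sequence of IDs;
# training then adjusts numerical weights over such sequences, not
# records about a person.
```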
The HmbBfDI does not overlook the fact that an LLM relates tokens to each other and that certain outputs are therefore more likely depending on the context; in its view, however, such output is to a certain extent a new creation and not a reproduction. Furthermore, although there are privacy attacks that can make training data recognizable, it is “doubtful” whether the results can be considered personal data. Unlike IP addresses, for example, tokens are not identifiers. The relationships between tokens are, moreover, merely statements about the linguistic function of the individual tokens; individual information cannot be inferred from them.
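The point that tokens, unlike IP addresses, are not identifiers can be made concrete with a tiny sketch (the toy vocabulary is again an illustrative assumption): a token ID names a character sequence, and the same ID appears wherever that sequence appears, regardless of whom the surrounding text refers to.

```python
# A token ID denotes a character sequence, not a person: the same ID
# occurs in entirely unrelated contexts. (Illustrative toy vocabulary.)
vocab = {"Adri": 0, "an": 1, "atic": 2}

print([vocab[t] for t in ("Adri", "an")])    # "Adrian"   -> [0, 1]
print([vocab[t] for t in ("Adri", "atic")])  # "Adriatic" -> [0, 2]
# Token 0 by itself points to the string "Adri"; an IP address, by
# contrast, points to a specific connection and hence to a subscriber.
```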
Fine-tuned models are more likely to reproduce training data. However, the output of personal data is not “compelling evidence” that personal data has been stored as such; it could also be a coincidence. In addition, privacy attacks require an effort that is at best disproportionate and may constitute a prohibited means, so that identifiability is lacking.
Because LLMs do not store any personal data, a data protection breach during training does not affect the legality of the use of the LLM. However, when using an LLM and during training or fine-tuning, data protection must of course be observed.
The HmbBfDI’s statement is hardly the last word. The DSK, in any case, left open the possibility (May 2024) that LLMs contain personal data, as did the BayLDA in a checklist from January 2024 and the LfDI of Baden-Württemberg in 2023. In each case, however, there is no more detailed analysis, only the indication that LLMs may contain personal data and that a case-by-case examination is required. The HmbBfDI’s position is considerably more resolute.
Overall, however, the explanations appear well-founded. One often looks in vain for this kind of clarification from authorities; perhaps the fear that LLMs could otherwise be de facto prohibited also inspired the theses. The solution to that problem, however, would at best have to be found through appropriate exceptions to data protection requirements, and the outcome of the investigations by the EDPB task force on OpenAI also remains open:
EDPB: Interim report of the task force on the OpenAI investigations
It would also be possible to argue the opposite. If personal data is stored in encrypted form in a secure environment, it cannot be readily accessed, and attacks by third parties are not necessarily more likely than with an LLM; yet no one would claim that it is not personal data. Decryption must of course be possible, and this is where the difference to an LLM lies: unlike with decryption, there is no recoverable 1:1 relationship between content and output. However, an LLM does contain the statement, albeit in a complex form, that the tokens “Adri”, “an L”, “obs”, “iger” are more closely related to each other than, for example, “Adri”, “a”, “L”, “obs”, “ter”. The corresponding tokens cannot be extracted as such, and certainly not side by side, but they can still be queried and the result of their statistical relationships retrieved.

If you ask ChatGPT who the FDPIC is, the answer is “The current FDPIC is Adrian Lobsiger, who has held this office since 2016”. Such statements are newly generated and are not a direct reproduction of training data, but that cannot be the decisive point. It remains the case that ChatGPT was evidently trained with the corresponding information and is therefore able to produce a corresponding statement in response to a prompt, i.e. to reproduce that information (whether the resulting statement is factually correct is irrelevant). In other words, it can hardly matter whether a model stores the statement that Adrian Lobsiger is the FDPIC in plain text form or in a very indirect and complex, yet output-capable, form.
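To make this counterargument tangible, here is a minimal sketch (hypothetical; the toy corpus, the token split and the greedy decoding are illustrative assumptions, not a claim about how ChatGPT works internally): a bigram model stores nothing but conditional token frequencies, yet when queried it regenerates the name token by token.

```python
from collections import Counter, defaultdict

# One tokenized training sentence, split along the tokens discussed above.
corpus = ["The", " FDPIC", " is", " Adri", "an L", "obs", "iger"]

# "Training": count which token follows which; this is the model's
# entire state. No contiguous name string is stored anywhere.
follows: defaultdict[str, Counter] = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follows[a][b] += 1

def generate(token: str, steps: int) -> str:
    """Greedy decoding: always pick the most frequent successor token."""
    out = [token]
    for _ in range(steps):
        successors = follows[out[-1]].most_common(1)
        if not successors:
            break
        out.append(successors[0][0])
    return "".join(out)

print(generate("The", 6))  # "The FDPIC is Adrian Lobsiger"
# The name exists only as statistical relations between fragments,
# yet it can be queried and output in full.
```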