The enigma of applying the GDPR to LLMs

by Brenden Burgess


In the digital age, data confidentiality is an essential concern, and regulations such as the General Data Protection Regulation (GDPR) aim to protect individuals' personal data. However, the advent of large language models (LLMs) such as GPT-4, BERT, and the like poses significant challenges to the application of the GDPR. These models, which generate text by predicting the next token based on patterns learned from vast quantities of training data, intrinsically complicate the regulatory landscape. Here is why applying the GDPR to LLMs is practically impossible.

The nature of LLMs and data storage

To understand the enforcement dilemma, it is essential to understand how LLMs work. Unlike traditional databases, where data is structured, LLMs operate differently. They are trained on massive datasets, and through this training they adjust millions, even billions, of parameters (weights and biases). These parameters capture complex patterns and knowledge from the data but do not store the data itself in a recoverable form.

When an LLM generates text, it does not access a database of stored phrases or sentences. Instead, it uses its learned parameters to predict the most likely next word in a sequence. This process is similar to the way a human might generate text based on learned language patterns rather than by recalling exact sentences from memory.
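To make this concrete, below is a minimal sketch of next-token prediction using the publicly available GPT-2 model via the Hugging Face transformers library (the prompt is an arbitrary example). The model emits a probability distribution over its vocabulary; nothing in this process retrieves a stored sentence.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    inputs = tok("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

    # A probability distribution over ~50,000 candidate next tokens,
    # computed from learned weights rather than looked up in storage.
    probs = logits[0, -1].softmax(dim=-1)
    top = torch.topk(probs, 5)
    for p, i in zip(top.values, top.indices):
        print(f"{tok.decode(i)!r}: {p:.3f}")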

The right to be forgotten

One of the cornerstone rights under the GDPR is the “right to be forgotten,” which allows individuals to request the deletion of their personal data. In traditional data storage systems, this means locating and erasing specific data entries. With LLMs, however, identifying and deleting specific personal data embedded in the model's parameters is practically impossible. The data is not explicitly stored but is instead diffused across countless parameters in a way that cannot be individually accessed or modified.
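The contrast with a conventional database is stark. There, honoring an erasure request is a single, well-defined operation; here is a hedged sketch using Python's built-in sqlite3 module, with a hypothetical table and records:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'Alice', 'alice@example.com')")

    # Erasure in a traditional system: locate the record and delete it.
    conn.execute("DELETE FROM users WHERE id = ?", (1,))
    conn.commit()

    # No analogous operation exists for an LLM: there is no row holding
    # 'Alice' to delete, only billions of weights her data faintly influenced.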

Data erasure and model retraining

Even if it were theoretically possible to identify specific data points within an LLM, erasing them would be another monumental challenge. Deleting data from an LLM would require retraining the model, an expensive and lengthy process. Retraining from scratch to exclude certain data would demand the same extensive resources initially used, including computing power and time, which makes it impracticable.
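A back-of-the-envelope calculation illustrates the scale. A common approximation puts training cost at about 6 floating-point operations per parameter per training token; applying it with illustrative GPT-3-scale figures (all numbers below are assumptions, not measurements):

    # Rough retraining cost via the common ~6 * params * tokens FLOPs estimate.
    params = 175e9      # assumed model size (GPT-3 scale)
    tokens = 300e9      # assumed number of training tokens
    flops = 6 * params * tokens            # ~3.15e23 FLOPs per full retrain

    gpu_flops = 312e12  # assumed peak throughput of one modern accelerator
    utilization = 0.3   # assumed realistic fraction of peak actually achieved
    seconds = flops / (gpu_flops * utilization)
    print(f"total cost: {flops:.2e} FLOPs")
    print(f"single-GPU time: {seconds / 86400 / 365:.0f} GPU-years")
    # On the order of a century of single-GPU compute for one erasure request.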

Data anonymization and minimization

The GDPR also emphasizes data anonymization and minimization. Although LLMs can be trained on anonymized data, guaranteeing complete anonymization is difficult. Anonymized data can sometimes still reveal personal information when combined with other data, leading to potential re-identification. In addition, LLMs need large amounts of data to perform well, which conflicts with the principle of data minimization.
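Re-identification by linkage is easy to demonstrate. The toy sketch below joins an “anonymized” record to a public directory on shared quasi-identifiers (ZIP code, birth date, sex); every record here is fabricated for illustration:

    # 'Anonymized' medical data: names removed, quasi-identifiers kept.
    medical = [
        {"zip": "02138", "birth": "1945-07-21", "sex": "F", "diagnosis": "..."},
    ]
    # A separate public dataset (e.g., a voter roll) sharing those fields.
    voters = [
        {"zip": "02138", "birth": "1945-07-21", "sex": "F", "name": "Jane Doe"},
        {"zip": "02139", "birth": "1962-01-03", "sex": "M", "name": "John Roe"},
    ]

    keys = ("zip", "birth", "sex")
    for record in medical:
        matches = [v for v in voters if all(v[k] == record[k] for k in keys)]
        if len(matches) == 1:  # a unique match re-identifies the record
            print(f"Re-identified {matches[0]['name']}: {record['diagnosis']}")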

Lack of transparency and explainability

Another GDPR requirement is the ability to explain how personal data is used and how decisions are made. LLMs, however, are often called “black boxes” because their decision-making processes are not transparent. Understanding why a model generated a particular text involves deciphering complex interactions among many parameters, a task beyond current technical capabilities. This lack of explainability hinders compliance with the GDPR's transparency requirements.
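Researchers do probe these black boxes, for example with perturbation-based attribution: occlude one input token at a time and observe how the prediction shifts. The sketch below (model, prompt, and occlusion token are illustrative choices) surfaces correlations, but it does not explain the underlying parameter interactions:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    ids = tok("Alice lives in Paris and works as a", return_tensors="pt").input_ids

    with torch.no_grad():
        base = model(ids).logits[0, -1].softmax(dim=-1)
    target = base.argmax()  # the model's top next-token prediction

    # Replace each token in turn and measure the drop in the target's probability.
    for i in range(ids.shape[1]):
        occluded = ids.clone()
        occluded[0, i] = tok.eos_token_id  # crude occlusion; GPT-2 has no pad token
        with torch.no_grad():
            p = model(occluded).logits[0, -1].softmax(dim=-1)[target]
        print(f"{tok.decode(ids[0, i]):>10}  drop in p = {base[target] - p:+.4f}")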

Moving forward: regulatory and technical adaptations

Given these challenges, enforcing the GDPR on LLMs requires both regulatory and technical adaptations. Regulators must develop guidelines that account for the unique nature of LLMs, potentially focusing on the ethical use of AI and the implementation of robust data protection measures during model training and deployment.

Technologically, advances in model interpretability and control could aid compliance. Techniques for making LLMs more transparent and methods for tracing the provenance of data within models are active research areas. In addition, differential privacy, which guarantees that adding or removing a single data point does not significantly affect the model's output, could be a step toward aligning LLM practices with GDPR principles.
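As one concrete direction, here is a minimal sketch of differentially private SGD (per-example gradient clipping plus Gaussian noise) on a deliberately tiny stand-in model; all hyperparameters are illustrative, and in practice a dedicated library such as Opacus handles this bookkeeping:

    import torch
    from torch import nn

    # Tiny stand-in model and random data; a real LLM would be vastly larger.
    model = nn.Linear(10, 2)
    loss_fn = nn.CrossEntropyLoss()
    xs, ys = torch.randn(32, 10), torch.randint(0, 2, (32,))

    clip_norm, noise_mult, lr = 1.0, 1.0, 0.1

    # One DP-SGD step: clip each example's gradient, then add calibrated noise.
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = min(1.0, clip_norm / (norm.item() + 1e-6))  # per-example clipping
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            noise = torch.normal(0.0, noise_mult * clip_norm, size=g.shape)
            p -= lr * (g + noise) / len(xs)  # noisy, averaged gradient step

The added noise ties the trained model only loosely to any single training example, which is precisely the kind of guarantee the GDPR's data protection principles call for.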
