Introduction: The challenge of memorization in language models
Modern language models face increasing scrutiny over their memorization behavior. With models such as an 8-billion-parameter transformer trained on 15 trillion tokens, researchers question whether these models memorize their training data in any meaningful way. Common techniques, including data extraction and membership inference, fall short because they often fail to distinguish memorization from generalization.
Limits of existing approaches
Prior frameworks, such as extraction-based methods or differential privacy, operate at the dataset level rather than accounting for memorization of specific datapoints. Language modeling as compression and capacity estimates based on fact memorization (as in RNNs and quantized transformers) offer partial insight but lack the scalability and precision needed for deep transformer architectures.
A new approach to measure memorization
Researchers from FAIR at Meta, Google DeepMind, Cornell University, and NVIDIA have proposed a new method for estimating how much a model "knows" about specific datapoints, in order to measure the capacity of modern language models. They separate memorization into two components: unintended memorization, which represents the information a model contains about a dataset, and generalization, which captures information about the true data-generating process. Subtracting generalization from total memorization yields precise estimates of model capacity, showing that GPT-family models have an approximate capacity of 3.6 bits per parameter. By training hundreds of transformer language models, the researchers also developed a series of scaling laws that relate model capacity and dataset size to membership inference.
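In compression terms, the decomposition can be sketched as follows (the notation here is mine, not the paper's exact formulation): the number of bits a model p needs to encode a sample x is -log2 p(x), and unintended memorization is roughly how many fewer bits the trained model needs than an oracle reference model that captures only the true data-generating process:

```latex
% Code length of a sample x under a model p, in bits:
%   H_p(x) = -\log_2 p(x)
% Unintended memorization: the compression advantage of the trained
% model p_theta over an oracle/reference model p_ref, clipped at zero.
\mathrm{mem}_{\mathrm{unintended}}(x) \;\approx\;
  \max\Bigl(0,\;\bigl[-\log_2 p_{\mathrm{ref}}(x)\bigr]
           \;-\;\bigl[-\log_2 p_{\theta}(x)\bigr]\Bigr)
```

Summing this quantity over a dataset and dividing by the parameter count is, roughly, what yields the bits-per-parameter capacity figures reported below.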
Experimental framework and training methodology
Using the GPT-2 architecture, the team trained hundreds of models ranging from 100K to 20M parameters, with varying depths (1-8 layers) and hidden sizes (32-512); a configuration sketch follows the list below. Training involved:
- 10^6 training steps
- Batch size: 2048
- Precision: bfloat16
- Hardware: a single A100 GPU
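A minimal sketch of how such a configuration grid might be instantiated with the Hugging Face transformers library (the grid values come from the article; the specific API calls and head-count choice are illustrative assumptions, not the authors' code):

```python
# Sketch: instantiate a grid of small GPT-2-style models spanning the
# depth and hidden-size ranges described above. Assumes the Hugging Face
# `transformers` library; the authors' actual training code may differ.
from transformers import GPT2Config, GPT2LMHeadModel

depths = [1, 2, 4, 8]                   # layer counts from the article
hidden_sizes = [32, 64, 128, 256, 512]  # hidden dims from the article

for n_layer in depths:
    for n_embd in hidden_sizes:
        config = GPT2Config(
            n_layer=n_layer,
            n_embd=n_embd,
            n_head=max(1, n_embd // 32),  # illustrative head count
            n_positions=64,               # 64-token sequences (see below)
        )
        model = GPT2LMHeadModel(config)
        print(f"{n_layer} layers, d={n_embd}: "
              f"{model.num_parameters():,} parameters")
```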
These models were trained on both synthetic sequences and 64-token text sequences sampled from the FineWeb dataset. Careful dataset construction ensured minimal interference from generalization in the experiments.
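For the synthetic setting, sequences of uniformly random tokens are a natural choice, since they contain no learnable structure: anything a model can reproduce about them must be stored rather than generalized. A minimal sketch of such a construction (my illustration, assuming PyTorch; the vocabulary size is an assumption):

```python
# Sketch: build a dataset of uniformly random token sequences.
# With no underlying structure to generalize, any reduction in a
# model's loss on these sequences reflects pure memorization.
import torch

vocab_size = 50257      # GPT-2 vocabulary size (assumption)
seq_len = 64            # sequence length from the article
num_sequences = 100_000

synthetic_data = torch.randint(
    low=0, high=vocab_size, size=(num_sequences, seq_len)
)
# Each token carries log2(vocab_size) ~= 15.6 bits of information,
# so the dataset holds roughly num_sequences * seq_len * 15.6 bits.
print(synthetic_data.shape)  # torch.Size([100000, 64])
```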
Model capacity and key findings
- Bits per parameter: Across configurations, models consistently stored between 3.5 and 3.6 bits per parameter (a back-of-the-envelope check of what this implies follows the list).
- Double descent: Once the training dataset size exceeds the model's capacity, test loss initially worsens (overfitting), then improves again as the models begin to generalize.
- Impact of precision: Training in float32 slightly increases storage capacity (to ~3.83 bits per parameter) compared with bfloat16 (~3.51 bits per parameter).
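To put 3.6 bits per parameter in perspective, a quick back-of-the-envelope calculation using the 8B-parameter, 15-trillion-token example from the introduction (my arithmetic; the 1-bit-per-token figure is a deliberately conservative assumption about the information content of web text):

```python
# Back-of-the-envelope: how much can an 8B-parameter model store at
# ~3.6 bits per parameter, versus the information in its training set?
params = 8e9
capacity_bits = params * 3.6            # ~2.9e10 bits
capacity_gb = capacity_bits / 8 / 1e9   # ~3.6 GB

tokens = 15e12
data_bits = tokens * 1.0                # >= 1.5e13 bits, conservatively

print(f"Model capacity: ~{capacity_gb:.1f} GB")
print(f"Capacity as a share of data information: {capacity_bits / data_bits:.3%}")
# ~0.19%: such a model can store only a tiny fraction of its training
# data, which is why generalization must account for the rest.
```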
Disentangling memorization and generalization
By training models on both synthetic datasets and real text, the team observed:
- Per-sample unintended memorization increases with the number of parameters.
- Memorization decreases as the size of the training set increases.
- Accurately estimating a model's memorization requires deduplication and an oracle reference model to establish baseline compression rates (a sketch of this estimate follows the list).
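A minimal sketch of the compression-gap idea, implementing the approximate formula given earlier (my illustration, assuming PyTorch and two causal language models whose forward pass returns logits; the paper's exact estimator may differ):

```python
# Sketch: estimate per-sample unintended memorization as the gap in
# code length (bits) between an oracle/reference model and the
# trained model. Illustrative only, not the authors' code.
import math
import torch
import torch.nn.functional as F

def bits_to_encode(model, tokens: torch.Tensor) -> float:
    """Negative log2-likelihood of a token sequence under a causal LM."""
    with torch.no_grad():
        logits = model(tokens.unsqueeze(0)).logits  # (1, T, vocab)
    # Predict token t+1 from the prefix ending at token t.
    nll_nats = F.cross_entropy(logits[0, :-1], tokens[1:], reduction="sum")
    return nll_nats.item() / math.log(2)  # nats -> bits

def unintended_memorization(trained_model, oracle_model, tokens) -> float:
    gap = (bits_to_encode(oracle_model, tokens)
           - bits_to_encode(trained_model, tokens))
    return max(0.0, gap)  # clip: a model cannot "negatively" memorize
```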
Scaling laws for membership inference
The researchers modeled the success rate (F1 score) of loss-based membership inference as a function of the ratio between model capacity and dataset size. Key observations (a toy version of the loss-based attack is sketched after the list):
- Membership inference becomes unreliable as dataset size grows.
- The predictive scaling laws remain accurate to within 1-2% for models of up to 1.5 billion parameters.
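To make "loss-based membership inference" concrete, here is a toy version of the attack being modeled (my illustration with NumPy and simulated losses; the paper fits scaling laws to predict the F1 of such attacks rather than running this exact code):

```python
# Toy loss-based membership inference: flag a sample as a training-set
# member if the model's loss on it falls below a threshold, then score
# the attack with F1. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
# Simulated per-sample losses: members tend to have lower loss.
member_losses = rng.normal(loc=2.0, scale=0.5, size=1000)
nonmember_losses = rng.normal(loc=3.0, scale=0.5, size=1000)

losses = np.concatenate([member_losses, nonmember_losses])
is_member = np.concatenate([np.ones(1000), np.zeros(1000)])

threshold = 2.5  # attack: predict "member" when loss < threshold
predicted = (losses < threshold).astype(float)

tp = np.sum((predicted == 1) & (is_member == 1))
fp = np.sum((predicted == 1) & (is_member == 0))
fn = np.sum((predicted == 0) & (is_member == 1))
f1 = 2 * tp / (2 * tp + fp + fn)
print(f"F1 = {f1:.3f}")  # ~0.84 with these well-separated distributions
# As dataset size outgrows model capacity, member and non-member loss
# distributions merge and F1 falls toward chance.
```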
Conclusion: a better understanding of model behavior
This work establishes a principled framework for measuring memorization in language models. By introducing quantifiable metrics and scalable experiments, it deepens our understanding of how transformer models encode training data and draws a clear boundary between memorization and generalization. The resulting insights can guide future developments in model evaluation, privacy, and interpretability.
Check out the paper. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
