In LLM pre-training, the quality of the training data is crucial to the resulting model's performance. A common strategy is to filter toxic content out of the training corpus to minimize harmful outputs. Although this approach aligns with the principle that neural networks reflect their training data, it introduces a trade-off. Removing toxic content can reduce the diversity and richness of the data, potentially weakening the model's ability to understand or identify toxicity and degrading performance on downstream tasks such as question answering. This creates a dilemma: keeping too much toxic data increases harmful outputs, while excessive filtering restricts the model's overall capabilities. However, with the growing emphasis on post-training interventions, fewer models are deployed directly after pre-training, which suggests that data quality and quantity can be managed more effectively across the two stages.
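To make the filtering step above concrete, here is a minimal sketch of threshold-based toxicity filtering over a corpus. The scorer is a purely illustrative keyword heuristic standing in for a real toxicity classifier; the function names, blocklist, and threshold are assumptions for demonstration only.

```python
# Minimal sketch of threshold-based toxicity filtering for a pre-training corpus.
# `score_toxicity` is a stand-in for a real classifier; the keyword heuristic
# below is purely illustrative, not a production filter.

def score_toxicity(text: str) -> float:
    """Placeholder scorer: fraction of tokens matching a small blocklist."""
    blocklist = {"idiot", "hate", "stupid"}  # illustrative only
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in blocklist for t in tokens) / len(tokens)

def filter_corpus(docs, threshold=0.05):
    """Keep documents whose toxicity score falls below the threshold."""
    return [d for d in docs if score_toxicity(d) < threshold]

corpus = [
    "The weather model predicts rain tomorrow.",
    "You are a stupid idiot and I hate you.",
]
print(filter_corpus(corpus))  # drops the second document
```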
Approaches to detoxifying LLMs generally fall into two categories: finetuning-based and decoding-based. Finetuning methods, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), align the model's behavior with human values or curated datasets. Although effective, they often compromise the original model's capabilities and can be bypassed or undone through additional training. Controlled generation techniques, on the other hand, adjust outputs during inference, using methods such as vocabulary shifting, self-debiasing, or external expert models. These strategies can reduce toxicity but often incur high computational costs and impair language fluency. A newer line of work explores editing internal representations, assuming that linear structures in hidden states can be manipulated to achieve specific behavioral outcomes.
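As a rough illustration of the decoding-based family, the sketch below contrasts logits from an "expert" (non-toxic) and an "anti-expert" (toxic) model to steer the base distribution at generation time, in the spirit of expert/anti-expert decoding. All logits here are random stand-ins; in practice they would come from a base LM and two small finetuned models, and the steering weight would be tuned.

```python
import numpy as np

# Hedged sketch of decoding-time detoxification via expert/anti-expert logit contrast.
# The three logit vectors are synthetic placeholders for per-token logits from a base
# LM, a model finetuned on non-toxic text (expert), and one finetuned on toxic text.

rng = np.random.default_rng(0)
vocab_size = 8
base_logits = rng.normal(size=vocab_size)
expert_logits = rng.normal(size=vocab_size)
antiexpert_logits = rng.normal(size=vocab_size)

alpha = 2.0  # steering strength: larger values push harder toward the expert
adjusted = base_logits + alpha * (expert_logits - antiexpert_logits)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

print("base distribution      :", softmax(base_logits).round(3))
print("detoxified distribution:", softmax(adjusted).round(3))
```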
Researchers from Harvard University are reassessed the quality of data in LLM training by exploring a co-design approach that integrates pre- and after training. They note that pre-training on toxic data, while increasing the toxicity of the basic model, improves the internal representation of the toxicity of the model, which facilitates deletion during post-training. Using OLMO-1B models formed on varied mixtures of clean and toxic data, they show that toxicity becomes more separable linearly and easier to control. Experiences with an incentive and inference intervention reveal an improvement in detoxification without compromising general performance, which suggests that the incorporation of toxic data can lead to more controllable and robust language models.
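The notion of "linearly separable" toxicity can be checked with a simple linear probe on hidden states: the better a logistic regression separates toxic from non-toxic activations, the more separable the concept. The sketch below uses synthetic Gaussian vectors as stand-ins for real OLMo-1B activations; the dimensions and offsets are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hedged sketch of a linear-probe separability check. Hidden states are synthetic
# Gaussians standing in for real model activations on toxic vs. non-toxic text.

rng = np.random.default_rng(0)
dim = 64
direction = rng.normal(size=dim)  # stand-in "toxicity direction"
clean = rng.normal(size=(500, dim))
toxic = rng.normal(size=(500, dim)) + 1.5 * direction / np.linalg.norm(direction)

X = np.vstack([clean, toxic])
y = np.array([0] * 500 + [1] * 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # higher accuracy ~ more separable
```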
To study the effects of toxic data on LLM pre-training, the researchers trained a series of OLMo-1B models with increasing proportions of toxic content (from 0% to 25%) while keeping the amount of clean data constant. They found that moderate inclusion of toxic data improves general language capability (measured by MMLU) and toxicity detection (via ToxiGen). Probing experiments revealed that models trained with toxic data formed stronger and more separable internal representations of toxicity. Statistical analysis and token-level visualization further confirmed that such models identify toxic content more precisely, suggesting that exposure to toxic examples improves concept learning without significantly harming general performance.
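A mixture sweep of this kind can be sketched as follows: hold the clean portion fixed and add enough toxic documents to reach each target fraction. The corpora, function name, and counts are placeholders; this is only meant to show the bookkeeping behind a 0%–25% sweep, not the paper's actual data pipeline.

```python
import random

# Hedged sketch of building pre-training mixtures with an increasing share of toxic
# documents while the clean portion stays fixed. `clean_docs` and `toxic_docs` are
# placeholders for the actual corpora.

def build_mixture(clean_docs, toxic_docs, toxic_fraction, seed=0):
    """Return a corpus where `toxic_fraction` of documents are toxic, clean count fixed."""
    rng = random.Random(seed)
    n_clean = len(clean_docs)
    # Solve n_toxic / (n_clean + n_toxic) = toxic_fraction for n_toxic.
    n_toxic = int(round(toxic_fraction * n_clean / (1 - toxic_fraction)))
    mixture = list(clean_docs) + rng.choices(toxic_docs, k=n_toxic)
    rng.shuffle(mixture)
    return mixture

clean_docs = [f"clean_{i}" for i in range(1000)]
toxic_docs = [f"toxic_{i}" for i in range(200)]
for frac in [0.0, 0.05, 0.10, 0.25]:
    mix = build_mixture(clean_docs, toxic_docs, frac)
    print(f"{frac:.0%} toxic -> {len(mix)} docs total")
```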
The study then examines whether exposure to toxic data during pre-training improves a model's ability to be detoxified by post-training methods. Using inference-time intervention (ITI), prompting, supervised finetuning (SFT), and DPO, the researchers find that models trained with up to 10% toxic data (for example, from 4chan) show improved alignability. These models respond better to detoxification techniques, achieving lower toxicity with minimal performance loss. Moreover, when tested against adversarial red-teaming attacks, models pre-trained on toxic data and steered with ITI showed greater robustness, indicating that such exposure can improve the model's internal representation of harmful content.
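At a high level, interventions of the ITI flavor shift hidden states along probe-derived directions at inference time. The sketch below removes the component of an activation along a "toxicity direction"; the direction, the activation, and the interpolation parameter are synthetic assumptions, and this is a simplified stand-in rather than the exact procedure used in the paper.

```python
import numpy as np

# Hedged sketch of an inference-time intervention: suppress the component of a hidden
# state along a learned "toxicity direction" before it propagates further. In practice
# the direction would come from a linear probe on the model's activations.

rng = np.random.default_rng(0)
dim = 64
toxicity_dir = rng.normal(size=dim)
toxicity_dir /= np.linalg.norm(toxicity_dir)

def intervene(hidden_state, direction, strength=1.0):
    """strength in [0, 1] interpolates between no change (0) and full removal (1)."""
    projection = hidden_state @ direction
    return hidden_state - strength * projection * direction

h = rng.normal(size=dim) + 3.0 * toxicity_dir  # a "toxic-leaning" activation
print("before:", float(h @ toxicity_dir))
print("after :", float(intervene(h, toxicity_dir) @ toxicity_dir))
```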

In conclusion, the study revisits the assumption that excluding toxic data during pre-training improves language model quality. Through theoretical and empirical analyses using OLMo-1B models, the authors show that increasing toxic data in pre-training leads to more disentangled representations of toxicity, which makes it easier to control during post-training. While base models trained on toxic data initially generate more harmful content, detoxification techniques such as ITI are more effective on them. Results on benchmark datasets show a better balance between toxicity reduction and preservation of general capabilities. The work suggests that some "bad" data can improve model steerability and alignment.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
