The quality of the data used in LLM pre-training has become increasingly critical to their success. To build information-rich corpora, researchers have moved from heuristic filtering methods, such as rule-based noise removal and deduplication, to model-driven filtering, which uses neural classifiers to identify high-quality samples. Despite its advantages, this approach still faces key problems: it lacks efficient validation mechanisms to assess data quality quickly, and it often relies on manually curated seed datasets that introduce subjectivity. While early datasets like C4 and The Pile laid the groundwork for model development, recent efforts such as RefinedWeb, Dolma, and DCLM have scaled up considerably, incorporating up to trillions of tokens. Model-driven filtering has gained traction in these newer corpora for its ability to refine massive datasets and improve LLM performance on downstream tasks.
However, the effectiveness of model-driven filtering is limited by the high cost and inefficiency of current validation methods and by the absence of clear standards for selecting seed data. Recent datasets, such as FineWeb-Edu and Ultra-FineWeb, have demonstrated improved model performance by using multiple classifiers to refine data quality. These datasets outperform their predecessors on benchmarks like MMLU, ARC, and C-Eval, indicating that refined filtering methods can improve both English and Chinese language understanding. To further optimize this process, some studies propose using LLMs for multi-dimensional data evaluation via prompts, or leveraging token-level perplexity scores. These innovations aim to reduce computational overhead while improving data quality, ultimately enabling more efficient training with fewer tokens.
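To make the perplexity-based idea concrete, the snippet below is a minimal sketch of scoring documents with a small causal language model. The model choice (GPT-2 via Hugging Face Transformers), the truncation length, and the notion of ranking by score are illustrative assumptions, not the setup used in the studies cited above.

```python
# Minimal sketch: score documents by token-level perplexity with a small causal LM.
# Illustrative only; the actual models and thresholds used in the cited work differ.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM serves for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    # Average next-token loss over the document, exponentiated into perplexity.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

docs = [
    "The mitochondria is the powerhouse of the cell.",
    "asdf qwer zxcv 1234 !!!",
]
for doc in sorted(docs, key=perplexity):  # lower perplexity ~ more fluent text
    print(round(perplexity(doc), 1), doc[:40])
```

Lower perplexity loosely tracks fluency; in practice such scores are typically combined with other quality signals rather than used as the sole filter.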
Researchers from ModelBest Inc., Tsinghua University, and Soochow University have developed an efficient data filtering pipeline to improve LLM training. They introduced a verification strategy that uses a nearly-trained LLM to assess new data by observing performance gains during the final training steps, reducing computational costs. A lightweight fastText-based classifier further improves filtering speed and accuracy. Applied to the FineWeb and Chinese FineWeb datasets, this method produced the Ultra-FineWeb dataset, containing approximately 1 trillion English and 120 billion Chinese tokens. LLMs trained on Ultra-FineWeb showed notable performance gains, confirming the pipeline's effectiveness in improving both data quality and training efficiency.
The study describes an efficient, high-quality data filtering pipeline designed to reduce computational costs while maintaining data integrity. It begins with a cost-effective verification strategy that selects reliable seed samples from a candidate pool; these are then used to train a data classifier. Positive seeds are drawn from LLM annotations, curated datasets, textbooks, and synthesized content, while negatives come from diverse corpora. Classifier training avoids overfitting, focusing instead on the selection of high-quality seeds. A fastText-based classifier is used for scalable filtering, offering competitive performance at significantly lower inference cost than LLM-based methods, with preprocessing steps ensuring balanced and clean data input (see the sketch below).
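As a rough illustration of the fastText-style filtering step described above, the sketch below trains a supervised fastText classifier on labeled seed documents and keeps only documents scored as high quality. The file name, label names, hyperparameters, and confidence threshold are hypothetical placeholders, not the authors' actual configuration.

```python
# Minimal sketch of a fastText-based quality classifier for corpus filtering.
# "seeds.train" is a hypothetical file with one document per line, prefixed with
# __label__hq (high-quality seeds) or __label__lq (negatives from raw web text).
import fasttext

model = fasttext.train_supervised(
    input="seeds.train",
    lr=0.1,          # illustrative hyperparameters, not the paper's settings
    epoch=5,
    wordNgrams=2,
    dim=100,
)

def keep(document: str, threshold: float = 0.9) -> bool:
    # Predict the top label for a (newline-free) document and keep it only if the
    # classifier is confident the document is high quality.
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold

corpus = ["A well-written encyclopedia-style paragraph ...", "click here buy now !!!"]
filtered = [doc for doc in corpus if keep(doc)]
print(f"kept {len(filtered)} of {len(corpus)} documents")
```

Because fastText inference is cheap relative to LLM-based scoring, a classifier like this can be applied at corpus scale, which is where the pipeline's cost advantage matters most.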
The models were trained using Megatron-LM with the MiniCPM-1.2B architecture on 100B tokens. Evaluations used LightEval across English and Chinese benchmarks. The results show that models trained on Ultra-FineWeb consistently outperformed those trained on FineWeb and FineWeb-Edu, both individually and in mixed-data settings. Ultra-FineWeb-en achieved the highest average English score, while Ultra-FineWeb-zh improved performance on Chinese tasks. Ablation studies revealed that Ultra-FineWeb maintains balanced token lengths and benefits from the efficient filtering strategy, highlighting its superior quality and effectiveness in improving model performance.

In conclusion, the study presents Ultra-FineWeb, a high-quality multilingual dataset comprising approximately 1 trillion English tokens and 120 billion Chinese tokens. Built on FineWeb and Chinese FineWeb, it uses a novel, efficient data filtering pipeline with a lightweight fastText-based classifier and a low-cost verification strategy. The pipeline improves filtering accuracy, reduces reliance on manual seed data selection, and delivers robust performance with minimal overhead. Experimental results show that models trained on Ultra-FineWeb consistently outperform those trained on earlier datasets, demonstrating improved performance across benchmarks. The methodology ensures reproducibility and offers valuable insights for optimizing data quality in future LLM training.
Check out the Paper and Dataset. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
