The effectiveness of pre-training and the generalization ability of large language models (LLMs) are strongly influenced by the quality and diversity of the underlying training corpus. Traditional data pipelines often treat quality and diversity as distinct objectives, applying quality filtering followed by domain balancing. This sequential optimization overlooks the complex interdependencies between the two factors: high-quality datasets frequently exhibit domain biases, while diversified datasets can compromise quality. Under a fixed training budget, it is essential to optimize both dimensions simultaneously in order to maximize model performance. However, jointly defining and optimizing quality and diversity remains a non-trivial challenge.
ByteDance presents QuaDMix
ByteDance introduces QuaDMix, a unified data selection framework that systematically balances quality and diversity during LLM pre-training. QuaDMix evaluates each data sample against multiple quality criteria and domain classifications, and determines its sampling probability via a parameterized function. The framework uses proxy-model experiments combined with LightGBM-based regression to predict downstream performance, enabling efficient parameter optimization without exhaustive large-scale training. Experiments show that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks compared to methods that optimize quality and diversity separately, highlighting the effectiveness of the joint approach.

QuaDMix operates in three main stages: feature extraction, quality aggregation, and quality-aware sampling. Initially, each document is annotated with domain labels and multiple quality scores. These scores are normalized and merged using domain-specific parameters to compute an aggregated quality score. Documents are then sampled according to a sigmoid-based function that prioritizes higher-quality samples while maintaining domain balance through parameterized controls.
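These stages can be sketched in a few lines of Python. Note that the weighted-sum aggregation, the parameter names, and the per-domain values below are illustrative assumptions for the sketch, not the paper's exact formulation:

```python
import math

def aggregate_quality(scores, weights):
    """Merge multiple normalized quality scores into one aggregated score.
    Illustrative assumption: a simple domain-weighted sum."""
    return sum(w * s for w, s in zip(weights, scores))

def sampling_probability(quality, threshold, steepness):
    """Sigmoid-based sampling: higher aggregated quality yields a higher
    probability of being kept, controlled by per-domain parameters."""
    return 1.0 / (1.0 + math.exp(-steepness * (quality - threshold)))

# Hypothetical per-domain parameters (quality weights, sigmoid threshold/steepness).
domain_params = {
    "web":  {"weights": [0.5, 0.3, 0.2], "threshold": 0.6, "steepness": 8.0},
    "code": {"weights": [0.2, 0.4, 0.4], "threshold": 0.5, "steepness": 6.0},
}

doc = {"domain": "web", "quality_scores": [0.9, 0.7, 0.8]}  # scores normalized to [0, 1]
p = domain_params[doc["domain"]]
q = aggregate_quality(doc["quality_scores"], p["weights"])
prob = sampling_probability(q, p["threshold"], p["steepness"])
print(f"aggregated quality={q:.3f}, sampling probability={prob:.3f}")
```

Because the threshold and steepness are set per domain, the same aggregated quality can translate into different sampling probabilities in different domains, which is how quality ranking and domain balance are controlled jointly rather than in sequence.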
Optimization is carried out by training thousands of small proxy models under different parameter settings. A regression model, trained on these proxy experiments, predicts performance outcomes, enabling identification of the optimal sampling configuration. This method allows structured exploration of a high-dimensional parameter space, aligning data selection more closely with downstream tasks.
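Conceptually, this search amounts to fitting a regressor that maps sampling-parameter vectors to observed proxy scores, then ranking unseen candidate configurations by their predicted score. The sketch below uses synthetic data and ordinary least squares as a dependency-free stand-in for the LightGBM regressor described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for proxy experiments: each row is a sampling-parameter
# vector, each target is that proxy model's observed benchmark score.
n_experiments, n_params = 200, 5
X = rng.uniform(0, 1, size=(n_experiments, n_params))
true_w = np.array([0.4, -0.2, 0.3, 0.1, -0.1])       # hidden param -> score relation
y = X @ true_w + 0.01 * rng.normal(size=n_experiments)  # noisy proxy scores

# Fit a simple linear regressor with an intercept (least squares here is
# only a stand-in for the gradient-boosted regression used in practice).
w, *_ = np.linalg.lstsq(np.c_[X, np.ones(n_experiments)], y, rcond=None)

# Score a large pool of candidate configurations without training any of
# them, and pick the one predicted to perform best downstream.
candidates = rng.uniform(0, 1, size=(10_000, n_params))
pred = np.c_[candidates, np.ones(len(candidates))] @ w
best = candidates[np.argmax(pred)]
print("predicted-best configuration:", np.round(best, 3))
```

The key point is the asymmetry in cost: only a few hundred cheap proxy runs are needed to fit the regressor, after which arbitrarily many candidate configurations can be evaluated at the price of a matrix multiplication.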
QuaDMix offers several advantages:
- Unified optimization of data quality and domain diversity.
- Adaptability to task-specific requirements through the choice of proxy evaluation target.
- Computational efficiency by bypassing exhaustive full-scale retraining.
- Consistent improvements in downstream performance without increased compute budgets.
Experimental results and insights
The validation experiments were carried out on the RefinedWeb dataset, training 530M-parameter models from scratch. QuaDMix was compared against several baselines, including random selection, FineWeb-Edu, AskLLM, DCLM, DSIR, and RegMix. QuaDMix consistently outperformed these methods, achieving an average score of 39.5% across nine diverse benchmarks.
Key observations include:
- Joint optimization strategies consistently outperform isolated methods focused on quality or diversity alone.
- Proxy-model performance correlates strongly with large-scale model outcomes, validating the proxy-based approach.
- Data mixtures optimized for specific downstream tasks further improve performance on those tasks.
- Merging multiple quality criteria reduces the biases inherent in any single criterion and improves overall model robustness.
- Expanding token diversity beyond a certain threshold yields diminishing returns, underscoring the importance of curated quality over raw quantity.

Conclusion
QuaDMix offers a principled approach to data selection for LLM pre-training, addressing the long-standing challenge of simultaneously optimizing data quality and diversity. By integrating quality aggregation and domain-aware sampling into a unified framework, and by leveraging proxy-based optimization, QuaDMix establishes a scalable methodology for improving LLM pre-training efficiency. While there is room for future improvement, such as refining the parameter space and improving proxy-model fidelity, QuaDMix represents an important step toward more systematic and efficient data curation strategies for large-scale model development.
Check out the Paper.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a broad audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
