ByteDance Researchers Introduce Seed-Coder: A Model-Centric Code LLM Trained on 6 Trillion Tokens

by Brenden Burgess


Streamlining code LLM training via scalable, model-centric pipelines

Code data plays a key role in LLM training, benefiting not only coding tasks but also broader reasoning capabilities. While many open-source models rely on manual filtering and expert-designed rules to curate code datasets, these approaches are time-consuming, biased, and difficult to scale across languages. Proprietary models like Claude 3.7 and OpenAI o3 excel at coding tasks but share no details about their data. Even open-source models like DeepSeek and Qwen2.5 still depend heavily on human-designed filters. This dependence limits progress, echoing "the bitter lesson" that real breakthroughs come from scalable, data-driven methods, not hand-crafted heuristics.

Seed-Coder's model-first pipeline minimizes human involvement in pre-training

ByteDance researchers introduce Seed-Coder, a family of open-source 8B LLMs comprising base, instruct, and reasoning models, designed to reduce human involvement in code data curation. Instead of relying on manual rules, their model-centric pipeline uses LLMs to score and filter large-scale code data from sources such as GitHub and code-related websites, producing a 6-trillion-token dataset. The instruct model is fine-tuned using synthetic data and preference optimization, while the reasoning model improves multi-step code logic via long-chain-of-thought (LongCoT) reinforcement learning. Seed-Coder achieves top performance for its size, often surpassing larger models, and is openly shared to encourage further research and development.
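To make the model-centric idea concrete, here is a minimal sketch of an LLM-based quality filter for code files. The prompt wording, score scale, threshold, and the `llm` callable are illustrative assumptions, not ByteDance's actual pipeline; the point is that a judge LLM replaces hand-written curation rules.

```python
"""Sketch of LLM-based code-quality filtering (assumptions, not Seed-Coder's exact pipeline)."""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CodeFile:
    path: str
    language: str
    content: str


SCORING_PROMPT = (
    "Rate the following {language} file from 0 to 10 for readability, "
    "correctness, and educational value. Reply with a single number.\n\n"
    "{content}\n\nScore:"
)


def llm_quality_score(llm: Callable[[str], str], doc: CodeFile) -> float:
    """Ask the judge LLM for a 0-10 quality score; fall back to 0 on unparsable output."""
    reply = llm(SCORING_PROMPT.format(language=doc.language, content=doc.content[:4000]))
    try:
        return float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0


def filter_corpus(files: List[CodeFile],
                  llm: Callable[[str], str],
                  threshold: float = 6.0) -> List[CodeFile]:
    """Keep only files the judge LLM scores at or above the threshold."""
    return [f for f in files if llm_quality_score(llm, f) >= threshold]


if __name__ == "__main__":
    # Stub judge for demonstration; in practice this would call a served LLM.
    fake_llm = lambda prompt: "7"
    sample = [CodeFile("hello.py", "Python", "print('hello world')\n")]
    print(len(filter_corpus(sample, fake_llm)))  # -> 1
```

In a real deployment the scoring would be batched across billions of files, but the structure stays the same: score, threshold, keep.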

6-trillion-token corpus built with LLM quality filters on GitHub and web data

Seed-Coder is trained using a model-centric approach that minimizes manual intervention. The pre-training corpus comprises approximately 6 trillion tokens drawn from varied sources, including GitHub code, commit histories, and code-related web data. First, basic filtering removes files with syntax issues or inappropriate content. Then, large language models evaluate and score the remaining code, ensuring high-quality data without relying on hand-crafted rules. Pre-training proceeds in two stages: first on core code and web data, and later on more complex structures such as full repositories and long-context tasks like fill-in-the-middle, to strengthen the model's coding capabilities.
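As context for the fill-in-the-middle objective mentioned above, the sketch below builds a FIM training example from a source file. The sentinel token names follow the common StarCoder-style convention and are an assumption; the article does not specify Seed-Coder's exact special tokens.

```python
"""Sketch of fill-in-the-middle (FIM) example construction (sentinel tokens are assumed)."""

import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def make_fim_example(code: str, rng: random.Random) -> str:
    """Split a source file into prefix/middle/suffix so the model learns to
    generate the middle span given the surrounding context."""
    if len(code) < 3:
        return code  # too short to split; keep as plain left-to-right text
    lo, hi = sorted(rng.sample(range(1, len(code)), 2))
    prefix, middle, suffix = code[:lo], code[lo:hi], code[hi:]
    # Prefix-Suffix-Middle (PSM) ordering: the middle span becomes the target.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_fim_example("def add(a, b):\n    return a + b\n", rng))
```

Training on such examples is what lets a code model complete a gap in the middle of a file, not just continue text from the left.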

Post-training via instruction tuning and LongCoT enables multi-step code understanding

After pre-training, Seed-Coder undergoes further refinement through two post-training stages. First, the instruct model is trained with supervised fine-tuning on a diverse dataset of synthetic instructions generated and filtered by LLMs, helping it better understand and follow human prompts. Then, its performance is improved using direct preference optimization (DPO), which aligns the model's responses more closely with human preferences. For complex reasoning tasks, the reasoning model is enhanced with LongCoT reinforcement learning, which strengthens its ability to handle multi-step coding challenges. These steps considerably boost Seed-Coder's performance across diverse code generation and reasoning tasks.
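For readers unfamiliar with DPO, the snippet below sketches the standard DPO objective from Rafailov et al. (2023), which the preference-alignment stage builds on. It is not Seed-Coder's training code; tensor shapes and the beta value are illustrative assumptions.

```python
"""Minimal sketch of the standard DPO loss (illustrative, not Seed-Coder's training code)."""

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each argument is a (batch,) tensor of summed log-probabilities of a
    response under the trainable policy or the frozen reference model."""
    # Implicit reward of each response: log-ratio of policy vs. reference.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_rewards - rejected_rewards)
    # -log(sigmoid) pushes the preferred response above the rejected one.
    return -F.logsigmoid(logits).mean()


if __name__ == "__main__":
    b = 4  # toy batch of preference pairs
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(float(loss))
```

The appeal of DPO here is that it needs only preference pairs (a better and a worse response), no separate reward model or online RL loop.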

Seed-Coder excels in code generation, editing, and multi-step reasoning

Evaluation shows that all three Seed-Coder models, base, instruct, and reasoning, perform exceptionally well across a range of coding tasks. The base model outperforms other open-source models of similar size on code generation, posting strong scores on benchmarks like HumanEval and MultiPL-E. The instruct model excels at tasks requiring code editing and instruction following, leading evaluations such as CodeEditorBench and FullStack Bench. The reasoning model, trained with long-chain-of-thought techniques, shows exceptional multi-step problem-solving skills, particularly on difficult benchmarks like LiveCodeBench and Codeforces, even outperforming models several times its size.
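For context on how benchmarks like HumanEval and MultiPL-E report scores, here is the standard unbiased pass@k estimator from Chen et al. (2021); it is shown as background, not as Seed-Coder's actual evaluation harness, and the sample counts are made up.

```python
"""Unbiased pass@k estimator (Chen et al., 2021), shown for background only."""

from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that passed the unit tests.
    Returns the estimated probability that at least one of k samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 200 samples per problem, 37 of which pass.
    print(round(pass_at_k(200, 37, 1), 3))   # pass@1  ~ 0.185
    print(round(pass_at_k(200, 37, 10), 3))  # pass@10
```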

In conclusion, Seed-Coder is a family of efficient, high-performing open-source language models designed specifically for coding tasks. These models stand out by relying largely on LLMs rather than humans to filter and curate training data, which considerably reduces manual effort. Although trained on fewer tokens than some larger models, Seed-Coder delivers exceptional performance on tasks such as code generation, completion, editing, and reasoning. However, its general language understanding remains limited due to the absence of broad web data and mathematical content. Future updates aim to expand the model family and improve its capabilities across different model sizes.


Check out the Paper, Model series, GitHub page and Project page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
