Introduction
Large language models (LLMs) have shown substantial improvements in reasoning and accuracy thanks to reinforcement learning (RL) and test-time scaling techniques. Yet despite outperforming traditional unit test generation methods, most existing approaches, such as O1-Coder and UTGen, require supervision from ground-truth code. This supervision increases data collection costs and limits the scale of usable training data.
Limitations of existing approaches
Conventional unit test generation relies on:
- Software analysis methods, which are rule-based and rigid.
- Neural machine translation techniques, which often lack semantic alignment.
Although recent prompt-based and agentic methods improve performance, they still depend heavily on labeled code for fine-tuning. This dependence restricts adaptability and scalability, particularly in large-scale, real-world deployment scenarios.
CURE: A self-supervised co-evolutionary approach
Researchers from the University of Chicago, Princeton University, Peking University, and ByteDance Seed introduce CURE, a self-supervised reinforcement learning framework that jointly trains a code generator and a unit test generator without any ground-truth code.
CURE works through a self-play mechanism in which:
- The LLM generates both correct and incorrect code.
- The unit test generator learns to distinguish failure modes and refines itself accordingly.
This bidirectional co-evolution improves both code generation and verification without external supervision.
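To make this loop concrete, here is a minimal sketch of one self-play round, assuming a hypothetical run_in_sandbox helper for safe execution; it illustrates the interaction described above, not the paper's exact algorithm. Every sampled candidate code is executed against every sampled unit test, and both models' rewards are derived from the resulting pass/fail matrix.

```python
import numpy as np

def run_in_sandbox(code: str, test: str) -> bool:
    """Hypothetical sandbox: run a candidate solution plus one unit test and
    report pass/fail. A real system would isolate untrusted code properly."""
    env: dict = {}
    try:
        exec(code, env)   # define the candidate's functions
        exec(test, env)   # run one generated assertion against them
        return True
    except Exception:
        return False

def interaction_matrix(codes: list[str], tests: list[str]) -> np.ndarray:
    """(n_codes, n_tests) boolean matrix: entry [i, j] is True iff sampled
    code i passes sampled unit test j. CURE derives rewards for both the
    coder and the unit tester from patterns in this matrix."""
    return np.array([[run_in_sandbox(c, t) for t in tests] for c in codes])
```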

Architecture and methodology
Base models and sampling strategy
CURE is built on Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct, with Qwen3-4B used for the long chain-of-thought (long-CoT) variant. Each training step samples:
- 16 candidate code completions.
- 16 task-derived unit tests.
Sampling is carried out using vLLM with temperature 1.0 and top-p 1.0. For long-CoT models, a response-length-aware transformation penalizes long outputs, improving inference-time efficiency.
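For illustration, the snippet below reproduces this sampling configuration with vLLM: 16 candidates per prompt at temperature 1.0 and top-p 1.0. The model name is the base model cited above; the prompt and the max_tokens budget are assumed placeholders.

```python
from vllm import LLM, SamplingParams

# Load one of the base models named above.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# 16 samples per prompt at temperature 1.0 and top-p 1.0, per the recipe;
# max_tokens is an assumed placeholder, not a value from the paper.
params = SamplingParams(n=16, temperature=1.0, top_p=1.0, max_tokens=1024)

task = "Write a Python function that reverses a singly linked list."
outputs = llm.generate([task], params)

# One RequestOutput per prompt; each carries the 16 candidate completions.
candidates = [completion.text for completion in outputs[0].outputs]
```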
Reward function and optimization
CURE introduces a mathematically grounded reward formulation designed to:
- Maximize reward precision, defined as the probability that correct code scores higher than incorrect code under the generated unit tests (made concrete in the sketch below).
- Apply response-length-based reward adjustments to long responses, reducing inference latency.
Optimization takes place via policy gradient methods, jointly updating the coder and unit tester to improve their mutual performance.
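As a concrete, simplified reading of this objective: given the pass/fail matrix from the self-play sketch earlier, score each candidate by the number of generated tests it passes, and estimate reward precision as the fraction of (correct, incorrect) pairs in which the correct code scores strictly higher. The correctness labels below are for illustration only; CURE's contribution is optimizing this quantity without such ground-truth labels.

```python
import itertools
import numpy as np

def reward_precision(passes: np.ndarray, is_correct: np.ndarray) -> float:
    """Fraction of (correct, incorrect) code pairs in which the correct code
    passes strictly more generated unit tests than the incorrect one.

    passes:     (n_codes, n_tests) boolean pass/fail matrix.
    is_correct: (n_codes,) boolean labels, used here only to illustrate the
                definition; CURE itself does not observe them.
    """
    scores = passes.sum(axis=1)  # tests passed per candidate
    pairs = list(itertools.product(scores[is_correct], scores[~is_correct]))
    return sum(c > w for c, w in pairs) / len(pairs) if pairs else float("nan")

# Toy example: 3 candidates x 4 generated unit tests.
passes = np.array([[1, 1, 1, 0],
                   [1, 0, 0, 0],
                   [1, 1, 0, 0]], dtype=bool)
is_correct = np.array([True, False, False])
print(reward_precision(passes, is_correct))  # -> 1.0
```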

Benchmark datasets and evaluation metrics
CURE is evaluated on five standard coding benchmarks:
- LiveBench
- MBPP
- LiveCodeBench
- CodeContests
- CodeForces
Performance is measured in terms of:
- Unit test accuracy
- One-shot code generation accuracy
- Best-of-N (BoN) accuracy using 16 code and test samples (the selection rule is sketched below).
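Here is a plausible minimal sketch of BoN selection with generated unit tests; picking the candidate that passes the most generated tests is a common rule and an assumption here, not necessarily the paper's exact procedure.

```python
import numpy as np

def best_of_n(passes: np.ndarray) -> int:
    """Return the index of the candidate passing the most generated tests.

    passes: (n_codes, n_tests) boolean matrix from executing each of the N
            sampled codes against the sampled unit tests (N = 16 above).
    """
    return int(passes.sum(axis=1).argmax())

# Example with 4 candidates and 3 generated tests.
passes = np.array([[1, 1, 0],
                   [1, 1, 1],   # passes all generated tests -> selected
                   [0, 0, 1],
                   [1, 0, 0]], dtype=bool)
print(best_of_n(passes))  # -> 1
```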

Performance and efficiency gains
The ReasonFlux-Coder models derived via CURE achieve:
- +37.8% in unit test accuracy.
- +5.3% in one-shot code generation accuracy.
- +9.0% in BoN accuracy.
Notably, ReasonFlux-Coder-4B achieves a 64.8% reduction in average unit test response length, substantially improving inference speed. Across all benchmarks, these models outperform traditional coding-tuned models (for example, Qwen2.5-Coder-Instruct).
Application to commercial LLMs
When ReasonFlux-Coder-4B is paired with GPT-series models:
- GPT-4o-mini gains +5.5% BoN accuracy.
- GPT-4.1-mini improves by +1.8%.
- API costs are reduced while performance improves, indicating a cost-effective setup for production inference pipelines (a pairing sketch follows this list).
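A rough sketch of such a pairing, under stated assumptions: the GPT-series model drafts N candidates through the OpenAI API, and a locally served ReasonFlux-Coder-4B unit tester scores them for Best-of-N selection. The score function below is a deliberately trivial stand-in for that scoring step.

```python
from openai import OpenAI

def score(code: str) -> int:
    """Stand-in for the real scorer: in the actual pipeline this would count
    how many ReasonFlux-Coder-4B-generated unit tests the candidate passes."""
    return len(code)  # placeholder heuristic, for illustration only

client = OpenAI()  # assumes OPENAI_API_KEY is set
task = "Write a Python function add(a, b) that returns a + b."

# Sample 16 candidates from the GPT-series model, then keep the one that
# scores highest under the generated unit tests (Best-of-N).
resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": task}],
    n=16,
    temperature=1.0,
)
best = max((choice.message.content for choice in resp.choices), key=score)
```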
Use as a reward model for label-free fine-tuning
Unit test generators trained with CURE can be reused as reward models in RL training: using ReasonFlux-Coder's generated unit tests as the reward signal enables fully label-free RL fine-tuning pipelines.
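A minimal sketch of this reward-model usage, with hypothetical generate_tests and run_test helpers standing in for the CURE-trained unit tester and a sandboxed executor; the pass-rate reward shown is a common illustrative choice, not the paper's exact signal.

```python
def generate_tests(task: str, n: int = 16) -> list[str]:
    """Hypothetical stand-in for the CURE-trained unit test generator."""
    return [f"assert solve({i}) == {i + 1}" for i in range(n)]

def run_test(code: str, test: str) -> bool:
    """Hypothetical sandboxed execution; True if the single test passes."""
    env: dict = {}
    try:
        exec(code, env)
        exec(test, env)
        return True
    except Exception:
        return False

def unit_test_reward(task: str, candidate: str) -> float:
    """RL reward = fraction of generated unit tests the candidate passes.
    No ground-truth labels are required anywhere in this loop."""
    tests = generate_tests(task)
    return sum(run_test(candidate, t) for t in tests) / len(tests)

# In an RL fine-tuning loop, this scalar replaces a label-derived reward:
print(unit_test_reward("toy task", "def solve(x):\n    return x + 1"))  # 1.0
```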
Broader applicability and future directions
Beyond BoN selection, the ReasonFlux-Coder models integrate seamlessly into agentic coding frameworks such as:
- MPSC (Multi-Perspective Self-Consistency)
- AlphaCode
- S*
These systems benefit from CURE's ability to refine code and tests iteratively. CURE also increases agentic unit test generation accuracy by 25.1%, reinforcing its versatility.
Conclusion
CURE represents a significant advance in self-supervised learning for code generation and validation, enabling large language models to jointly evolve their coding and unit test generation capabilities without relying on ground-truth code. By leveraging a co-evolutionary reinforcement learning framework, CURE not only boosts core performance metrics such as one-shot accuracy and Best-of-N selection but also improves inference efficiency through response-length optimization. Its compatibility with existing coding pipelines and its ability to operate as a label-free reward model make it a scalable and cost-effective solution for both training and deployment scenarios.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
