While large reasoning models (LRMs) have shown impressive capabilities in short-context reasoning through reinforcement learning (RL), these gains do not generalize well to long-context scenarios. Applications such as multi-document question answering, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100,000 tokens. However, RL optimization in such regimes suffers from slower reward convergence, unstable policy updates driven by KL divergence fluctuations, and reduced exploration caused by entropy collapse. These bottlenecks reveal a fundamental gap in moving LRMs from short-context proficiency to long-context generalization.
QwenLong-L1: A structured RL framework for long-context adaptation
To address these limitations, the Qwen research team introduces QwenLong-L1, a novel RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured in three key stages:
- Warm-up Supervised Fine-Tuning (SFT): Provides a stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual comprehension and answer extraction.
- Curriculum-Guided Phased Reinforcement Learning: Introduces a staged training process with gradually increasing context lengths. This progression allows the model to incrementally acquire long-context reasoning behaviors without destabilizing policy updates.
- Difficulty-Aware Retrospective Sampling: Improves exploration by retaining and reusing hard examples from earlier phases, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs; a minimal scheduling sketch appears below.
These stages are complemented by hybrid reward mechanisms, combining exact rule-based match verification with semantic evaluation by a lightweight LLM, to balance precision and recall during policy training.
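The phased curriculum and retrospective sampling can be pictured as a simple phase schedule plus a difficulty-weighted replay buffer. The sketch below is illustrative only: the intermediate phase boundaries, difficulty scores, replay ratio, and field names are assumptions rather than values from the paper.

```python
import random

# Hypothetical phase schedule: each RL phase raises the maximum input length.
# The 20K -> 60K progression mirrors the article; the intermediate step is assumed.
PHASES = [
    {"max_context_tokens": 20_000},
    {"max_context_tokens": 40_000},
    {"max_context_tokens": 60_000},
]

def build_phase_batch(pool, replay_buffer, phase, batch_size=32, replay_ratio=0.25):
    """Mix fresh examples that fit the current context limit with hard
    examples retained from earlier phases (difficulty-weighted replay)."""
    limit = phase["max_context_tokens"]
    fresh = [ex for ex in pool if ex["context_tokens"] <= limit]

    n_replay = int(batch_size * replay_ratio) if replay_buffer else 0
    n_fresh = batch_size - n_replay

    batch = random.sample(fresh, min(n_fresh, len(fresh)))
    if n_replay:
        # Sample past examples in proportion to their recorded difficulty
        # (e.g. 1 - empirical pass rate), so harder items recur more often.
        weights = [ex["difficulty"] for ex in replay_buffer]
        batch += random.choices(replay_buffer, weights=weights, k=n_replay)
    return batch

def update_replay_buffer(replay_buffer, batch, pass_rates, threshold=0.5):
    """Retain examples the policy still fails on, tagged with their difficulty."""
    for ex, rate in zip(batch, pass_rates):
        if rate < threshold:
            replay_buffer.append({**ex, "difficulty": 1.0 - rate})
```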

Technical design and methodological advantages
QwenLong-L1 incorporates recent advances in group-relative RL optimization, notably GRPO and DAPO, to reduce the computational overhead associated with value estimation over long contexts:
- GRPO estimates the advantage by normalizing rewards within groups of sampled responses, eliminating the need for a separate value network and encouraging diverse generation patterns.
- DAPO adds mechanisms such as dynamic sampling, overlength penalty shaping, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length bias during training; a brief sketch of both ideas follows this list.
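Here is a minimal sketch of GRPO-style group-relative advantage estimation combined with asymmetric clip bounds in the policy loss. The clip values and tensor shapes are illustrative assumptions, not the paper's hyperparameters.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward against its own group of
    sampled responses, so no separate value network is needed.
    `rewards` has shape (num_groups, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_policy_loss(log_probs, old_log_probs, advantages,
                        clip_low=0.2, clip_high=0.28):
    """PPO-style surrogate with asymmetric clip bounds (in the spirit of DAPO's
    clip-higher trick) to keep low-probability tokens explorable and limit
    entropy collapse. All tensors are assumed to share the same shape."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```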
The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic judgment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid answer formats while still crediting correct responses expressed in varied notations and phrasings.
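A minimal sketch of this max-of-two-signals reward is shown below; the normalization helper and the `llm_judge` callable are placeholders standing in for the compact evaluator model, not the paper's actual implementation.

```python
from typing import Callable

def rule_based_reward(prediction: str, reference: str) -> float:
    """Deterministic signal: exact match after light normalization."""
    norm = lambda s: " ".join(s.lower().strip().split())
    return 1.0 if norm(prediction) == norm(reference) else 0.0

def hybrid_reward(prediction: str, reference: str,
                  llm_judge: Callable[[str, str], float]) -> float:
    """Final reward = max(rule-based check, semantic judge score).
    `llm_judge` stands in for a small evaluator LLM (e.g. Qwen2.5-1.5B)
    prompted to score semantic equivalence in [0, 1]."""
    r_rule = rule_based_reward(prediction, reference)
    if r_rule == 1.0:  # cheap check already passes; skip the judge call
        return 1.0
    return max(r_rule, llm_judge(prediction, reference))

# Example with a trivial stand-in judge (a real one would query the LLM):
score = hybrid_reward("42 000 dollars", "$42,000", lambda p, r: 0.9)
```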
In addition, the framework uses progressive context scaling, where the RL process advances from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization.
Experimental results and benchmark performance
QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance:
- It outperformed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems such as OpenAI-o3-mini and Qwen3-235B-A22B.
- Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capability under extreme context lengths.
- Pass@k analysis revealed consistent improvements with increased sampling, reaching a Pass@2 average of 73.7 and surpassing DeepSeek-R1 and OpenAI-o1-preview even at low sampling rates.
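For reference, Pass@k is typically computed with the standard unbiased combinatorial estimator over n sampled answers per question; the sketch below assumes that convention, since the article does not spell out the exact evaluation protocol.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k answers
    drawn (without replacement) from n samples, of which c are correct,
    is correct."""
    if n - c < k:          # too few incorrect samples to fill a k-subset
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 16 samples per question, 10 judged correct, evaluating Pass@2
print(round(pass_at_k(n=16, c=10, k=2), 3))  # 0.875
```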

Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a decisive role in eliciting emergent reasoning behaviors such as grounding, subgoal setting, verification, and backtracking, traits not effectively induced by supervised fine-tuning alone.
Conclusion
QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. By combining supervised initialization, curriculum-driven context scaling, and hybrid reward strategies, its design effectively bridges the gap between short-context expertise and the demands of information-dense environments. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training.
Check out the Paper, the Model on Hugging Face, and the GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 95K+ ML SubReddit and subscribe to our Newsletter.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
