LongWriter-Zero: A Reinforcement Learning Framework for Ultra-Long Text Generation Without Synthetic Data

by Brenden Burgess


Introduction to the challenges of ultra-long text generation

Generating ultra-long texts that span thousands of words is becoming increasingly important for real-world tasks such as storytelling, legal writing, and educational material. However, large language models still face significant challenges, including length limits and quality degradation as their outputs grow longer. Common failure modes include incoherence, topic drift, repetition, and poor structure. Previous methods, such as LongWriter, rely on supervised fine-tuning over synthetic data to address this problem; however, such data is expensive to create, difficult to generate, and often feels unnatural. Moreover, depending on existing LLMs to create training data limits creativity, and typical training methods do not effectively improve the overall coherence or formatting of long outputs.

Evolution of long-form text generation methods

Recent research on long-form text generation has focused on improving coherence, personalization, and extending output length beyond 2,000 words. Early systems such as Re3 and DOC used recursive strategies to maintain structure, while LongLaMP and others introduced personalization through reasoning-driven self-training. Suri built a large instruction-following dataset but was limited to outputs under 5,000 tokens because of its dependence on feedback. LongWriter advanced this further, generating 6K–20K-token outputs through supervised fine-tuning and preference optimization, though it retained the biases of its teacher models. On another front, RL has improved reasoning in LLMs such as DeepSeek-R1 and QwQ-32B, but RL remains under-explored for ultra-long text generation.

LongWriter-Zero: reinforcement learning without synthetic data

Researchers from Tsinghua University and SUTD present LongWriter-Zero, an approach that uses RL to train LLMs for ultra-long text generation without any annotated or synthetic data. Starting from the Qwen2.5-32B base model, they apply RL with carefully designed reward models targeting the length, quality, and structure of the text. Their framework is inspired by RL's success on mathematics and coding tasks and explores three key factors: reward design, test-time scaling, and continual pretraining. LongWriter-Zero surpasses traditional fine-tuning methods, achieving state-of-the-art performance on WritingBench and Arena-Write and even exceeding 100B+ models such as DeepSeek-R1.

New optimization strategy and benchmark comparison

The study introduces a reinforcement learning approach to improve ultra-long text generation with LLMs. The researchers build on PPO with Group Relative Policy Optimization (GRPO), training a 32B-parameter model on instruction-following data with a 14K-token output limit. They evaluate outputs using a new benchmark, Arena-Write, and design a reward system that balances text length, fluency, coherence, and format. A key insight is that having the model "think" before writing, using intermediate reasoning steps, leads to better structure and control. Further gains come from continual pretraining on writing-heavy data, underscoring the importance of a strong, writing-oriented base model.
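As a rough illustration of this setup, the sketch below shows how a composite reward balancing length, fluency, coherence, and format could feed GRPO-style group-normalized advantages. All scoring functions, weights, and the target length are placeholder assumptions, not the paper's actual reward models.

```python
import statistics

# Hypothetical per-dimension scorers; in practice these would be learned
# reward models or LLM judges, not simple heuristics.
def length_reward(text: str, target_tokens: int = 14000) -> float:
    """Reward approaching the target length (crude whitespace token proxy)."""
    return min(len(text.split()) / target_tokens, 1.0)

def fluency_reward(text: str) -> float:
    return 0.8  # placeholder for a learned fluency/quality model

def coherence_reward(text: str) -> float:
    return 0.7  # placeholder for a learned coherence model

def format_reward(text: str) -> float:
    """Very rough structural check: does the output have several paragraphs?"""
    return 1.0 if text.count("\n\n") >= 3 else 0.5

def composite_reward(text: str) -> float:
    # Equal weighting is an assumption; the actual balance would be tuned.
    parts = [length_reward(text), fluency_reward(text),
             coherence_reward(text), format_reward(text)]
    return sum(parts) / len(parts)

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize rewards within the group of samples
    drawn for the same prompt, so no separate value network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Usage: score a group of candidate long-form outputs for one prompt.
candidates = ["draft one ...", "draft two ...", "draft three ..."]
rewards = [composite_reward(c) for c in candidates]
print(group_relative_advantages(rewards))
```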

Results on long-form generation benchmarks

LongWriter-Zero is evaluated through a two-step process: continual pretraining on long-form books using 30 billion tokens, followed by reinforcement learning over 150 steps with "think" prompts to encourage reasoning. It scores 8.69 on WritingBench, outperforming GPT-4o (8.16), Qwen2.5-Max (8.37), and DeepSeek-R1 (8.55), and leads in five of the six domains. On Arena-Write, it reaches the highest Elo score of 1447. Removing the "think" prompts or the continual pretraining causes major performance drops, confirming their importance. The model also reaches a 98.2% win rate in GPT-4.1-based comparisons, with human evaluations validating its strength in long-form writing.
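To picture the "think" prompting step, here is a minimal sketch of a think-then-write prompt wrapper and a helper that strips the reasoning before the output is scored. The tag names and wording are assumptions for illustration, not the exact template used in the paper.

```python
# Illustrative think-then-write wrapper; template wording is an assumption.
THINK_TEMPLATE = (
    "{instruction}\n\n"
    "First, think step by step inside <think>...</think>: outline the "
    "sections, the key points per section, and the target length. "
    "Then write the full response after the closing </think> tag."
)

def build_prompt(instruction: str) -> str:
    return THINK_TEMPLATE.format(instruction=instruction)

def strip_thinking(output: str) -> str:
    """Keep only the final answer; the plan inside <think> would be
    dropped before the output is scored by the reward models."""
    marker = "</think>"
    return output.split(marker, 1)[1].strip() if marker in output else output

print(build_prompt("Write a 10,000-word report on renewable energy policy."))
```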

Conclusion and future perspectives on reward design

In conclusion, LongWriter-Zero offers a reinforcement learning approach to ultra-long text generation that avoids the need for synthetic or labeled datasets. Built on Qwen2.5-32B and trained from scratch, it uses reward models that target length control, writing quality, and formatting. It achieves the best scores on WritingBench (8.69) and Arena-Write (Elo 1447), outperforming GPT-4o (8.16), DeepSeek-R1 (8.55), and Qwen3-235B-A22B (Elo 1343). Human and GPT-4.1-based evaluations show win rates of up to 98.2%. However, it still suffers from reward-model hacking, such as inflating length through repetition or inserting keywords like "quantum entanglement" to obtain higher scores. Addressing these limitations will require better reward design and human-in-the-loop strategies.
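As a hedged illustration of the length-hacking failure mode, the heuristic below flags outputs that pad their length with repeated n-grams. It is not part of LongWriter-Zero, just one simple signal a stronger reward design might incorporate.

```python
from collections import Counter

def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-grams that are duplicates; high values suggest the
    output is inflating its length by repeating itself."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return repeated / len(ngrams)

# A reward model could subtract a penalty when repetition is excessive.
sample = "the results show that " * 50
print(f"repetition ratio: {repetition_ratio(sample):.2f}")  # close to 1.0
```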


Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.


