Long chain-of-thought (CoT) reasoning improves the performance of large language models on complex tasks but comes with drawbacks. The typical "think-then-answer" approach slows response times, disrupting real-time interactions such as chatbots. It also risks inaccuracies, because errors in earlier reasoning steps can lead to a misleading final answer. Unlike humans, who often share partial thoughts or conclusions during conversations, LLMs delay their answers until all reasoning is finished. Although RL is commonly used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate information. There is growing interest in training models that alternate between thinking and answering, but this remains a challenge.
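To make the difference concrete, here is a purely illustrative sketch of the two output formats, assuming a simple <think>/<answer> tag scheme; the tags and the example question are hypothetical and not taken from any specific model:

```python
# Illustrative only: the same multi-hop question in "think-then-answer" style
# versus an interleaved style. Tag names and content are assumptions.

think_then_answer = """<think>
'Hamlet' was written by Shakespeare.
Shakespeare was born in Stratford-upon-Avon.
</think>
<answer>Stratford-upon-Avon</answer>"""
# The user sees nothing until the entire chain of thought is finished.

interleaved = """<think>'Hamlet' was written by Shakespeare.</think>
<answer>The play is by Shakespeare.</answer>
<think>Shakespeare was born in Stratford-upon-Avon.</think>
<answer>So the birthplace is Stratford-upon-Avon.</answer>"""
# Partial conclusions are surfaced as soon as they are reached.

print(interleaved)
```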
RL has become a popular method for improving reasoning in LLMs, building on its success in aligning models with human preferences. Two types of rewards commonly guide RL: outcome-based rewards (ORM), which focus on the final answer, and process-based rewards (PRM), which provide feedback on intermediate reasoning steps. Although PRMs offer more detailed supervision, they often rely on human annotation and additional models, which makes them complex and prone to problems such as reward hacking. In addition, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods to reduce latency and improve efficiency.
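The distinction can be summed up in a small sketch. The function names and scoring rules below are illustrative assumptions, not the implementation of any particular reward model:

```python
# Minimal sketch of outcome-based (ORM) versus process-based (PRM) rewards,
# assuming a task where the final answer and each step can be checked directly.

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """ORM-style: a single scalar based only on the final answer."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_reward(steps: list[str], step_verifier) -> list[float]:
    """PRM-style: one score per intermediate reasoning step.
    `step_verifier` stands in for a learned or rule-based step judge."""
    return [1.0 if step_verifier(step) else 0.0 for step in steps]

# Toy usage
print(outcome_reward("42", "42"))                                          # 1.0
print(process_reward(["2 * 21 = 42", "42 is even"], lambda s: "42" in s))  # [1.0, 1.0]
```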
Researchers from Apple and Duke University have introduced interleaved reasoning, a new RL approach that allows language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, the models provide informative intermediate answers, which gives users earlier feedback and helps guide the model's own reasoning. Using a simple rule-based reward, the model is trained to produce helpful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logical reasoning datasets, the method demonstrates strong generalization to more challenging benchmarks, such as MATH, GPQA, and MMLU.
The study proposes a reinforcement learning framework for training LLMs to perform interleaved reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate step, or "sub-answer," is shared once the model reaches a meaningful milestone in its reasoning. A specialized training model with
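A minimal sketch of what such a rule-based reward for interleaved outputs could look like is shown below, assuming alternating <think>/<answer> tags and reference sub-answers that can be string-matched; the weights, tag names, and matching rule are assumptions rather than the paper's exact recipe:

```python
import re

# Rule-based reward sketch for an interleaved <think>/<answer> rollout.
TAG_PATTERN = re.compile(r"<(think|answer)>(.*?)</\1>", re.DOTALL)

def interleaved_reward(output: str, sub_answers: list[str], final_answer: str) -> float:
    blocks = TAG_PATTERN.findall(output)
    answers = [text.strip() for tag, text in blocks if tag == "answer"]

    # 1) Format reward: blocks must strictly alternate think -> answer.
    tags = [tag for tag, _ in blocks]
    well_formed = len(tags) >= 2 and all(
        t == ("think" if i % 2 == 0 else "answer") for i, t in enumerate(tags)
    )
    format_r = 0.2 if well_formed else 0.0

    # 2) Intermediate reward: fraction of reference sub-answers that appear
    #    in some intermediate <answer> block (all but the last one).
    hits = sum(any(ref in a for a in answers[:-1]) for ref in sub_answers)
    intermediate_r = 0.3 * (hits / max(len(sub_answers), 1))

    # 3) Final-answer reward: the last <answer> block must contain the gold answer.
    final_r = 0.5 if answers and final_answer in answers[-1] else 0.0

    return format_r + intermediate_r + final_r
```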
The interleaved reasoning approach was evaluated on both familiar and unseen datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking from answering, the interleaved method provides answers incrementally, improving both speed and usefulness. When combined with intermediate rewards, it considerably improves model performance while reducing response delays by more than 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning for making AI systems more responsive and effective on multi-step, real-world reasoning tasks.
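As a rough illustration of how the latency gain could be quantified, one can count how many tokens a model emits before its first user-visible answer; this whitespace-token proxy is an assumption for illustration and not the paper's exact time-to-first-token metric:

```python
# Rough proxy for "time to first answer": tokens generated before the first
# <answer> tag appears. Whitespace tokenization is a simplifying assumption.

def tokens_to_first_answer(output: str) -> int:
    idx = output.find("<answer>")
    prefix = output if idx == -1 else output[:idx]
    return len(prefix.split())

def relative_ttfa_reduction(think_then_answer_out: str, interleaved_out: str) -> float:
    baseline = tokens_to_first_answer(think_then_answer_out)
    interleaved = tokens_to_first_answer(interleaved_out)
    return 1.0 - interleaved / max(baseline, 1)  # 0.8 means the first answer arrives ~80% sooner
```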

In conclusion, the study explores how interleaved reasoning, in which models alternate between reasoning and generating intermediate answers, can considerably improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training boosts accuracy and speeds up response generation. Different RL strategies were tested, with PPO showing stable results, alongside conditional and time-discounted reward schemes. The method scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, this approach applies simple rule-based rewards after complete reasoning steps are finished, thereby avoiding reward hacking. Ultimately, interleaved reasoning improves both reasoning quality and efficiency without relying on external tools.
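The conditional and time-discounted schemes mentioned above could be sketched roughly as follows; the gating rule and the discount factor are illustrative assumptions layered on top of per-step rule-based scores, not the authors' exact formulation:

```python
# Two intermediate-reward schedules, sketched on top of per-step scores.

def conditional_rewards(step_scores: list[float], final_correct: bool) -> list[float]:
    """Grant intermediate credit only if the final answer turns out correct."""
    return step_scores if final_correct else [0.0] * len(step_scores)

def time_discounted_rewards(step_scores: list[float], gamma: float = 0.9) -> list[float]:
    """Weight earlier sub-answers more heavily to encourage answering early."""
    return [score * (gamma ** i) for i, score in enumerate(step_scores)]

# Toy usage
scores = [1.0, 0.0, 1.0]
print(conditional_rewards(scores, final_correct=True))  # [1.0, 0.0, 1.0]
print(time_discounted_rewards(scores))                  # [1.0, 0.0, 0.81]
```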
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95K+ ML SubReddit and subscribe to our newsletter.
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
