Reinforcement Learning Makes LLMs Search-Savvy: Ant Group Researchers Introduce SEM to Optimize Tool Use and Reasoning Efficiency

by Brenden Burgess


Recent progress in LLMs has shown their potential to carry out complex reasoning tasks and to use external tools such as search engines effectively. Despite this, teaching models to make intelligent decisions about when to rely on internal knowledge versus when to search remains a key challenge. While simple prompt-based methods can guide models to invoke tools, LLMs still struggle with more nuanced behaviors, such as recognizing that an initial search was wrong and deciding to search again. Reinforcement learning (RL) has been explored to improve these behaviors by rewarding effective search use. However, RL often leads to unnecessary tool use, with models performing redundant searches even on simple tasks, highlighting inefficiencies that must be addressed.

Various RL strategies, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), have been used to align LLM behavior with human expectations. PPO helps balance exploration during learning with maintaining policy stability, while DPO simplifies alignment by directly optimizing model responses according to user preferences. GRPO introduces group-based evaluations to better capture subtle improvements in reasoning. Meanwhile, treating LLMs as autonomous agents that plan and execute multi-step reasoning tasks is gaining traction. Frameworks like AutoGPT and LangChain show how such agents can refine their outputs through iterative reasoning and search. However, current agent systems often depend on fixed prompts or heuristic tool use, limiting their adaptability and efficiency.
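To make the "group evaluations" idea behind GRPO concrete, here is a minimal sketch of the group-relative advantage computation; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def group_relative_advantages(group_rewards, eps=1e-8):
    """GRPO-style advantage: each sampled response for a prompt is scored
    relative to the mean and standard deviation of its own group of samples,
    removing the need for a separately learned value function."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four responses sampled for the same prompt, scored by some reward function:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
# Above-average responses get positive advantages, below-average ones negative.
```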

Ant Group researchers introduce SEM, a post-training reinforcement learning framework designed to teach LLMs when to use search tools and when to rely on internal knowledge. By training on a balanced dataset combining questions that do and do not require external retrieval, SEM guides the model to issue search requests only when necessary. Using a structured reasoning format and GRPO, the framework rewards accurate answers given without search and penalizes unnecessary tool use. The results show that SEM improves answer accuracy and efficiency, helping models better judge when external information is needed and thereby strengthening reasoning in complex scenarios.
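As a rough illustration of the balanced training data described above, the sketch below mixes retrieval-dependent and knowledge-answerable questions and tags each with a flag for whether search should be needed; the 50/50 split and field names are assumptions for illustration, not details from the paper.

```python
import random

def build_balanced_mix(musique_questions, mmlu_questions, seed=0):
    """Pair each question with a needs_search flag: MuSiQue items are treated
    as requiring retrieval, MMLU items as answerable from internal knowledge."""
    n = min(len(musique_questions), len(mmlu_questions))
    mix = ([(q, True) for q in musique_questions[:n]] +
           [(q, False) for q in mmlu_questions[:n]])
    random.Random(seed).shuffle(mix)
    return mix

# Tiny usage example with placeholder questions:
print(build_balanced_mix(["Who directed the film cited in the 2003 review?"],
                         ["What is the boiling point of water at sea level?"]))
```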

To integrate search tools into a model's reasoning process, SEM uses reinforcement learning to teach models when and how to search effectively. The training data combines MuSiQue (questions requiring external information) and MMLU (questions answerable from prior knowledge), helping models learn to judge when search is necessary. Using the GRPO framework, the model is rewarded for accurate and efficient answers, discouraging unnecessary searches and encouraging them when internal knowledge is insufficient. A structured response format, with separate segments for reasoning, search queries, retrieved results, and the final answer, standardizes training and allows precise reward assignment, improving both reasoning quality and search decisions.
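The incentive structure can be pictured roughly as below. The article does not give the paper's exact reward formulation, so the coefficients here are placeholders that only capture the direction of the incentives: correct answers without search on known questions, correct answers with search on questions that genuinely need retrieval.

```python
def sem_style_reward(answer_correct: bool, used_search: bool, needs_search: bool) -> float:
    """Illustrative reward shaping: penalize redundant searches on questions
    the model should answer internally, and reward search when it is needed."""
    if not needs_search:
        # MMLU-style question: best outcome is a correct answer with no search.
        if answer_correct and not used_search:
            return 1.0
        if answer_correct and used_search:
            return 0.2   # correct, but wasted a tool call
        return 0.0
    # MuSiQue-style question: a correct answer is expected to use retrieval.
    if answer_correct and used_search:
        return 1.0
    return 0.0
```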

The study evaluates a model trained to determine when to rely on its internal knowledge and when to use external search. It combines MuSiQue (unfamiliar questions) and MMLU (familiar questions) for training and assesses performance on datasets such as HotpotQA, GSM8K, and MMLU. The proposed SEM method outperforms baselines such as Naive RAG and ReSearch in both answer accuracy and search efficiency. SEM reduces unnecessary searches on known questions while improving reasoning on unfamiliar ones that require retrieval. Case studies and training curves confirm SEM's stable learning and intelligent decision-making. Overall, SEM improves retrieval decisions and internal reasoning in large language models.
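The efficiency claim comes down to tracking two numbers per benchmark: answer accuracy and how often the model searches. A minimal aggregation sketch, with hypothetical per-question records, is below.

```python
from statistics import mean

def summarize(results):
    """Aggregate answer accuracy and average number of search calls per question."""
    return {
        "accuracy": mean(r["correct"] for r in results),
        "avg_searches": mean(r["num_searches"] for r in results),
    }

# Hypothetical records: a known fact answered directly, an unfamiliar fact
# answered after one retrieval, and a miss after two retrievals.
results = [
    {"correct": 1, "num_searches": 0},
    {"correct": 1, "num_searches": 1},
    {"correct": 0, "num_searches": 2},
]
print(summarize(results))  # accuracy ~0.67, one search per question on average
```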

In conclusion, SEM is a post-training reinforcement learning framework designed to improve how large language models use external search tools. The model is trained on a dataset combining MuSiQue and MMLU, helping it distinguish questions it can answer internally from those that require external retrieval. SEM uses a structured reasoning approach and a reward function that penalizes unnecessary searches while promoting accurate and efficient retrieval. Experiments on benchmarks such as HotpotQA, GSM8K, and MMLU show that SEM reduces redundant searches and improves accuracy. This approach enhances reasoning efficiency and the intelligent use of external knowledge in LLMs.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95K+ ML SubReddit.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
