Reinforcement Learning, Not Fine-Tuning: Nemotron-Tool-N1 Trains LLMs to Use Tools with Minimal Supervision and Maximum Generalization

by Brenden Burgess


Equipping LLMs with external tools or functions has become popular, showing strong performance across a range of domains. Existing research relies on synthesizing large volumes of tool-use trajectories with advanced language models and applying SFT to improve the tool-calling ability of LLMs. The critical limitation lies in the inability of these synthetic datasets to capture explicit reasoning steps, resulting in shallow tool-call training. In many cases, reasoning is either omitted entirely during training or deferred to inference through prompting techniques. The result is pseudo-reasoning: models simply learn to imitate surface-level patterns without truly understanding the underlying decision-making process.

Existing research explores several approaches to improving LLM tool use, centered on two key strategies. The first focuses on dataset curation and model refinement, involving the creation of large-scale supervised datasets and the application of advanced training techniques such as SFT and DPO-based reinforcement learning. LLMs are combined with various external tools, including search engines, calculators, vision tools, and Python interpreters, to extend their functional capabilities. The second approach targets improved reasoning, shifting from traditional train-time scaling to more elaborate test-time strategies. Earlier methods relied on step-level supervision and learned reward models to guide reasoning trajectories.

Researchers from NVIDIA, Pennsylvania State University, and the University of Washington proposed the Nemotron-Research-Tool-N1 series to address the limitations of existing tool-use methods. It diverges from traditional SFT and reasoning-trace distillation techniques by adopting a distinct RL paradigm. Inspired by the success of DeepSeek-R1, a lightweight supervision method was developed that focuses on the structural validity and functional correctness of tool invocations. Nemotron-Research-Tool-N1 uses a binary reward mechanism that allows the model to develop reasoning strategies on its own, without relying on explicitly annotated reasoning trajectories.
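To make the binary reward concrete, here is a minimal Python sketch of how such a rule-based check might be computed: the response earns a reward of 1.0 only if it is structurally well-formed (reasoning and tool-call tags present) and the parsed calls functionally match the ground truth. The tag names, the `expected_calls` ground-truth format, and the helper function are illustrative assumptions, not the authors' exact implementation.

```python
import json
import re


def binary_tool_reward(response: str, expected_calls: list[dict]) -> float:
    """Illustrative rule-based binary reward: 1.0 only if the response has
    reasoning inside <think> tags, tool calls inside <tool_call> tags, and
    the parsed calls match the ground-truth calls; otherwise 0.0."""
    # Structural check: reasoning block and at least one tool-call block.
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    calls = re.findall(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if think is None or not calls:
        return 0.0

    # Functional check: every predicted call must be valid JSON and the set
    # of (name, arguments) pairs must match the ground truth, order-agnostic.
    try:
        parsed = [json.loads(c) for c in calls]
    except json.JSONDecodeError:
        return 0.0

    def canonical(call: dict) -> str:
        # Serialize with sorted keys so argument order does not matter.
        return json.dumps(
            {"name": call.get("name"), "arguments": call.get("arguments", {})},
            sort_keys=True,
        )

    if sorted(map(canonical, parsed)) == sorted(map(canonical, expected_calls)):
        return 1.0
    return 0.0
```

Because the signal is all-or-nothing and checks only format plus final correctness, the model is free to discover its own intermediate reasoning rather than imitate annotated traces.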

The researchers unify and prepare data from existing tool-calling datasets, including xLAM and a subset of ToolACE, which provide single-turn and multi-turn synthetic tool-calling trajectories. A lightweight prompt template is created to guide tool-call generation, with explicit instructions to place intermediate reasoning inside <think> tags and tool invocations inside <tool_call> tags. The template helps minimize rigid formatting constraints and reduces the risk of overfitting to specific prompt patterns. The main backbone model used is Qwen2.5-7B/14B-Instruct, and to assess the generalization ability of the proposed method, evaluations are also carried out on alternative backbone models, including several variants of the LLaMA family.
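For illustration, a minimal sketch of what such a lightweight prompt template could look like is shown below. The exact wording, the tool-schema format, and the `build_prompt` helper are assumptions made for this example, not the paper's verbatim prompt.

```python
import json

# Hypothetical template in the spirit of Tool-N1's lightweight format: the
# model reasons inside <think> tags and emits each call as JSON inside
# <tool_call> tags. Doubled braces are literal braces in the rendered prompt.
TOOL_PROMPT = """You are given a user request and a list of available tools.
First think about which tool(s) to call and why, inside <think>...</think>.
Then output each call as a JSON object {{"name": ..., "arguments": {{...}}}}
wrapped in <tool_call>...</tool_call>.

Available tools:
{tools}

User request:
{query}
"""


def build_prompt(tools: list[dict], query: str) -> str:
    """Fill the template with a JSON-serialized tool list and the user query."""
    return TOOL_PROMPT.format(tools=json.dumps(tools, indent=2), query=query)


if __name__ == "__main__":
    # Toy tool schema used only to show the rendered prompt.
    weather_tool = {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {"city": {"type": "string"}},
    }
    print(build_prompt([weather_tool], "Is it raining in Seattle right now?"))
```

Keeping the template this sparse is what lets the reward function, rather than a rigidly formatted dataset, shape how the model structures its reasoning.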

Results on the BFCL and API-Bank benchmarks show the superior performance of the Nemotron-Research-Tool-N1 models. On BFCL, the Tool-N1-7B/14B models outperform closed-source models such as GPT-4o and specialized fine-tuned models such as xLAM-2-70B and ToolACE-8B. The models also exceed SFT baselines trained on identical data sources, highlighting the effectiveness of the R1-style RL approach. The API-Bank benchmark further validates these findings, with Tool-N1-7B/14B achieving 4.12% and 5.03% higher accuracy than GPT-4o. These results demonstrate the potential of the proposed method to improve the tool-calling capabilities of large language models through a new reinforcement learning paradigm.

In conclusion, the researchers introduced Nemotron-Research-Tool-N1, a significant advance in LLM tool-use capabilities. The work marks a paradigm shift from traditional SFT methodologies by introducing a new rule-based RL approach. The proposed method allows models to develop sophisticated reasoning strategies without relying on explicitly annotated reasoning trajectories. Benchmark evaluations on BFCL and API-Bank consistently validate the effectiveness of the approach, showing substantial performance improvements over existing baselines. The findings open new avenues for developing more adaptable and intelligent language models that can autonomously generate reasoning strategies.


Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90K+ ML SubReddit.



Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.
