Large Reasoning Models (LRMs), built from LLMs using reinforcement learning (RL), have demonstrated strong performance on complex reasoning tasks, including mathematics, STEM, and coding. However, existing LRMs struggle with various puzzle tasks that require purely logical reasoning skills and are easy and intuitive for humans. Current puzzle-focused efforts concentrate only on designing evaluation benchmarks, without providing training methods and resources for modern LLMs to tackle this challenge. Current puzzle datasets lack diversity and scalability, covering limited puzzle types with little control over generation or difficulty. Moreover, following the success of the "LLM + RLVR" paradigm, it has become crucial to obtain large-scale, diverse, and challenging sets of verifiable puzzle prompts for training agents.
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key method for improving models' reasoning capabilities, removing the need for reward models by directly assigning rewards based on objectively verifiable answers. Puzzles are particularly well suited to RLVR. However, most prior RLVR research has overlooked puzzles' potential to provide effective reward signals. In LLM puzzle reasoning, existing benchmarks evaluate different types of reasoning, including abstract, deductive, and compositional reasoning. Few benchmarks support scalable generation and difficulty control, and those that do lack puzzle diversity. Furthermore, approaches to improving LLMs' puzzle-solving capabilities fall mainly into two categories: tool integration and RLVR.
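The core RLVR idea described above — replacing a learned reward model with a rule-based check against a known answer — can be sketched in a few lines. This is an illustrative sketch only; the function names (`extract_answer`, `verifiable_reward`) and the `Answer:` convention are assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a rule-based verifiable reward, as used in RLVR.
# Function names and the "Answer:" convention are illustrative assumptions.

def extract_answer(completion: str) -> str:
    """Pull the final answer from a model completion, assuming it
    follows a trailing 'Answer:' marker."""
    marker = "Answer:"
    idx = completion.rfind(marker)
    return completion[idx + len(marker):].strip() if idx != -1 else ""

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the
    objectively verifiable ground truth, else 0.0.
    No learned reward model is needed."""
    return 1.0 if extract_answer(completion) == ground_truth else 0.0

# Usage
print(verifiable_reward("Step 1... Step 2... Answer: 42", "42"))  # 1.0
print(verifiable_reward("Step 1... Step 2... Answer: 41", "42"))  # 0.0
```

Because the reward is computed deterministically from the ground truth, it cannot be gamed the way a learned reward model can, which is exactly why puzzles with checkable solutions suit RLVR so well.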
Researchers from ByteDance Seed, Fudan University, Tsinghua University, Nanjing University, and Shanghai Jiao Tong University have proposed Enigmata, the first comprehensive toolkit designed to equip LLMs with puzzle reasoning skills. It contains 36 tasks across seven categories, each with a generator that produces unlimited examples of controllable difficulty and a rule-based verifier for automatic evaluation. The researchers developed Enigmata-Eval as a rigorous benchmark and created optimized multi-task RLVR strategies. Enigmata's puzzle data improves state-of-the-art performance on mathematical and STEM reasoning tasks such as AIME, BeyondAIME, and GPQA when used to train larger models like Seed1.5-Thinking. This demonstrates the generalization benefits of puzzle training.
The Enigmata data comprises 36 puzzle tasks organized into 7 primary categories — Crypto, Arithmetic, Logic, Grid, Graph, Search, and Sequential puzzles — making it the only dataset spanning multiple task categories while offering scalability, automatic verification, and public availability. Data construction follows a three-phase pipeline: task collection and design, auto-generator and verifier development, and sliding difficulty control. In addition, Enigmata-Eval is built by systematically sampling from the broader dataset, aiming to extract 50 instances per difficulty level for each task. The final evaluation set contains 4,758 puzzle instances rather than the theoretical maximum of 5,400, due to inherent constraints: some tasks generate fewer instances per difficulty level.
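The generator/verifier pattern behind this pipeline can be illustrated with a toy arithmetic task. Everything below is a hypothetical sketch — the task, function names, and difficulty scheme are not Enigmata's actual API — but it shows the shape of the design: a generator with a difficulty knob, a rule-based verifier, and the eval-set sizing (the 5,400 figure implies 36 tasks × 3 difficulty levels × 50 instances).

```python
import random

# Hypothetical sketch of the generator/verifier pattern: each task
# exposes a generator with a difficulty parameter and a rule-based
# verifier. Names are illustrative, not Enigmata's real interface.

def generate_sum_puzzle(difficulty: int, rng: random.Random) -> dict:
    """Generate one arithmetic puzzle instance; higher difficulty
    means larger operands."""
    hi = 10 ** difficulty
    a, b = rng.randint(1, hi), rng.randint(1, hi)
    return {"prompt": f"What is {a} + {b}?", "answer": str(a + b)}

def verify(instance: dict, candidate: str) -> bool:
    """Rule-based check: exact match against the generated answer."""
    return candidate.strip() == instance["answer"]

# Eval-set sizing: 50 instances per difficulty level per task.
TASKS, LEVELS, PER_LEVEL = 36, 3, 50
print(TASKS * LEVELS * PER_LEVEL)  # 5400 (theoretical max; actual set: 4,758)

rng = random.Random(0)
inst = generate_sum_puzzle(difficulty=2, rng=rng)
print(verify(inst, inst["answer"]))  # True
```

Because each generator is parameterized by difficulty and paired with its own verifier, the same machinery yields both unlimited RLVR training prompts and a fixed, stratified evaluation set.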
The proposed model outperforms most public models on Enigmata-Eval with 32B parameters, demonstrating the effectiveness of the dataset and training recipe. The model stands out on the challenging ARC-AGI benchmark, surpassing strong reasoning models such as Gemini 2.5 Pro, o3-mini, and o1. Qwen2.5-32B-Enigmata shows exceptional performance in structured reasoning categories, excelling in Crypto, Arithmetic, and Logic tasks, which suggests effective development of rule-based reasoning capabilities. The model also shows competitive performance on Search tasks that require strategic exploration and planning skills. Moreover, Crypto and Arithmetic tasks tend to yield the highest accuracy, while spatial and sequential tasks remain more difficult.
In conclusion, the researchers introduced Enigmata, a comprehensive suite for equipping LLMs with advanced puzzle reasoning that integrates seamlessly with RL using verifiable, rule-based rewards. The Enigmata-trained model shows superior performance and robust generalization skills through RLVR training. Experiments reveal that when applied to larger models such as Seed1.5-Thinking (20B/200B parameters), synthetic puzzle data brings additional gains in other domains, including mathematics and STEM reasoning, over advanced models. Enigmata provides a solid foundation for the research community to advance reasoning model development, offering a unified framework that effectively bridges logical puzzle-solving with broader reasoning capabilities in LLMs.
Check out the Paper, GitHub page, and Project page. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
