Apple researchers reveal structural failures in large reasoning models using puzzle-based evaluations

by Brenden Burgess


Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These new systems, known as large reasoning models (LRMs), are designed to simulate human thought by producing intermediate reasoning steps before arriving at conclusions. The emphasis has shifted from generating accurate outputs to understanding the process that leads to those answers. This shift has raised questions about how these models handle tasks with layered complexity, and whether they truly possess reasoning skills or simply exploit patterns learned during training to guess the results.

Redefining evaluation: going beyond final-answer accuracy

A recurring problem in evaluating machine reasoning is that traditional benchmarks mainly assess the final answer without examining the steps taken to reach it. Final-answer accuracy alone does not reveal the quality of the internal reasoning, and many benchmarks are contaminated by data the models may have seen during training. This creates a misleading picture of a model's real capabilities. To probe genuine reasoning, researchers need environments where problem difficulty can be precisely controlled and intermediate steps can be analyzed. Without these conditions, it is difficult to determine whether models generalize solutions or simply memorize patterns.

To assess reasoning more reliably, Apple's research team designed a setup using four puzzle environments: Tower of Hanoi, River Crossing, Checker Jumping, and Blocks World. These puzzles allow precise manipulation of complexity by changing elements such as the number of disks, checkers, or agents involved. Each task requires different reasoning capabilities, such as constraint satisfaction and sequential planning. Crucially, these environments are free of typical data contamination, allowing thorough checks of both final results and intermediate reasoning steps. This method provides a detailed view of how models behave across tasks of varying complexity, as the sketch below illustrates.
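To make the setup concrete, here is a minimal, illustrative sketch (not the paper's actual harness) of a Tower of Hanoi environment: difficulty is controlled by the number of disks, and a proposed move sequence can be checked step by step, so intermediate reasoning is inspectable rather than just the end state.

```python
# Illustrative sketch only (not the paper's code): a Tower of Hanoi environment where
# difficulty is set by the number of disks and a proposed move sequence is verified
# move by move, so intermediate reasoning can be scored, not just the final answer.

def validate_solution(n: int, moves: list[tuple[int, int]]) -> bool:
    """Check whether (from_peg, to_peg) moves legally transfer all n disks to peg 2."""
    pegs = [list(range(n, 0, -1)), [], []]   # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                     # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                     # illegal: larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # solved iff all disks ended on peg 2

def reference_solution(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Classic recursive solution, usable as a ground-truth move generator."""
    if n == 0:
        return []
    return (reference_solution(n - 1, src, dst, aux)
            + [(src, dst)]
            + reference_solution(n - 1, aux, src, dst))

if __name__ == "__main__":
    for n in (3, 5, 7):
        sol = reference_solution(n)
        print(f"{n} disks: {len(sol)} moves, valid = {validate_solution(n, sol)}")
```

The same pattern, a difficulty knob plus a step-level validator, carries over to River Crossing, Checker Jumping, and Blocks World, which is what makes the intermediate reasoning traces analyzable.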

The research introduced a comparative study using two model families, Claude 3.7 Sonnet and DeepSeek-R1, each in its "thinking" variant and its standard LLM counterpart. These models were tested on the puzzles under identical token budgets to measure both accuracy and reasoning efficiency. This helped reveal how performance shifts across low-, medium-, and high-complexity tasks. One of the most revealing observations was the emergence of three performance regimes. On simple tasks, the non-thinking models outperformed the reasoning variants. At medium complexity, the reasoning models gained an edge, while both types collapsed entirely as complexity peaked.
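A minimal sketch of this comparison protocol, under assumptions stated here: thinking and non-thinking variants are prompted on the same puzzle instances with an identical token budget, and accuracy is tallied per complexity level. The `query_model` function is a hypothetical placeholder, not an API from the paper.

```python
# Sketch of the comparison protocol under a shared token budget; `query_model` is a
# hypothetical stand-in for an inference backend and must be supplied by the user.

from collections import defaultdict
from typing import Callable

def query_model(model: str, prompt: str, max_tokens: int) -> list[tuple[int, int]]:
    """Hypothetical placeholder returning the model's proposed move list."""
    raise NotImplementedError("plug in a real inference call here")

def evaluate(models: list[str],
             complexities: list[int],
             trials: int,
             token_budget: int,
             check: Callable[[int, list[tuple[int, int]]], bool]) -> dict:
    """Pass rate per (model, complexity), with every model given the same budget."""
    results = defaultdict(list)
    for model in models:
        for n in complexities:
            for _ in range(trials):
                prompt = f"Solve Tower of Hanoi with {n} disks; answer as (from, to) moves."
                try:
                    moves = query_model(model, prompt, max_tokens=token_budget)
                    results[(model, n)].append(check(n, moves))
                except NotImplementedError:
                    results[(model, n)].append(False)
    return {key: sum(vals) / len(vals) for key, vals in results.items()}
```

A checker such as `validate_solution` from the earlier sketch can be passed as `check`; the design point the study emphasizes is the shared token budget, which keeps the accuracy-versus-effort comparison fair across variants.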

Comparative insights: thinking versus non-thinking models under stress

A deeper analysis revealed that reasoning effort increased with task difficulty up to a point, but then declined despite the availability of resources. For example, on the Tower of Hanoi, Claude 3.7 Sonnet (thinking) maintained high accuracy until complexity reached a certain threshold, after which performance fell to zero. Even when these models were supplied with explicit solution algorithms, they failed to execute the steps beyond specific complexity levels. In one case, Claude 3.7 Sonnet could produce about 100 correct moves for the Tower of Hanoi, yet could not complete a simpler River Crossing task requiring only 11 moves when N = 3. This inconsistency exposed serious limitations in symbolic manipulation and exact computation.
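As a rough illustration of why this contrast is striking (the figures below are basic arithmetic, not results from the paper): the optimal Tower of Hanoi solution requires 2^N - 1 moves, so about 100 correct moves already corresponds to handling a 6- or 7-disk instance, whereas the failing River Crossing instance needs only 11 moves in total.

```python
# Basic arithmetic, not paper data: exponential growth of the optimal Hanoi solution
# versus the short River Crossing instance cited above.
for n in range(3, 11):
    print(f"Tower of Hanoi, {n} disks: {2**n - 1} optimal moves")
print("River Crossing, N = 3: 11 moves (the reported failure case)")
```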

The performance breakdown also highlighted how LRMs manage their internal thinking process. The models frequently engaged in "overthinking," generating correct intermediate solutions early in the process but then continuing to explore incorrect paths. This led to inefficient token use. At medium complexity levels, the models began finding correct answers later in their reasoning chains. At high complexity levels, however, they failed to produce correct solutions at all. Quantitative analysis confirmed that solution accuracy dropped to zero as problem complexity increased, and that the number of reasoning tokens the models allocated began to decline unexpectedly.

Scaling limits and collapse of reasoning

This research presents a sobering evaluation of how large reasoning models (LRMs) currently operate. Apple's study clearly indicates that, despite real progress, today's reasoning models are still far from achieving generalizable reasoning. The work identifies where performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. The controlled puzzle environments proved to be a powerful tool for uncovering hidden weaknesses in these systems and highlight the need for more robust designs in the future.


Check out the paper. All credit for this research goes to the researchers of this project.


