Why Apple’s criticism of AI reasoning is premature

by Brenden Burgess


The debate over the reasoning capabilities of large reasoning models (LRMs) has recently been reignited by two prominent papers: Apple's “The Illusion of Thinking” and Anthropic's rebuttal, “The Illusion of the Illusion of Thinking.” Apple's paper asserts fundamental limits in LRMs' reasoning capabilities, while Anthropic argues that these claims stem from evaluation shortcomings rather than model failures.

Apple's study systematically tested LRMs in controlled puzzle environments and observed an “accuracy collapse” beyond specific complexity thresholds. Models such as Claude 3.7 Sonnet and DeepSeek-R1 failed to solve puzzles like Tower of Hanoi and River Crossing as complexity increased, even reducing their reasoning effort (token usage) at higher complexities. Apple identified three distinct complexity regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at medium complexity, and both collapse at high complexity. Critically, Apple concluded that these limits stem from the models' inability to apply exact computation and consistent algorithmic reasoning across puzzles.
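To give a sense of how steeply the difficulty is scaled in this setup, the optimal Tower of Hanoi solution for n disks takes exactly 2^n − 1 moves, so the length of a correct answer explodes well before any single step becomes harder. A minimal sketch of that growth (my own illustration, not code from either paper):

```python
# Minimal sketch (not from either paper): how fast the required solution length grows.
# The optimal Tower of Hanoi solution for n disks is exactly 2**n - 1 moves.
for n in (5, 8, 10, 12, 15):
    print(f"{n:2d} disks -> {2**n - 1:>6,} moves in the optimal solution")
# 5 -> 31, 8 -> 255, 10 -> 1,023, 12 -> 4,095, 15 -> 32,767
```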

Anthropic, however, challenges Apple's conclusions, identifying critical flaws in the experimental design rather than in the models themselves. They highlight three major problems:

  1. Token limits versus logical failures: Anthropic stresses that the failures observed in Apple's Tower of Hanoi experiments were driven mainly by output-token limits rather than by reasoning deficits. The models explicitly noted their token constraints and deliberately truncated their outputs. What looked like a “reasoning collapse” was therefore a practical limitation, not a cognitive failure (a back-of-envelope sketch after this list illustrates the arithmetic).
  2. Misclassification of reasoning breakdowns: Anthropic finds that Apple's automated evaluation framework misread these intentional truncations as reasoning failures. The rigid scoring method made no allowance for the models' awareness of, and decisions about, output length, so models were penalized unfairly.
  3. Unsolvable problems scored as failures: Perhaps most importantly, Anthropic shows that some of Apple's River Crossing benchmarks were mathematically impossible to solve (for example, instances with six or more actor–agent pairs and a boat capacity of three). Scoring these unsolvable instances as failures sharply biased the results, since it punished models for not solving puzzles that have no solution (the search sketch further below confirms the impossibility).
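As referenced in point 1, the arithmetic behind the token-limit argument is easy to sketch. Assuming roughly 10 output tokens per written-out move and a 64,000-token output budget (both figures are my own assumptions, not numbers from either paper), full enumeration stops fitting long before the reasoning itself becomes difficult:

```python
# Back-of-envelope sketch with assumed figures (not from either paper):
# ~10 output tokens per move like "move disk 3 from A to C", and a ~64k
# output-token ceiling for the model's answer.
TOKENS_PER_MOVE = 10      # assumption
OUTPUT_BUDGET = 64_000    # assumption

for n in range(8, 16):
    required = (2**n - 1) * TOKENS_PER_MOVE
    verdict = "fits" if required <= OUTPUT_BUDGET else "exceeds budget"
    print(f"n={n:2d}: ~{required:>7,} tokens to list every move -> {verdict}")
# Under these assumptions the full listing stops fitting around n = 13, which is
# consistent with a "collapse" driven by output length rather than by logic.
```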

Anthropic also tested an alternative representation format, asking models to provide concise solutions (such as a Lua function that generates the move sequence), and found high accuracy even on the complex puzzles previously labeled as failures. This result strongly indicates that the problem lay in the evaluation methods, not in the reasoning capabilities.
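The point of that format is that a short program constitutes a complete solution for any problem size. A hypothetical Python analogue of the representation Anthropic reportedly requested in Lua (an illustration, not the actual prompt or model output) is only a few lines:

```python
# Hypothetical Python analogue of the "return a program, not every move" format.
def solve_hanoi(n, src="A", aux="B", dst="C"):
    """Yield every move (from_peg, to_peg) of the optimal n-disk solution."""
    if n > 0:
        yield from solve_hanoi(n - 1, src, dst, aux)   # park n-1 disks on the spare peg
        yield (src, dst)                               # move the largest disk
        yield from solve_hanoi(n - 1, aux, src, dst)   # stack the n-1 disks back on top

# A grader can expand and verify the program instead of parsing thousands of listed moves.
assert sum(1 for _ in solve_hanoi(15)) == 2**15 - 1
```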

Another key point raised by Anthropic concerns the complexity metric used by Apple: compositional depth (the number of moves required). They argue that this metric conflates mechanical execution with genuine cognitive difficulty. For example, Tower of Hanoi puzzles require exponentially more moves as they scale, yet each individual decision is trivial, whereas puzzles like River Crossing involve far fewer moves but substantially higher cognitive complexity because of their constraint-satisfaction and search requirements.
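To make that contrast concrete, the sketch below brute-forces the river-crossing puzzle with a breadth-first search. The rules are my reconstruction of the classic “jealous” crossing constraint (an actor may not be with another pair's agent unless their own agent is present) and may differ in detail from Apple's exact formulation. Under these rules, the classic three-pair instance with a two-person boat is solved in eleven crossings, each of which requires constraint-aware search, while the six-pair instance with a three-person boat has no solution at all, the unsolvable case Anthropic flags in point 3:

```python
# BFS sketch (my reconstruction of the classic "jealous" river-crossing rules;
# Apple's exact formulation may differ). People come in actor/agent pairs; an
# actor may not be with another pair's agent, on a bank or in the boat, unless
# their own agent is also present.
from collections import deque
from itertools import combinations

def valid(group):
    """Check the pairing constraint for any group of people."""
    agents = {i for kind, i in group if kind == "agent"}
    return all(i in agents for kind, i in group if kind == "actor" and agents)

def min_crossings(n_pairs, capacity):
    """Minimum number of boat crossings, or None if the instance is unsolvable."""
    everyone = frozenset((kind, i) for kind in ("actor", "agent") for i in range(n_pairs))
    start, goal = (everyone, "left"), (frozenset(), "right")
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        (left, boat), depth = queue.popleft()
        if (left, boat) == goal:
            return depth
        bank = left if boat == "left" else everyone - left
        for size in range(1, capacity + 1):
            for passengers in map(frozenset, combinations(bank, size)):
                if not valid(passengers):
                    continue                      # constraint applies in the boat too
                new_left = left - passengers if boat == "left" else left | passengers
                if valid(new_left) and valid(everyone - new_left):
                    state = (new_left, "right" if boat == "left" else "left")
                    if state not in seen:
                        seen.add(state)
                        queue.append((state, depth + 1))
    return None

print(min_crossings(3, 2))   # 11: few moves, but every one needs constraint-aware search
print(min_crossings(6, 3))   # None: the instance type Anthropic identifies as unsolvable
```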

Both papers contribute significantly to our understanding of LRMs, but the tension between their findings exposes a critical gap in current AI evaluation practices. Apple's conclusion, that LRMs intrinsically lack robust, generalizable reasoning, is considerably weakened by Anthropic's critique. Instead, Anthropic's results suggest that LRMs are constrained by their testing environments and evaluation frameworks rather than by their intrinsic reasoning capacities.

Given these insights, future research and practical evaluation of LRMs should:

  • Clearly differentiate reasoning from practical constraints: Tests should account for practical realities such as token limits and the models' own decisions about output length.
  • Validate problem solvability: Ensuring that the puzzles or problems being tested actually have solutions is essential for fair assessment.
  • Refine complexity metrics: Metrics must reflect genuine cognitive challenge, not merely the number of mechanical execution steps.
  • Explore diverse solution formats: Evaluating LRM capabilities across different solution representations can better reveal their underlying reasoning strengths.

In the end, Apple's assertion that LRMs “cannot really reason” seems premature. Anthropic's rebuttal shows that LRMs possess sophisticated reasoning capabilities that can handle substantial cognitive tasks when evaluated appropriately. It also underscores how much careful, nuanced evaluation methods matter for truly understanding the capabilities and limits of emerging AI models.


Check out the Apple paper and the Anthropic paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a broad audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
