
Apple Study Questions AI’s True Reasoning Abilities

Summary

– Apple researchers found that simulated reasoning models (such as OpenAI’s o1/o3 and Anthropic’s Claude 3.7) rely on pattern matching rather than systematic thinking when solving novel problems.
– Their study, led by Parshin Shojaee and Iman Mirzadeh, analyzed “large reasoning models” (LRMs) that use chain-of-thought reasoning to solve problems step by step.
– The team tested AI models on classic puzzles (e.g., Tower of Hanoi, river crossing) scaled from simple to highly complex versions.
– Current AI evaluations focus on answer accuracy in math/coding benchmarks but fail to assess whether models truly reason or merely mimic training data.
– Both the Apple study and a separate USAMO (United States of America Mathematical Olympiad) study showed poor model performance (mostly under 5% success) on novel proofs, with severe degradation on extended reasoning tasks.

Apple researchers have raised important questions about whether current AI systems truly possess reasoning capabilities or simply excel at pattern recognition. A recent study from the tech giant suggests that even advanced language models struggle with novel problems requiring systematic thinking, performing more like sophisticated pattern-matchers than genuine reasoning engines.

The investigation focused on what scientists term “large reasoning models” – AI systems designed to simulate logical thought processes through step-by-step textual outputs. These models, including well-known versions from OpenAI, Anthropic, and DeepSeek, were tested against classic logic puzzles with varying complexity levels. The puzzles ranged from simple scenarios to versions requiring over a million computational steps.
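The Tower of Hanoi makes this scaling concrete: the optimal solution for n disks takes 2^n − 1 moves, so a 20-disk version already demands more than a million steps. A minimal, illustrative Python sketch (not the researchers’ actual code) shows the blow-up:

```python
# Illustrative sketch (not the study's code): the optimal Tower of Hanoi
# solution roughly doubles with every extra disk, which is how the puzzle
# scales from trivial to over a million required steps.

def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move sequence for n disks."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk
    hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top
    return moves

assert len(hanoi(10)) == 2**10 - 1       # 1,023 moves
for n in (3, 10, 20):
    print(f"{n} disks: {2**n - 1:,} optimal moves")  # 7 / 1,023 / 1,048,575
```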


Key findings revealed that when faced with unfamiliar mathematical proofs and complex logical challenges, the models performed poorly – most scoring below 5% accuracy. Only one system managed 25% accuracy, with zero perfect solutions across hundreds of attempts. This aligns with separate findings from mathematical olympiad researchers who observed similar limitations in AI problem-solving abilities.

The Apple team, led by Parshin Shojaee and Iman Mirzadeh, argues that current evaluation methods may be misleading. Standard benchmarks primarily measure final answer accuracy on established problems, potentially allowing models to leverage memorized patterns rather than demonstrating true reasoning skills. Their work suggests that more rigorous testing frameworks are needed to properly assess AI’s reasoning capacities.
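The distinction matters in practice. A toy sketch (again our illustration, not the study’s evaluation harness) contrasts grading only a final answer with verifying every step of a solution trace on a Tower of Hanoi instance:

```python
# Hypothetical illustration of the evaluation gap the researchers describe;
# the scoring functions below are a simplification, not the study's code.

def grade_final_answer(predicted, expected):
    # Standard benchmark scoring: a single comparison that says nothing
    # about how the answer was produced.
    return predicted == expected

def grade_trace(moves, n_disks):
    # Stricter scoring: simulate the Tower of Hanoi board and reject the
    # trace at the first illegal move, even if the final state looks right.
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False                          # moving from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                          # larger disk onto smaller
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))

print(grade_trace([("A", "B"), ("A", "C"), ("B", "C")], 2))  # True
print(grade_trace([("A", "C"), ("A", "C")], 2))              # False: illegal move
```

Under final-answer grading, a model that reaches the right end state through memorized fragments scores the same as one that reasons its way there; step-level verification separates the two.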

By examining performance across scaled versions of classic puzzles like Tower of Hanoi and river crossing problems, the researchers documented how model performance degrades dramatically as problem complexity increases. This pattern held true even for models specifically designed to simulate reasoning through chain-of-thought processes.

The study contributes to growing scientific discussion about the fundamental nature of AI capabilities. While these systems demonstrate remarkable performance on many tasks, the research suggests their underlying mechanisms may differ significantly from human-style reasoning. This has important implications for how we develop, evaluate, and ultimately trust AI systems in critical applications.


(Source: Ars Technica)

Topics

AI reasoning capabilities · pattern-matching AI · large reasoning models (LRMs) · AI performance on novel problems · evaluating AI reasoning · classic logic puzzles in AI testing · chain-of-thought reasoning · AI mathematical proofs · human-style reasoning vs. AI · trust in AI systems
