Thinking Isn't an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations
Zhao Song, Song Yue, Jiahao Zhang

TL;DR
This paper demonstrates that large reasoning models, when augmented with tools like Python interpreters and scratchpads, outperform non-reasoning models on complex reasoning tasks, challenging the idea that reasoning is an illusion.
Contribution
It shows that tool augmentation significantly enhances the reasoning capabilities of large reasoning models, reversing previous findings that questioned their effectiveness.
Findings
LRMs outperform non-reasoning models with tool use across all task complexities
Tool augmentations like Python interpreters improve reasoning performance
Challenges the narrative that reasoning processes are ineffective
Abstract
Large Reasoning Models (LRMs) have become a central focus in today's large language model (LLM) research, where models are designed to output a step-by-step thinking process before arriving at a final answer to handle complex reasoning tasks. Despite their promise, recent empirical studies (e.g., [Shojaee et al., 2025] from Apple) suggest that this thinking process may not actually enhance reasoning ability, where LLMs without explicit reasoning actually outperform LRMs on tasks with low or high complexity. In this work, we revisit these findings and investigate whether the limitations of LRMs persist when tool augmentations are introduced. We incorporate two types of tools, Python interpreters and scratchpads, and evaluate three representative LLMs and their LRM counterparts on Apple's benchmark reasoning puzzles. Our results show that, with proper tool use, LRMs consistently…
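To make the Python-interpreter idea concrete, here is a minimal sketch of the kind of program a tool-augmented model could emit instead of writing out every move in text. It assumes Tower of Hanoi, one of the puzzles in the Shojaee et al. (2025) benchmark. The function names `hanoi_moves` and `is_valid_solution` are hypothetical, and the code does not reproduce the paper's prompts or tool harness.

```python
def hanoi_moves(n, source=0, target=2, spare=1):
    """Yield the optimal move sequence for n disks as (disk, from_peg, to_peg)."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, move disk n, then restack.
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (n, source, target)
    yield from hanoi_moves(n - 1, spare, target, source)


def is_valid_solution(n, moves):
    """Simulate the moves and check that every step is legal and the goal is reached."""
    # Each peg is a stack; the last element is the top disk. Disk n is the largest.
    pegs = [list(range(n, 0, -1)), [], []]
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk cannot sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[0] == [] and pegs[1] == [] and pegs[2] == list(range(n, 0, -1))


if __name__ == "__main__":
    moves = list(hanoi_moves(10))
    print(len(moves), is_valid_solution(10, moves))  # prints: 1023 True
```

The optimal solution has 2^n - 1 moves, so its length doubles with each extra disk. A short program stays the same size as n grows, whereas a move-by-move text answer does not.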
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper provides a timely and relevant rebuttal to the "thinking as illusion" narrative, supported by experimentation. 2. The introduction of a scratchpad mechanism and Python interpreter integration offers a practical and scalable approach to evaluating long-horizon reasoning.
1. Insufficient Experimental Analysis: The results are largely descriptive and lack in-depth analysis—e.g., why certain tools work better than others, or why some tasks remain unsolvable. 2. Writing Quality: While understandable, the writing is often dense and could benefit from clearer transitions, better structuring, and more engaging exposition.
1. The paper provides empirical evidence for a problem that is of interest to the community: comparing the performance of reasoning and non-reasoning LLMs. 2. The paper is well written and the results are clearly presented. Overall, the message conveyed by this paper is clear (though I do not believe it has been proven to be generalisable or of significant technical depth).
1. The experiment does not cover enough models. Only the DeepSeek and Qwen-3 LLM families are tested. It is unclear to me whether the result is applicable to other reasoning models, especially closed-weight models such as Gemini, GPT, and Claude. 2. The technical contribution of the proposed benchmark is weak. Tool-augmented LLMs have long been deployed, and a fair amount of empirical evidence has shown that tool usage improves model performance. So the results in this paper do not look su
1. The authors raise an interesting point that is not covered in Apple's thinking-illusion benchmark, and show that LLMs with tool usage could weaken the previous work's claim. 2. The paper is easy to read and follow, and the contents are well organized by (sub)section. 3. If the authors' claim is true, then it will be a big contribution to the community's understanding of reasoning models' behavior.
1. The argument in the paragraph at lines 68-77 is not convincing to me. It needs more evidence and justification to be logically sound. 1.1 The token limit constraint is applied equally to both LLMs and LRMs, but can we argue that only the LRMs' performance is mistakenly measured? (Couldn't the LLMs' performance also increase when there is no token limit?) 1.2 What is the actual percentage of LRM failures caused by the token limit? -- If this percentage is low, then it is hard to believe
The paper is generally well-written and easy to follow. The experimental setup and key findings are clearly presented, and the design choices appear reasonable. The authors conduct controlled ablations between reasoning and non-reasoning models, use benchmarks with well-defined complexity (directly adapted from [Shojaee et al., 2025] though), and incorporate practical tools such as a Python interpreter and scratchpad. Overall, these settings provide a reasonable foundation for the study’s discus
W1. Misinterpretation of prior work (Shojaee et al., 2025). The paper partially misinterprets and oversimplifies the conclusions of Shojaee et al. (2025). In that work, the authors did not primarily focus on comparing standard LLMs and LRMs. Rather, their key claim was that the reasoning traces produced by LRMs are not consistently reliable across tasks with varying computational complexity—LRMs tend to perform well at medium-level complexity, worse than standard LLMs on low-complexity tasks, a
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling
