Understanding Tool-Integrated Reasoning
Heng Lin, Zhongwen Xu

TL;DR
This paper gives a formal proof that Tool-Integrated Reasoning (TIR) expands the capabilities of Large Language Models: tools strictly enlarge the model's empirical and feasible support and unlock problem-solving strategies unavailable to pure-text models. Experiments on mathematical benchmarks show significant performance improvements.
Contribution
It introduces a formal theoretical framework explaining why TIR improves LLM reasoning and proposes Advantage Shaping Policy Optimization (ASPO), a novel algorithm that modifies the advantage function directly to guide tool use.
Findings
TIR models significantly outperform pure-text models on mathematical benchmarks.
Tools enable models to solve problems whose pure-text solutions are impossible or intractably verbose.
ASPO improves tool usage and interactive reasoning in LLMs.
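The support-expansion finding can be written as a strict-inclusion statement. The block below is only a sketch under assumed notation: the symbols $\pi_{\text{text}}$, $\pi_{\text{TIR}}$, the prompt $x$, the trajectory $y$, and the token budget $B$ are illustrative choices, not definitions taken from the paper.

```latex
% Assumed notation (hypothetical, for illustration only):
%   \pi(y \mid x) : probability that policy \pi emits trajectory y for prompt x
%   |y|           : number of tokens in y
%   B             : a fixed token budget
\[
  \operatorname{supp}(\pi) = \{\, y : \pi(y \mid x) > 0 \,\},
  \qquad
  \operatorname{supp}_B(\pi) = \{\, y \in \operatorname{supp}(\pi) : |y| \le B \,\}
\]
% Claim in the spirit of the summary: every trajectory reachable in pure text
% stays reachable with tools, while some tool-using trajectories have no
% pure-text counterpart within the same budget.
\[
  \operatorname{supp}_B(\pi_{\text{text}}) \subsetneq \operatorname{supp}_B(\pi_{\text{TIR}})
\]
```

Read this way, "impossible" corresponds to trajectories outside the pure-text support altogether, and "intractably verbose" to trajectories whose pure-text form exceeds the budget $B$.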
Abstract
We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide the policy behavior. We conduct comprehensive experiments on challenging…
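To make the idea of modifying the advantage function directly more concrete, here is a minimal Python sketch. It is not the paper's ASPO formula, which this page does not give. It assumes a group-normalized advantage as the baseline and adds a clipped bonus for a desired behavior. The function names, the `behavior_scores` input, and the `scale` and `clip_ratio` parameters are all hypothetical.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards):
    """Standardize rewards within one group of sampled responses.

    Assumption: a group-relative baseline; the paper may use another one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def shape_advantages(advantages, behavior_scores, scale=0.5, clip_ratio=0.5):
    """Add a bounded behavior bonus to each positive advantage.

    behavior_scores is a hypothetical per-response signal, for example
    +1 when the tool is called early and -1 when it is called late.
    The bonus is clipped to +/- clip_ratio * advantage, so with
    clip_ratio < 1 a positive advantage can never change sign.
    Non-positive advantages are left untouched.
    """
    shaped = []
    for adv, score in zip(advantages, behavior_scores):
        if adv <= 0:
            shaped.append(adv)
            continue
        bound = clip_ratio * adv
        bonus = max(-bound, min(bound, scale * score))
        shaped.append(adv + bonus)
    return shaped


# Toy group: two correct responses (reward 1) and two incorrect ones (reward 0).
rewards = [1.0, 1.0, 0.0, 0.0]
behavior_scores = [1.0, -1.0, 1.0, -1.0]

advantages = group_normalized_advantages(rewards)
shaped = shape_advantages(advantages, behavior_scores)
print(advantages)
print(shaped)
```

In this toy group, the two correct responses start with the same advantage; after shaping, the one with the preferred behavior is ranked higher, while both stay positive and the incorrect responses are unchanged. Because the bonus is bounded relative to the correctness-driven advantage, the behavior preference cannot override the correctness signal, which is one plausible way to guide behavior without destabilizing training.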
Peer Reviews
Decision·Submitted to ICLR 2026
ASPO, a new approach, is proposed to handle the practical limitations of existing methods for promoting tool usage in the model. The paper provides the first formal proof explaining why Tool-Integrated Reasoning (TIR) works, demonstrating that it strictly expands an LLM's "empirical" and "feasible" support to overcome the "invisible leash" of pure-text models. It advocates a paradigm shift in viewing LLMs: not as monolithic problem-solvers, but as core reasoning engines that intelligently delegate computatio
The paper's experiments, while strong, are confined to a single tool (Python interpreter), a single problem domain (mathematical reasoning), and a single base model (Qwen3-8B). This limited scope means the conclusions about the universal benefits of TIR, its generalizability to other tools (like search engines) or domains, and the robustness of the ASPO algorithm across different model families and scales are not fully demonstrated. The authors may want to state this limitation explicitly, and
Overall, I am a bit ambivalent about this paper. I believe that the experimental section is strong, and in particular, the results showing that using a Python interpreter strongly improves the reasoning abilities of an LLM are interesting (with the small caveat that I believe all the problems considered in the benchmarks have a numerical answer, and are hence probably more amenable to reasoning with an interpreter than problems that have a symbolic answer. In particular, I am wondering whether t
On the other hand, I am really not convinced by the theoretical results, and I believe that they are a big weakness of the paper. Here are my main concerns with these results. First, I do not believe that the theorems and proofs are "formal" results. For example, some terms are not defined properly, making the theorems vague (e.g., what are "non trivial algorithmic problems"?). While I understand the general ideas behind these theorems, I do not believe that they are correct under the current ass
- The theoretical analysis in Section 3, despite the practicality concerns discussed in the Weaknesses, motivates considering TIR as a general reasoning framework, especially with respect to token efficiency.
- The proposed RL method (ASPO) not only enhances downstream performance but also elicits cognitive patterns essential for establishing TIR as a general reasoning framework rather than a mere computational aid.
- The experiments are also thoroughly designed and conducted not only to ver
- In Section 3.1.2, the proof of strictness (suggesting that tool augmentation can handle a strictly larger class of problems than text-only LLMs) relies solely on random oracle problems. However, random oracles are not representative of the kinds of problems that LLM-based reasoners are intended to solve. Consequently, it remains uncertain whether the identified strictness region meaningfully extends to practical reasoning or real-world computational tasks.
- In Section 3.2.2, the proof of stri
