TL;DR
This paper investigates how tool use in large language models can lead to reasoning failures, introducing the concept of Tool-Induced Myopia (TIM) and proposing a framework to improve reasoning with tools.
Contribution
It characterizes TIM as a new failure mode in tool-augmented language models and develops a framework to mitigate reasoning degradation caused by tool use.
Findings
Tool use raises final-answer accuracy (by up to 19.3 percentage points) but degrades reasoning coherence.
As tool use increases, model errors shift from arithmetic mistakes to reasoning failures.
The proposed framework improves both accuracy and reasoning depth.
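The accuracy findings above are stated in percentage points, meaning an absolute difference between two accuracies rather than a relative improvement. The sketch below shows how such a gain might be computed for a tool-enabled run and a non-tool run over the same problems. The function names and the five-problem toy data are hypothetical and are not taken from the paper or from PYMATH.

```python
def accuracy(correct_flags):
    """Return the fraction of problems answered correctly."""
    return sum(correct_flags) / len(correct_flags)


def percentage_point_gain(base_flags, tool_flags):
    """Return the absolute accuracy difference, in percentage points."""
    return 100.0 * (accuracy(tool_flags) - accuracy(base_flags))


# Hypothetical per-problem correctness on the same five problems
# (illustrative toy data, not results from the paper).
base_run = [True, False, False, True, False]  # model without tools
tool_run = [True, True, False, True, True]    # model with a code interpreter

gain = percentage_point_gain(base_run, tool_run)
print(f"{gain:.1f} percentage points")
```

A gain computed this way reflects only final answers. It says nothing about whether the reasoning behind them is coherent, which is the gap the findings point to.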
Abstract
Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning…
