ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, and Jay-Yoon Lee

TL;DR
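ToolMATH is a diagnostic benchmark that evaluates language models on long-horizon tool use in math problems while tool availability and catalog difficulty are varied in a controlled way, probing how models cope with missing gold tools, with distractors, and with logically connected tool-call chains.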
Contribution
ToolMATH introduces a controllable, graded tool environment and behavior-conditioned metrics for analyzing model adaptability, robustness, and long-horizon tool chaining.
Findings
Models exhibit varying profiles: reliable tool use, avoidance, or adaptive substitution.
ToolMATH reveals how models handle distractors and long tool-call chains.
Behavior-conditioned metrics enable detailed diagnostic evaluation.
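The listing does not define the behavior-conditioned metrics, so the following is only a minimal sketch under one plausible reading: final-answer accuracy is reported separately for each observed behavior rather than as a single aggregate. The behavior labels ("used_tools", "no_tools"), the record layout, and the function name accuracy_by_behavior are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record is (behavior_label, answer_was_correct).
# The labels are illustrative assumptions, not the paper's categories.
Record = Tuple[str, bool]


def accuracy_by_behavior(records: List[Record]) -> Dict[str, float]:
    """Return final-answer accuracy computed separately per behavior label."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for behavior, is_correct in records:
        totals[behavior] += 1
        correct[behavior] += int(is_correct)
    return {label: correct[label] / totals[label] for label in totals}


# Made-up toy records, for illustration only.
example_records: List[Record] = [
    ("used_tools", True),
    ("used_tools", False),
    ("no_tools", True),
]
example_result = accuracy_by_behavior(example_records)
```

Splitting accuracy this way separates a model that succeeds by calling tools from one that succeeds while avoiding them, a distinction that a single accuracy number hides.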
Abstract
We introduce ToolMATH, a math-grounded diagnostic benchmark for evaluating long-horizon tool use under controllable tool-catalog conditions. ToolMATH converts stepwise MATH solutions into reusable Python tools with natural-language descriptions and typed schemas, and pairs each problem with a tool environment requiring sequential tool use, intermediate-output reuse, and logically connected tool-call chains. ToolMATH controls tool availability and catalog difficulty by constructing gold tools and graded distractors with varying similarity to gold tools. ToolMATH also incorporates behavior-conditioned metrics, enabling diagnostic evaluation beyond final accuracy. Building on these measurements, ToolMATH emphasizes three evaluation axes: (1) Adaptability measures how much Gold-only success is retained when gold tools are replaced entirely by distractors; (2) Robustness…
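To make the setup in the abstract concrete, here is a minimal sketch of a tool with a natural-language description and a typed schema, a catalog built under different conditions, and a retention ratio for Adaptability. Everything named here is an assumption made for illustration: the Tool dataclass, the example tools, the condition labels other than Gold-only, and the choice to compute retention as distractor-only accuracy divided by Gold-only accuracy. The paper's actual data format and formula may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    """Hypothetical tool record: description, typed schema, and callable."""
    name: str
    description: str
    schema: Dict[str, str]  # parameter name -> type name
    fn: Callable[..., float]
    is_gold: bool


def triangle_area(base: float, height: float) -> float:
    return 0.5 * base * height


def rectangle_area(width: float, height: float) -> float:
    return width * height


# One gold tool and one similar-looking distractor (both invented here).
GOLD = Tool("triangle_area", "Area of a triangle from base and height.",
            {"base": "float", "height": "float"}, triangle_area, True)
DISTRACTOR = Tool("rectangle_area", "Area of a rectangle from width and height.",
                  {"width": "float", "height": "float"}, rectangle_area, False)


def build_catalog(condition: str) -> List[Tool]:
    """Return the tools visible to the model under an assumed condition label."""
    if condition == "gold_only":
        return [GOLD]
    if condition == "distractor_only":  # gold tools replaced entirely
        return [DISTRACTOR]
    if condition == "gold_plus_distractors":
        return [GOLD, DISTRACTOR]
    raise ValueError(f"unknown condition: {condition}")


def retention(gold_only_acc: float, distractor_only_acc: float) -> float:
    """Assumed Adaptability proxy: share of Gold-only accuracy that is retained."""
    if gold_only_acc == 0:
        return 0.0
    return distractor_only_acc / gold_only_acc
```

With made-up accuracies of 0.8 under Gold-only and 0.4 under the distractor-only condition, retention(0.8, 0.4) gives 0.5, meaning half of the Gold-only success survives the loss of the gold tools.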
Taxonomy
Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Logic, Reasoning, and Knowledge
