ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, and Jay-Yoon Lee

arXiv:2602.21265 · cs.CL · May 19, 2026

TL;DR

ToolMATH is a diagnostic benchmark designed to evaluate language models' ability to perform long-horizon tool use in math problems, considering tool availability, robustness, and connectivity.

Contribution

It introduces a controllable, graded tool environment with behavior-conditioned metrics to analyze model adaptability, robustness, and long-term tool chaining capabilities.

Findings

01. Models exhibit varying profiles: reliable tool use, avoidance, or adaptive substitution.

02. ToolMATH reveals how models handle distractors and long tool-call chains.

03. Behavior-conditioned metrics enable detailed diagnostic evaluation.

Abstract

We introduce ToolMATH, a math-grounded diagnostic benchmark for evaluating long-horizon tool use under controllable tool-catalog conditions. ToolMATH converts stepwise MATH solutions into reusable Python tools with natural-language descriptions and typed schemas, and pairs each problem with a tool environment requiring sequential tool use, intermediate-output reuse, and logically connected tool-call chains. ToolMATH controls tool availability and catalog difficulty by constructing gold tools and graded distractors with varying similarity to gold tools. ToolMATH also incorporates behavior-conditioned metrics, enabling diagnostic evaluation beyond final accuracy. Building on these measurements, ToolMATH emphasizes three evaluation axes: (1) Adaptability measures how much Gold-only success is retained when gold tools are replaced entirely by distractors; (2) Robustness…
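To make the setup in the abstract concrete, the following is a minimal sketch of a tool with a description and typed schema, a catalog built under different availability conditions, and one plausible reading of the Adaptability axis. It is not the authors' implementation: the `Tool` class, the schema format, the example tools, the condition names other than Gold-only, and the exact retention formula (fraction of Gold-only successes that are still solved with distractors only) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    """Hypothetical tool record: a Python callable plus description and typed schema."""
    name: str
    description: str
    parameters: Dict[str, str]  # parameter name -> type name (assumed schema format)
    func: Callable[..., float]
    is_gold: bool = False


def triangle_area(base: float, height: float) -> float:
    return 0.5 * base * height


def rectangle_area(width: float, height: float) -> float:
    return width * height


# Illustrative gold tool and a similar-looking distractor (both invented here).
gold_tool = Tool(
    "triangle_area", "Area of a triangle from base and height.",
    {"base": "number", "height": "number"}, triangle_area, is_gold=True,
)
distractor_tool = Tool(
    "rectangle_area", "Area of a rectangle from width and height.",
    {"width": "number", "height": "number"}, rectangle_area,
)


def build_catalog(gold: List[Tool], distractors: List[Tool], condition: str) -> List[Tool]:
    """Return the tools exposed to the model under an assumed catalog condition."""
    if condition == "gold_only":
        return list(gold)
    if condition == "distractor_only":
        return list(distractors)
    if condition == "gold_plus_distractors":
        return list(gold) + list(distractors)
    raise ValueError(f"unknown condition: {condition}")


def adaptability(gold_only_correct: List[bool], distractor_only_correct: List[bool]) -> float:
    """Assumed retention ratio: among problems solved in the Gold-only condition,
    the fraction still solved when gold tools are replaced by distractors."""
    solved = [i for i, ok in enumerate(gold_only_correct) if ok]
    if not solved:
        return math.nan  # no Gold-only successes to retain
    retained = sum(1 for i in solved if distractor_only_correct[i])
    return retained / len(solved)


score = adaptability([True, True, False, True], [True, False, True, False])
```

In this sketch, three problems are solved in the Gold-only condition and one of them remains solved with distractors only, so `score` is 1/3. Conditioning on Gold-only success, rather than comparing raw accuracies, is one way to read "how much Gold-only success is retained"; the paper's own definition may differ.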

Taxonomy

Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Logic, Reasoning, and Knowledge