TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

Yize Li, Junzhi Li, Jason Song, Chuxiong Sun, Rui Wang, and Changwen Zheng

arXiv:2605.09544 · cs.AI · May 12, 2026

TL;DR

TIDE-Bench is a comprehensive benchmark designed to evaluate tool-integrated reasoning in large language models across diverse, challenging tasks with a focus on efficiency and diagnostic insights.

Contribution

TIDE-Bench introduces a unified, multi-faceted evaluation framework with new tasks, comprehensive metrics, and high-quality datasets to advance tool-integrated reasoning research.

Findings

01 Models show persistent challenges in tool grounding.

02 TIDE-Bench reduces evaluation costs while maintaining discriminative power.

03 Experiments reveal bottlenecks in multi-tool coordination.

Abstract

Tool-integrated reasoning (TIR) has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality, unified evaluation benchmark, and existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we introduce TIDE-Bench, a holistic and efficient benchmark for evaluating TIR methods, featuring three key advantages. First, it provides diverse task settings, combining widely used mathematical reasoning and knowledge-intensive QA tasks with two newly designed tasks, the tool-grounded experimental design task and the dynamic interactive task, to probe models' abilities in complex tool invocation and multi-tool coordination. Second, TIDE-Bench adopts a comprehensive yet task-aware…
