TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
Yize Li, Junzhi Li, Jason Song, Chuxiong Sun, Rui Wang, and Changwen Zheng

TL;DR
TIDE-Bench is a comprehensive benchmark designed to evaluate tool-integrated reasoning in large language models across diverse, challenging tasks with a focus on efficiency and diagnostic insights.
Contribution
The paper introduces a unified, multi-faceted evaluation framework with new tasks, comprehensive metrics, and high-quality datasets to advance tool-integrated reasoning research.
Findings
Models show persistent challenges in tool grounding.
TIDE-Bench reduces evaluation costs while maintaining discriminative power.
Experiments reveal bottlenecks in multi-tool coordination.
Abstract
Tool-integrated reasoning (TIR) has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality and unified evaluation benchmark, and existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we introduce TIDE-Bench, a holistic and efficient benchmark for evaluating TIR methods, featuring three key advantages. First, it provides diverse task settings, combining widely used mathematical reasoning and knowledge-intensive QA tasks with two newly designed tasks, namely the tool-grounded experimental design task and the dynamic interactive task, to probe models' abilities in complex tool invocation and multi-tool coordination. Second, TIDE-Bench adopts a comprehensive yet task-aware…
