ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs

Shirley Kokane; Ming Zhu; Tulika Awalgaonkar; Jianguo Zhang; Thai Hoang; Akshara Prabhakar; Zuxin Liu; Tian Lan; Liangwei Yang; Juntao Tan; Rithesh Murthy; Weiran Yao; Zhiwei Liu; Juan Carlos Niebles; Huan Wang; Shelby Heinecke; Caiming Xiong; Silvio Savarese

arXiv:2411.13547·cs.SE·June 27, 2025

Open Access

TL;DR

ToolScan is a new benchmark designed to identify and analyze specific error patterns in LLMs during tool-use tasks, providing detailed insights beyond simple success rates to improve error mitigation.

Contribution

Introduces TOOLSCAN, a benchmark dataset that characterizes seven new error patterns in LLMs' tool-use outputs, enhancing error analysis capabilities.

Findings

01. LLMs exhibit diverse error patterns in tool-use tasks.

02. ToolScan reveals error patterns not captured by success rates.

03. Insights from ToolScan can guide error mitigation strategies.

Abstract

Evaluating Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the outputs of LLMs propagate to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically report only a success rate, without any explanation of the failure cases. To solve this problem, we introduce TOOLSCAN, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark dataset comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using TOOLSCAN, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use these insights from TOOLSCAN to guide…

Taxonomy

Topics: Digital Rights Management and Security

Methods: Sparse Evolutionary Training