CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

Shiting Huang; Zhen Fang; Zehui Chen; Siyu Yuan; Junjie Ye; Yu Zeng; Lin Chen; Qi Mao; Feng Zhao

arXiv:2506.13977 · cs.SE · June 18, 2025

TL;DR

This paper introduces CRITICTOOL, a benchmark for evaluating large language models' ability to handle errors in tool-using scenarios, emphasizing error diagnosis, recovery, and reflection capabilities.

Contribution

It presents a novel benchmark with an evolutionary dataset construction strategy to assess and analyze LLMs' tool error handling and reflection abilities.

Findings

01. CRITICTOOL effectively reflects real-world tool errors.
02. Reflection ability varies from one LLM to another.
03. The benchmark construction strategy generalizes well across tasks.

Abstract

The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on it, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building upon a novel evolutionary strategy for dataset construction, CRITICTOOL holds diverse tool-use errors with varying complexities, which better reflects…

Code & Models

Repositories

shellorley0513/critictool (Official)

Videos

CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques