CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
Shiting Huang, Zhen Fang, Zehui Chen, Siyu Yuan, Junjie Ye, Yu Zeng, Lin Chen, Qi Mao, Feng Zhao

TL;DR
This paper introduces CRITICTOOL, a benchmark for evaluating large language models' ability to handle errors in tool-using scenarios, emphasizing error diagnosis, recovery, and reflection capabilities.
Contribution
It presents a novel benchmark with an evolutionary dataset construction strategy to assess and analyze LLMs' tool error handling and reflection abilities.
Findings
CRITICTOOL effectively reflects real-world tool errors.
LLMs show varied reflection abilities across different models.
The benchmark strategy generalizes well across tasks.
Abstract
The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on it, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building upon a novel evolutionary strategy for dataset construction, CRITICTOOL holds diverse tool-use errors with varying complexities, which better reflects…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsSoftware Engineering Research · Topic Modeling · Natural Language Processing Techniques
