ACEBench: Who Wins the Match Point in Tool Usage?

Chen Chen; Xinlong Hao; Weiwen Liu; Xu Huang; Xingshan Zeng; Shuai Yu; Dexun Li; Shuai Wang; Weinan Gan; Yuefeng Huang; Wulong Liu; Xinzhi Wang; Defu Lian; Baoqun Yin; Yasheng Wang; Wu Liu

arXiv:2501.12851 · cs.CL · November 21, 2025

TL;DR

ACEBench is a comprehensive benchmark designed to evaluate large language models' tool usage across diverse scenarios, including multi-turn dialogues, addressing limitations of previous benchmarks.

Contribution

It introduces a new multi-type evaluation framework for LLMs' tool usage, covering normal, special, and agent-based multi-turn interactions.

Findings

1. ACEBench enables detailed analysis of LLMs' tool usage performance.

2. It reveals specific error patterns across different evaluation types.

3. The benchmark facilitates more realistic assessment of LLMs in complex scenarios.

Abstract

Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with…
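The abstract contrasts ACEBench with benchmarks that depend on an LLM judge or on real API execution. As a rough illustration of what evaluation without either can look like, the sketch below scores predicted tool calls by exact comparison against reference calls. This is an assumption made for illustration, not the scoring procedure described in the paper; `ToolCall`, `call_matches`, `accuracy`, and the example tools `get_weather` and `book_flight` are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """A single tool invocation: a tool name plus keyword arguments (hypothetical type)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def call_matches(predicted: ToolCall, reference: ToolCall) -> bool:
    """Return True when the tool name and every argument match exactly."""
    return (
        predicted.name == reference.name
        and predicted.arguments == reference.arguments
    )


def accuracy(predictions: list[ToolCall], references: list[ToolCall]) -> float:
    """Fraction of predicted calls that exactly match their reference call."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(call_matches(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative data only: these tools and arguments are not taken from ACEBench.
references = [
    ToolCall("get_weather", {"city": "Paris", "unit": "celsius"}),
    ToolCall("book_flight", {"origin": "SFO", "destination": "JFK"}),
]
predictions = [
    ToolCall("get_weather", {"city": "Paris", "unit": "celsius"}),
    ToolCall("book_flight", {"origin": "SFO", "destination": "LAX"}),  # wrong argument
]

score = accuracy(predictions, references)
print(score)  # 0.5
```

Exact matching is cheap and deterministic, but it treats equivalent argument values (for example, differently formatted dates) as mismatches; a real benchmark would likely need a more tolerant comparison.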

Taxonomy

Topics: Data Mining Algorithms and Applications · Natural Language Processing Techniques · Intelligent Tutoring Systems and Adaptive Learning