HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios

Jun Wang; Jiamu Zhou; Muning Wen; Xiaoyun Mo; Haoyu Zhang; Qiqiang Lin; Cheng Jin; Xihuai Wang; Weinan Zhang; Qiuying Peng; Jun Wang

arXiv:2412.16516 · cs.CL · February 18, 2025

TL;DR

HammerBench is a comprehensive benchmark framework designed to evaluate large language models' function-calling abilities in realistic multi-turn mobile assistant scenarios, addressing the complexity of user interactions and external information use.

Contribution

We introduce HammerBench, a new benchmark with detailed metrics and datasets for assessing LLMs' function-calling in real-world mobile assistant dialogues, including diverse interaction challenges.

Findings

01 Different types of parameter-name errors significantly affect performance.

02 Performance varies across interaction scenarios, which highlights robustness issues.

03 HammerBench effectively reveals key failure modes in LLM function calling.

Abstract

Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark framework for assessing LLMs' function-calling capabilities in real-world, multi-turn dialogues. HammerBench simulates diverse mobile assistant use cases, incorporating imperfect instructions, dynamic question-answer trajectories, intent and argument shifts, and the indirect use of external information through pronouns. To construct this benchmark, we curate a comprehensive dataset derived from popular mobile app functionalities and anonymized user logs, complemented by a cost-effective data generation pipeline leveraging open-source models. HammerBench is further augmented with fine-grained interaction snapshots and metrics, enabling detailed evaluation…
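To make the idea of fine-grained function-call evaluation more concrete, the following is a minimal Python sketch, not HammerBench's actual metric implementation. It assumes a call can be represented as a function name plus a dictionary of arguments, and it separates parameter-name errors (hallucinated or missing names) from value errors. The names `FunctionCall` and `compare_calls`, and the example `set_alarm` function, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionCall:
    """Hypothetical representation of one call (not HammerBench's schema)."""
    name: str
    arguments: dict = field(default_factory=dict)


def compare_calls(predicted: FunctionCall, reference: FunctionCall) -> dict:
    """Compare a predicted call with a reference call and report error types."""
    pred_names = set(predicted.arguments)
    ref_names = set(reference.arguments)
    shared = pred_names & ref_names

    # Parameter names the model produced that the reference does not contain.
    hallucinated = sorted(pred_names - ref_names)
    # Parameter names the reference expects that the model omitted.
    missing = sorted(ref_names - pred_names)
    # Names that match but whose values differ.
    wrong_values = sorted(
        n for n in shared if predicted.arguments[n] != reference.arguments[n]
    )

    function_ok = predicted.name == reference.name
    return {
        "function_name_correct": function_ok,
        "hallucinated_params": hallucinated,
        "missing_params": missing,
        "wrong_values": wrong_values,
        "exact_match": function_ok
        and not (hallucinated or missing or wrong_values),
    }


# Illustrative data only: the model used "title" where the reference expects "label".
reference = FunctionCall("set_alarm", {"time": "07:30", "label": "gym"})
predicted = FunctionCall("set_alarm", {"time": "07:30", "title": "gym"})
report = compare_calls(predicted, reference)
print(report)
```

In this example the function name and the `time` value are correct, yet the sketch reports one hallucinated parameter name (`title`) and one missing name (`label`), so the call does not count as an exact match. Separating the error types in this way is one possible reading of "fine-grained"; the benchmark's own metrics may be defined differently.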

Code & Models

Repositories

madeagents/hammerbench (official)

Datasets

MadeAgents/HammerBench (dataset, 141 downloads)

Taxonomy

Topics: Green IT and Sustainability · Context-Aware Activity Recognition Systems · Interactive and Immersive Displays