ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

Junjie Ye; Zhengyin Du; Xuesong Yao; Weijian Lin; Yufei Xu; Zehui Chen; Zaiyuan Wang; Sining Zhu; Zhiheng Xi; Siyu Yuan; Tao Gui; Qi Zhang; Xuanjing Huang; Jiecao Chen

arXiv:2501.02506 · cs.CL · May 21, 2025

TL;DR

ToolHop is a benchmark dataset for evaluating the multi-hop tool-use capabilities of large language models. It reveals significant challenges and variability among models: the leading model, GPT-4o, reaches 49.04% accuracy.

Contribution

We introduce ToolHop, a comprehensive, query-driven dataset for rigorous evaluation of multi-hop tool use in large language models, including diverse queries, meaningful dependencies, and verifiable answers.

Findings

01. GPT-4o achieves 49.04% accuracy on ToolHop.

02. Significant challenges remain for LLMs in multi-hop tool use.

03. Tool-use strategies vary across model families.

Abstract

Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an…
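To make the idea of multi-hop tool use concrete, here is a minimal sketch of how a chain of locally executable tools might be run and checked against a verifiable answer. It is an illustration only, not ToolHop's actual harness or data format: the tool functions, the `$prev` placeholder convention, `run_chain`, and `is_correct` are all hypothetical names, and the lookup table is toy data.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical local tools; a real benchmark would ship its own tool code.
def birth_year_of(person: str) -> int:
    toy_table = {"Ada Lovelace": 1815}  # toy data for illustration
    if person not in toy_table:
        raise ValueError(f"no record for {person!r}")
    return toy_table[person]

def add_years(year: int, delta: int) -> int:
    return year + delta

TOOLS: Dict[str, Callable[..., Any]] = {
    "birth_year_of": birth_year_of,
    "add_years": add_years,
}

Step = Tuple[str, List[Any]]  # (tool name, argument list)

def run_chain(steps: List[Step], tools: Dict[str, Callable[..., Any]]) -> Tuple[bool, Any]:
    """Run tool calls in order; "$prev" is replaced by the previous hop's output.

    Returns (True, final_value) on success, or (False, feedback_message)
    as soon as any hop fails.
    """
    prev: Any = None
    for name, args in steps:
        resolved = [prev if isinstance(a, str) and a == "$prev" else a for a in args]
        try:
            prev = tools[name](*resolved)
        except Exception as exc:
            # Return the failure as feedback instead of crashing the run.
            return False, f"{name} failed: {exc}"
    return True, prev

def is_correct(result: Tuple[bool, Any], gold: Any) -> bool:
    """Assumed scoring rule: exact match on the string form of the answer."""
    ok, value = result
    return ok and str(value) == str(gold)

# Two hops: the second call depends on the output of the first.
chain: List[Step] = [
    ("birth_year_of", ["Ada Lovelace"]),
    ("add_years", ["$prev", 100]),
]
print(run_chain(chain, TOOLS))  # (True, 1915)
```

The second hop cannot be answered without the first hop's result, which is the kind of interdependency the abstract describes. Exact-match scoring is an assumption made here for simplicity.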

Code & Models

Datasets

bytedance-research/ToolHop
dataset· 624 dl

Videos

ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques