MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
Pei Wang, Yanan Wu, Zekun Wang, Jiaheng Liu, Xiaoshuai Song, Zhongyuan, Peng, Ken Deng, Chenchen Zhang, Jiakai Wang, Junran Peng, Ge Zhang, Hangyu, Guo, Zhaoxiang Zhang, Wenbo Su, Bo Zheng

TL;DR
MTU-Bench is a comprehensive, multi-granularity benchmark for evaluating large language models' tool-use capabilities across diverse scenarios without relying on costly external evaluations.
Contribution
The paper introduces MTU-Bench, a novel multi-granularity tool-use benchmark covering diverse scenarios and evaluation metrics based solely on prediction results, addressing limitations of existing datasets.
Findings
Demonstrates effectiveness of MTU-Bench through extensive experiments.
Covers five distinct tool usage scenarios.
Provides an instruction dataset to enhance LLM tool-use abilities.
Abstract
Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have the following limitations: (1). Insufficient evaluation scenarios (e.g., only cover limited tool-use scenes). (2). Extensive evaluation costs (e.g., GPT API costs). To address these limitations, in this work, we propose a multi-granularity tool-use benchmark for large language models called MTU-Bench. For the "multi-granularity" property, our MTU-Bench covers five tool usage scenes (i.e., single-turn and single-tool, single-turn and multiple-tool, multiple-turn and single-tool, multiple-turn and multiple-tool, and out-of-distribution tasks). Besides, all evaluation metrics of our MTU-Bench are based on the prediction results and the ground…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Cosine Annealing · Linear Layer · Multi-Head Attention · Dense Connections · Residual Connection · Dropout · Layer Normalization · Linear Warmup With Cosine Annealing · Adam
