MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

Pei Wang; Yanan Wu; Zekun Wang; Jiaheng Liu; Xiaoshuai Song; Zhongyuan Peng; Ken Deng; Chenchen Zhang; Jiakai Wang; Junran Peng; Ge Zhang; Hangyu Guo; Zhaoxiang Zhang; Wenbo Su; Bo Zheng

arXiv:2410.11710·cs.CL·October 16, 2024

TL;DR

MTU-Bench is a comprehensive, multi-granularity benchmark for evaluating large language models' tool-use capabilities across diverse scenarios without relying on costly external evaluations.

Contribution

The paper introduces MTU-Bench, a multi-granularity tool-use benchmark that covers diverse scenarios and uses evaluation metrics based solely on prediction results, addressing the limited scenario coverage and high evaluation cost of existing datasets.

Findings

01. Demonstrates effectiveness of MTU-Bench through extensive experiments.

02. Covers five distinct tool usage scenarios.

03. Provides an instruction dataset to enhance LLM tool-use abilities.

Abstract

Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. However, existing datasets have the following limitations: (1) insufficient evaluation scenarios (e.g., they cover only limited tool-use scenes) and (2) extensive evaluation costs (e.g., GPT API costs). To address these limitations, in this work, we propose a multi-granularity tool-use benchmark for large language models called MTU-Bench. For the "multi-granularity" property, our MTU-Bench covers five tool usage scenes (i.e., single-turn and single-tool, single-turn and multiple-tool, multiple-turn and single-tool, multiple-turn and multiple-tool, and out-of-distribution tasks). Besides, all evaluation metrics of our MTU-Bench are based on the prediction results and the ground…
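
The four in-distribution scenes listed in the abstract vary along two axes: the number of dialogue turns and the number of tools involved. The minimal Python sketch below illustrates that split, together with one simple way a metric could rely only on a prediction and a reference annotation, with no external judge model. The names `ToolCall`, `scene`, and `exact_match` are hypothetical and are not taken from the MTU-Bench code; the benchmark's actual metrics and data format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical representation of one tool invocation."""
    name: str
    arguments: dict = field(default_factory=dict)


def scene(num_turns: int, num_tools: int) -> str:
    """Label a sample with one of the four in-distribution scenes.

    Assumption: a sample is 'single' on an axis when its count is 1
    and 'multiple' otherwise.
    """
    if num_turns < 1 or num_tools < 1:
        raise ValueError("a sample needs at least one turn and one tool")
    turn_part = "single-turn" if num_turns == 1 else "multiple-turn"
    tool_part = "single-tool" if num_tools == 1 else "multiple-tool"
    return f"{turn_part} and {tool_part}"


def exact_match(predicted: list[ToolCall], reference: list[ToolCall]) -> bool:
    """Return True when the predicted calls equal the reference calls.

    Dataclass equality compares tool names and arguments, and list
    equality also requires the same order and length.
    """
    return predicted == reference
```

For example, `scene(1, 2)` yields the label "single-turn and multiple-tool". Exact match is only one possible comparison; a looser metric could ignore call order or score arguments separately.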

Code & Models

Repositories

mtu-bench-team/mtu-bench
Official

Datasets

wpei/MTU-Bench
dataset · 31 downloads

Videos

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Cosine Annealing · Linear Layer · Multi-Head Attention · Dense Connections · Residual Connection · Dropout · Layer Normalization · Linear Warmup With Cosine Annealing · Adam