WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models

Kangyun Ning; Yisong Su; Xueqiang Lv; Yuanzhe Zhang; Jian Liu; Kang Liu; Jinan Xu

arXiv:2407.12823 · cs.CL · July 19, 2024 · 1 citation

TL;DR

This paper introduces WTU-Eval, a benchmark for evaluating whether large language models can accurately decide when to use external tools, highlighting their struggles and improvements through fine-tuning.

Contribution

The paper presents WTU-Eval, a new benchmark with datasets to assess LLMs' ability to decide on tool usage, and demonstrates how fine-tuning improves their decision-making accuracy.

Findings

01

LLMs often struggle to decide whether a tool is needed on general datasets.

02

Performance on tool-usage datasets improves when an LLM's ability is similar to ChatGPT's.

03

Fine-tuning Llama2-7B reduces incorrect tool usage by 16.8%.

Abstract

Although Large Language Models (LLMs) excel in NLP tasks, they still need external tools to extend their ability. Current research on tool learning with LLMs often assumes mandatory tool use, which does not always align with real-world situations, where the necessity for tools is uncertain, and incorrect or unnecessary use of tools can damage the general abilities of LLMs. Therefore, we propose to explore whether LLMs can discern their ability boundaries and use tools flexibly. We then introduce the Whether-or-not tool usage Evaluation benchmark (WTU-Eval) to assess LLMs with eleven datasets, where six of them are tool-usage datasets, and five are general datasets. LLMs are prompted to use tools according to their needs. The results of eight LLMs on WTU-Eval reveal that LLMs frequently struggle to determine tool use in general datasets, and LLMs' performance in tool-usage datasets…
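The split between tool-usage and general datasets described in the abstract suggests two simple quantities to track per dataset kind: how often a model chooses to call a tool, and how often its final answer is correct. The sketch below is only an illustration under stated assumptions; the `Record` type, the `"general"`/`"tool"` labels, and both helper functions are hypothetical and are not the paper's metric definitions or data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One evaluated example (hypothetical schema, not from the paper)."""
    dataset_kind: str  # assumed labels: "general" or "tool"
    used_tool: bool    # whether the model chose to call a tool
    correct: bool      # whether the final answer was judged correct


def tool_call_rate(records: List[Record], kind: str) -> float:
    """Fraction of examples of the given kind where a tool was called."""
    subset = [r for r in records if r.dataset_kind == kind]
    if not subset:
        return 0.0
    return sum(r.used_tool for r in subset) / len(subset)


def accuracy(records: List[Record], kind: str) -> float:
    """Fraction of examples of the given kind answered correctly."""
    subset = [r for r in records if r.dataset_kind == kind]
    if not subset:
        return 0.0
    return sum(r.correct for r in subset) / len(subset)


# Toy data, invented purely for illustration.
records = [
    Record("general", False, True),
    Record("general", True, False),
    Record("general", True, True),
    Record("general", False, True),
    Record("tool", True, True),
    Record("tool", False, False),
]

general_rate = tool_call_rate(records, "general")
tool_acc = accuracy(records, "tool")
print(f"tool-call rate on general data: {general_rate:.2f}")
print(f"accuracy on tool-usage data: {tool_acc:.2f}")
```

A high tool-call rate on general datasets combined with lower accuracy there would be one way to surface the unnecessary tool use the abstract warns about, though the benchmark itself may measure this differently.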

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: ALIGN