ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models

Yuxiang Zhang; Jing Chen; Junjie Wang; Yaxin Liu; Cheng Yang; Chufan Shi; Xinyu Zhu; Zihao Lin; Hanwen Wan; Yujiu Yang; Tetsuya Sakai; Tian Feng; Hayato Yamana

arXiv:2406.20015·cs.CL·October 7, 2024

TL;DR

The paper introduces ToolBH, a benchmark that evaluates hallucination in tool-augmented large language models across multiple diagnostic levels and toolset scenarios. The results show that current models find the benchmark challenging.

Contribution

The paper presents the first multi-level, scenario-based diagnostic benchmark for assessing hallucinations in tool-augmented LLMs, including a detailed evaluation framework and analysis.

Findings

01. Current models score below 50 on the benchmark.

02. Model performance is influenced by training data and response strategies.

03. Open-weight models perform worse with verbose replies.

Abstract

Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. Due to the lack of benchmarks, the community has yet to fully understand the hallucination issues within these models. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. Specifically, we assess the LLM's hallucinations through two perspectives: depth and breadth. In terms of depth, we propose a multi-level diagnostic process, including (1) solvability detection, (2) solution planning, and (3) missing-tool analysis. For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools. Furthermore, we developed seven tasks and collected 700 evaluation samples through multiple rounds of manual annotation. The results show the significant challenges presented by the…
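The two evaluation axes described in the abstract can be pictured as a small grid. The following is a minimal sketch, not the benchmark's actual schema: the class names, the `DiagnosticCell` container, and the `diagnostic_grid` helper are all hypothetical, and it assumes every diagnostic level applies to every toolset scenario, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class DiagnosticLevel(Enum):
    """Depth axis: the three diagnostic levels named in the abstract."""
    SOLVABILITY_DETECTION = 1
    SOLUTION_PLANNING = 2
    MISSING_TOOL_ANALYSIS = 3


class ToolsetScenario(Enum):
    """Breadth axis: the three toolset scenarios named in the abstract."""
    MISSING_NECESSARY_TOOLS = "missing necessary tools"
    POTENTIAL_TOOLS = "potential tools"
    LIMITED_FUNCTIONALITY_TOOLS = "limited functionality tools"


@dataclass(frozen=True)
class DiagnosticCell:
    """Hypothetical container for one (level, scenario) combination."""
    level: DiagnosticLevel
    scenario: ToolsetScenario


def diagnostic_grid() -> list[DiagnosticCell]:
    """Enumerate every level/scenario pair (assumed to be a full cross product)."""
    return [
        DiagnosticCell(level, scenario)
        for level, scenario in product(DiagnosticLevel, ToolsetScenario)
    ]


if __name__ == "__main__":
    for cell in diagnostic_grid():
        print(f"Level {cell.level.value} ({cell.level.name}) x {cell.scenario.value}")
```

Under that assumption, the grid has nine cells (three levels by three scenarios). How the seven tasks and 700 samples map onto those cells is not recoverable from the abstract, so the sketch leaves it out.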

Code & Models

Repositories

toolbehonest/toolbehonest
Official

Datasets

Joelzhang/ToolBeHonest
dataset · 60 downloads

Videos

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models · Underline

Taxonomy

Topics: Machine Learning in Healthcare