HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild

Zhiying Zhu; Yiming Yang; Zhiqing Sun

arXiv:2403.04307 · cs.CL · September 17, 2024 · 3 cites

TL;DR

HaluEval-Wild is a new benchmark designed to evaluate hallucinations of large language models in real-world, dynamic user interactions, using adversarially filtered queries from ShareGPT to analyze hallucination types and rates.

Contribution

This paper introduces HaluEval-Wild, the first benchmark specifically targeting LLM hallucinations in real-world settings with a detailed categorization and analysis approach.

Findings

01 Categorized the collected user queries into five distinct types, enabling fine-grained analysis of the hallucinations LLMs exhibit.

02 Evaluated hallucination rates across various LLMs using real-world queries.

03 Provided insights for improving LLM reliability in practical scenarios.

Abstract

Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging (adversarially filtered by Alpaca) user queries from ShareGPT, an existing real-world user-LLM interaction dataset, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, which enables a fine-grained analysis of the types of hallucinations LLMs exhibit, and…
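As a rough illustration of what a per-type hallucination rate means here, the sketch below computes the fraction of responses flagged as hallucinated within each query type. It is a minimal sketch under stated assumptions, not the paper's evaluation code: the function name `hallucination_rates`, the placeholder type labels `type_a` and `type_b`, and the sample labels are all hypothetical, and the abstract as shown does not say how a response is judged to be a hallucination.

```python
from collections import defaultdict


def hallucination_rates(records):
    """Return {query_type: fraction of responses flagged as hallucinated}.

    `records` is an iterable of (query_type, is_hallucinated) pairs.
    Both the record layout and the labels are assumptions for illustration.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for query_type, is_hallucinated in records:
        totals[query_type] += 1
        if is_hallucinated:
            flagged[query_type] += 1
    # Every key in `totals` has a count of at least 1, so no division by zero.
    return {t: flagged[t] / totals[t] for t in totals}


# Hypothetical judgments; the type names are placeholders, not the paper's categories.
records = [
    ("type_a", True),
    ("type_a", False),
    ("type_a", True),
    ("type_b", False),
    ("type_b", False),
    ("type_b", True),
]
rates = hallucination_rates(records)
for query_type, rate in sorted(rates.items()):
    print(f"{query_type}: {rate:.2f}")
```

Reporting the rate per query type rather than one pooled number is what makes the analysis fine-grained: a model can look acceptable overall while failing most of the time on one type.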

Code & Models

Repositories

halueval-wild/halueval-wild (Official)

Taxonomy

Topics: Machine Learning in Healthcare · Mental Health via Writing · Computational and Text Analysis Methods

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Dropout · Softmax · Dense Connections · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection