HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild
Zhiying Zhu, Yiming Yang, Zhiqing Sun

TL;DR
HaluEval-Wild is a benchmark for evaluating hallucinations of large language models in real-world, dynamic user interactions. It uses challenging, adversarially filtered user queries from ShareGPT to analyze query types and the hallucination rates of different models.
Contribution
This paper introduces HaluEval-Wild, the first benchmark that specifically targets LLM hallucinations in real-world user interactions. It pairs the benchmark with a fine-grained categorization of the collected queries to support detailed analysis.
Findings
Categorized challenging user queries into five distinct types, enabling a fine-grained analysis of the hallucinations LLMs exhibit.
Evaluated hallucination rates across various LLMs using real-world queries.
Provided insights for improving LLM reliability in practical scenarios.
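Hallucination rates like those in the findings are commonly reported as the fraction of a model's responses judged to contain a hallucination. The sketch below is a minimal illustration under that assumption. It takes one boolean verdict per response from some judge and uses placeholder query-type labels (`type_1`, `type_2`) rather than the paper's five categories. `Judgment` and `hallucination_rates` are hypothetical names, not the authors' code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One judged response (hypothetical record format)."""
    model: str          # which LLM produced the response
    query_type: str     # category of the user query (placeholder labels here)
    hallucinated: bool  # verdict from some hallucination judge (assumed given)


def hallucination_rates(judgments):
    """Map model -> {query_type: rate, "overall": rate}, where
    rate = hallucinated responses / total responses."""
    # counts[model][key] = [hallucinated_count, total_count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for j in judgments:
        # Each response counts toward its own query type and the model's overall rate.
        for key in (j.query_type, "overall"):
            cell = counts[j.model][key]
            cell[0] += int(j.hallucinated)
            cell[1] += 1
    return {
        model: {key: h / n for key, (h, n) in per_key.items()}
        for model, per_key in counts.items()
    }


# Toy data: two models, two placeholder query types.
judgments = [
    Judgment("model_a", "type_1", True),
    Judgment("model_a", "type_1", False),
    Judgment("model_a", "type_2", True),
    Judgment("model_b", "type_1", False),
]
rates = hallucination_rates(judgments)
print(rates["model_a"])
```

Reporting the per-type breakdown next to the overall rate mirrors the fine-grained analysis the abstract describes. Note that a real query type named "overall" would collide with the aggregate key in this sketch.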
Abstract
Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging (adversarially filtered by Alpaca) user queries from ShareGPT, an existing real-world user-LLM interaction dataset, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, which enables a fine-grained analysis of the types of hallucinations LLMs exhibit, and…
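The abstract states only that the queries were "adversarially filtered by Alpaca". The sketch below illustrates one plausible reading: a query is kept when a weaker reference model (Alpaca, per the abstract) gives an answer that a judge flags as hallucinated. `adversarial_filter`, `weak_model`, and `is_hallucinated` are hypothetical names standing in for the model call and the hallucination judge. This is not the authors' pipeline, and the toy stand-ins are canned strings, not real model output.

```python
from typing import Callable, Iterable, List


def adversarial_filter(
    queries: Iterable[str],
    weak_model: Callable[[str], str],
    is_hallucinated: Callable[[str, str], bool],
) -> List[str]:
    """Keep only the queries whose weak-model answer is judged hallucinated."""
    kept = []
    for query in queries:
        answer = weak_model(query)          # e.g. an Alpaca-style model (assumed)
        if is_hallucinated(query, answer):  # hypothetical hallucination judge
            kept.append(query)
    return kept


# Toy stand-ins: canned answers instead of a real model, and a marker-based judge.
canned_answers = {
    "What is the capital of France?": "Paris.",
    "Summarize chapter 12 of my unpublished novel.": "[fabricated] Chapter 12 follows a detective in Oslo.",
}
kept = adversarial_filter(
    canned_answers,  # iterating a dict yields its keys, i.e. the queries
    weak_model=canned_answers.__getitem__,
    is_hallucinated=lambda q, a: a.startswith("[fabricated]"),
)
print(kept)
```

Passing the model and the judge in as callables keeps the filtering loop independent of how either is implemented, so the same loop could wrap an API-backed model and either a human or an LLM-based judge.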
Taxonomy
Topics: Machine Learning in Healthcare · Mental Health via Writing · Computational and Text Analysis Methods
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Dropout · Softmax · Dense Connections · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection
