HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Abhilasha Ravichander, Shrusti Ghela, David Wadden, Yejin Choi

TL;DR
This paper introduces HALoGEN, a comprehensive benchmark with automatic verifiers to measure hallucinations in large language models across multiple domains, revealing high hallucination rates even in top models.
Contribution
The work provides a new benchmark and automatic verification framework for quantifying and analyzing hallucinations in LLMs, enabling more trustworthy model development.
Findings
Up to 86% hallucination rate in some models
Automatic verifiers effectively decompose and verify model outputs
Hallucinations stem from incorrect recollection of training data, incorrect knowledge in the training data, or fabrication
Abstract
Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are…
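The decompose-then-verify idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`hallucination_rate`, `extract_imports`), the toy `KNOWN_PACKAGES` set standing in for a knowledge source, and the choice of imported package names as the atomic units are all assumptions made for this example.

```python
from typing import Callable, List

def hallucination_rate(
    generation: str,
    decompose: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Fraction of atomic units in a generation that the verifier cannot support."""
    units = decompose(generation)
    if not units:
        # Nothing verifiable was produced, so nothing can be counted as hallucinated.
        return 0.0
    unsupported = sum(1 for unit in units if not is_supported(unit))
    return unsupported / len(units)

# Hypothetical knowledge source: a tiny stand-in for a real package index.
KNOWN_PACKAGES = {"numpy", "pandas", "requests"}

def extract_imports(code: str) -> List[str]:
    """Treat each imported top-level package name as one atomic unit (assumed)."""
    packages = []
    for line in code.splitlines():
        parts = line.strip().split()
        if len(parts) >= 2 and parts[0] in ("import", "from"):
            # Keep only the top-level package, e.g. "pandas.io" -> "pandas".
            packages.append(parts[1].rstrip(",").split(".")[0])
    return packages

generated_code = (
    "import numpy as np\n"
    "import fastmathlib\n"
    "from pandas import DataFrame\n"
)

rate = hallucination_rate(
    generated_code,
    decompose=extract_imports,
    is_supported=lambda pkg: pkg in KNOWN_PACKAGES,
)
print(f"{rate:.2f}")  # one of three imported packages is not in the toy index
```

Passing the decomposer and the verifier in as functions keeps the scoring logic independent of the domain, which mirrors the abstract's description of a separate verifier per use case.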
Taxonomy
Topics: Mental Health and Psychiatry · Biofield Effects and Biophysics · Hallucinations in medical conditions
