A Survey of Automatic Hallucination Evaluation on Natural Language Generation
Siya Qi, Lin Gui, Yulan He, Zheng Yuan

TL;DR
This survey systematically analyzes 105 methods for automatic hallucination evaluation in natural language generation, highlighting current limitations and proposing a structured framework and future directions to improve model trustworthiness.
Contribution
It provides a comprehensive taxonomy and framework for evaluating hallucinations in LLMs, addressing fragmentation and guiding future research directions.
Findings
77.1% of the 105 surveyed methods specifically target LLMs
Identified fundamental limitations in current approaches
Proposed strategic directions for future evaluation systems
Abstract
The rapid advancement of Large Language Models (LLMs) has brought a pressing challenge: how to reliably assess hallucinations to guarantee model trustworthiness. Although Automatic Hallucination Evaluation (AHE) has become an indispensable component of this effort, the field remains fragmented in its methodologies, limiting both conceptual clarity and practical progress. This survey addresses this critical gap through a systematic analysis of 105 evaluation methods, revealing that 77.1% specifically target LLMs, a paradigm shift that demands new evaluation frameworks. We formulate a structured framework to organize the field, based on a survey of foundational datasets and benchmarks and a taxonomy of evaluation methodologies, which together systematically document the evolution from pre-LLM to post-LLM approaches. Beyond taxonomical organization, we identify fundamental limitations in…
Taxonomy
Topics: Misinformation and Its Impacts · Digital Mental Health Interventions
