The Case for Repeatable, Open, and Expert-Grounded Hallucination Benchmarks in Large Language Models

Justin D. Norman; Michael U. Rivera; D. Alex Hughes

arXiv:2505.17345·cs.CL·November 6, 2025

TL;DR

This paper advocates for standardized, open benchmarks to measure hallucinations in large language models, emphasizing the importance of expert involvement for valid and useful evaluation metrics.

Contribution

The paper introduces a taxonomy of hallucinations and presents a case study showing that hallucination metrics lack validity and practical utility when experts are absent from the early stages of data creation.

Findings

01. Expert involvement improves hallucination metric validity
02. Open benchmarks enable consistent evaluation across models
03. Taxonomy helps categorize different types of hallucinations

Abstract

Plausible, but inaccurate, tokens in model-generated text are widely believed to be pervasive and problematic for the responsible adoption of language models. Despite this concern, there is little scientific work that attempts to measure the prevalence of language model hallucination in a comprehensive way. In this paper, we argue that language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking. We present a taxonomy of hallucinations alongside a case study that demonstrates that when experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility.
