The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs

Denis Janiak; Jakub Binkowski; Albert Sawczyn; Bogdan Gabrys; Ravid Shwartz-Ziv; Tomasz Kajdanowicz

arXiv:2508.08285·cs.CL·August 15, 2025

TL;DR

This paper critically examines current hallucination detection methods in large language models, revealing that prevalent evaluation metrics like ROUGE are misleading and advocating for more semantically aware assessment frameworks.

Contribution

The paper provides a human-centered evaluation of hallucination detection methods, exposes flaws in the metrics currently used to assess them, and argues for semantically aligned evaluation approaches.

Findings

1. ROUGE exhibits high recall but very low precision in hallucination detection.

2. Detection performance drops significantly when evaluated with human-aligned metrics.

3. Simple heuristics like response length can match complex detection methods.
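
The response-length finding can be made concrete with a small sketch. It assumes word count as the detector score and AUROC as the evaluation metric; both are illustrative choices, not details confirmed on this page. The responses and labels are invented toy data, so the resulting number says nothing about real performance.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def length_score(response):
    """Hypothetical baseline: longer responses get a higher hallucination score."""
    return len(response.split())


# Invented toy data (1 = hallucinated, 0 = correct); for illustration only.
responses = [
    "Paris",
    "The capital of France is Paris.",
    "It is widely believed that the capital is Lyon, which has served as the seat of government",
    "Canberra",
    "The capital of Australia is Sydney, the largest city on the east coast of the country",
]
is_hallucination = [0, 0, 1, 0, 1]

scores = [length_score(r) for r in responses]
length_auroc = auroc(scores, is_hallucination)
print(f"AUROC of the length baseline on toy data: {length_auroc:.2f}")
```

Running the same `auroc` function on a learned detector's scores gives a like-for-like comparison against the length baseline.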

Abstract

Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates. In fact, several established detection methods show performance drops of up to 45.9% when assessed using human-aligned metrics like LLM-as-Judge. Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware and robust evaluation frameworks is essential…
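
To illustrate why a lexical-overlap metric can disagree with human judgment, here is a minimal sketch of a ROUGE-L style score used as a correctness labeler. It assumes whitespace tokenization, a plain F1 over the longest common subsequence, and a cutoff of 0.3; the tokenization, the cutoff, and the helper names are assumptions made for illustration, not details taken from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


THRESHOLD = 0.3  # assumed cutoff, chosen only for this example


def labeled_correct(candidate, reference, threshold=THRESHOLD):
    """Label a response as correct when its overlap score reaches the cutoff."""
    return rouge_l_f1(candidate, reference) >= threshold


reference = "Paris"
verbose_answer = "The capital city of France is Paris"

print(rouge_l_f1(verbose_answer, reference))       # overlap score of a correct but wordy answer
print(labeled_correct(verbose_answer, reference))  # label assigned by the overlap rule
```

In this toy case the wordy answer is factually right, yet its overlap score falls below the assumed cutoff, so the rule labels it incorrect. A semantically aware judge would not penalize the extra words.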
