PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
Khizar Hussain, Murat Kantarcioglu

TL;DR
This paper critically examines hallucination detection in large language models, revealing that many reported successes are due to dataset artifacts and introducing DRIFT as a more reliable detection method.
Contribution
The study exposes benchmark artifacts affecting hallucination detection evaluation and proposes DRIFT, a supervised probe, as a more genuine detection approach.
Findings
Much of the reported detection success is due to dataset artifacts, not model understanding.
Many established baselines perform near chance when artifacts are controlled.
DRIFT and SAPLMA are effective supervised probes for detection.
Abstract
Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that this problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline we call TxTemb exploits this to achieve near-perfect detection scores without any access to model internals. To measure what genuine detection capability remains once these artifacts are controlled, we conduct a large-scale evaluation spanning…
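To make the artifact described in the abstract concrete, here is a minimal sketch of how a text-only similarity score could separate faithful from hallucinated responses when the prompt already contains the reference answer. It is an illustration under stated assumptions, not the authors' TxTemb implementation: bag-of-words cosine similarity stands in for whatever text representation the paper uses, and the function names (`bag_of_words`, `cosine_similarity`, `leakage_score`) and the example strings are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def leakage_score(prompt, response):
    """Hypothetical score: higher means the response overlaps less with the prompt.

    Assumption: the prompt embeds the ground-truth answer, so a response that
    diverges from the prompt text is more likely to be labeled a hallucination.
    No model internals are used.
    """
    return 1.0 - cosine_similarity(bag_of_words(prompt), bag_of_words(response))


# Made-up example in which the prompt contains the reference answer.
prompt = "Context: The Eiffel Tower is in Paris. Question: Where is the Eiffel Tower?"
faithful = "The Eiffel Tower is in Paris."
hallucinated = "It stands in Rome near the Colosseum."

score_faithful = leakage_score(prompt, faithful)
score_hallucinated = leakage_score(prompt, hallucinated)
print(score_faithful < score_hallucinated)
```

In this toy case the hallucinated response scores higher only because it shares fewer words with the prompt. That is the kind of shortcut a benchmark should rule out before crediting a detector with reading model internals.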
