TL;DR
This paper critically evaluates existing hallucination detection benchmarks for LLMs, identifies key gaps, and introduces a new RAG-based benchmark, TRIVIA+, with realistic noise and long-context samples to improve evaluation.
Contribution
It establishes desiderata for hallucination detection benchmarks, creates the comprehensive TRIVIA+ benchmark, and provides new insights into detector performance and noise impact.
Findings
Current detectors have significant room for improvement.
The LLM-as-a-Judge baseline performs competitively.
Label noise negatively affects detection performance.
Abstract
Hallucination, broadly referring to unfaithful, fabricated, or inconsistent content generated by LLMs, has wide-ranging implications. Therefore, a large body of effort has been devoted to detecting LLM hallucinations, as well as designing benchmark datasets for evaluating these detectors. In this work, we first establish a set of desiderata: properties that hallucination detection benchmarks (HDBs) should exhibit for effective evaluation. A critical look at existing HDBs through the lens of our desiderata reveals that none of them exhibits all the properties. We identify the two largest gaps: (1) RAG-based grounded benchmarks with long context are severely lacking (partly because length impedes human annotation); and (2) existing benchmarks do not make available realistic label noise for stress-testing detectors, although real-world use cases often grapple with label noise due to human or…
