Sanity Checks for Long-Form Hallucination Detection
Geigh Zollicoffer, Minh Vu, Hongli Zhan, Raymond Li, Manish Bhattarai

TL;DR
This paper introduces a methodology for testing whether hallucination detectors for language models evaluate the reasoning trace itself or merely exploit surface cues tied to the final answer, and shows that simple lexical features can remain effective once those cues are controlled for.
Contribution
It proposes controlled-invariance tests to evaluate hallucination detectors and introduces TRACT, a lightweight lexical feature-based scorer that is robust and competitive.
Findings
Controlled-invariance tests reveal reliance on answer artifacts.
TRACT achieves strong robustness with simple lexical features.
Effective detection does not necessarily require complex models.
Abstract
Hallucination detection methods for large language models increasingly operate on chain-of-thought reasoning traces, yet it remains unclear whether they evaluate the reasoning itself or merely exploit surface correlates of the final answer. We introduce a controlled-invariance methodology that exposes this distinction through two oracle tests: Force, which replaces each response's final answer with the ground truth while preserving the reasoning trace, and Remove, which strips answer-announcement steps while leaving the trajectory intact. These tests reveal whether a detector's predictive power derives from answer-level artifacts rather than from the structure or validity of intermediate reasoning. We further show that once these artifacts are controlled for, effective detection does not necessarily require complex learned representations: TRACT, a lightweight scorer built on lexical…
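The two oracle tests described in the abstract can be pictured as simple transformations of a response. The sketch below is illustrative only and is not the authors' implementation: the `Response` container, the `ANSWER_STEP` pattern used to spot answer-announcement steps, and the `score_shift` helper are all hypothetical names and assumptions introduced here, and the paper may define steps and answer announcements differently.

```python
import re
from dataclasses import dataclass, replace
from typing import Callable, List

# Assumption: an "answer-announcement step" is any step containing one of
# these phrases. The actual criterion used in the paper is not given here.
ANSWER_STEP = re.compile(r"\b(final answer|the answer is|answer\s*:)", re.IGNORECASE)


@dataclass(frozen=True)
class Response:
    """Hypothetical container: a reasoning trace plus its final answer."""
    steps: List[str]
    final_answer: str


def force(resp: Response, ground_truth: str) -> Response:
    """Force: swap in the ground-truth answer, keep the reasoning trace as is."""
    return replace(resp, final_answer=ground_truth)


def remove(resp: Response) -> Response:
    """Remove: drop answer-announcement steps, keep the rest of the trajectory."""
    kept = [s for s in resp.steps if not ANSWER_STEP.search(s)]
    return replace(resp, steps=kept)


def score_shift(detector: Callable[[Response], float],
                original: Response, perturbed: Response) -> float:
    """Change in a detector's score caused by a perturbation."""
    return detector(perturbed) - detector(original)


if __name__ == "__main__":
    resp = Response(
        steps=["3 * 4 = 12", "Therefore, the answer is 13."],
        final_answer="13",
    )
    print(force(resp, "12"))
    print(remove(resp))
```

Under these assumptions, a detector whose score moves sharply under `force` or `remove` is plausibly reading answer-level cues, since neither perturbation alters the intermediate reasoning steps.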
