Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing
Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy

TL;DR
This study empirically evaluates activation-based linear probing for hallucination detection in autoregressive language models. It finds that probes are useful for flagging likely hallucinations before generation, but that steering along the probe direction does not correct them.
Contribution
It provides an empirical analysis across seven models (117M to 7B parameters; GPT-2, Pythia, and Qwen-2.5 families), documenting an asymmetry between what activation probes can detect and what probe-derived steering can correct.
Findings
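Linear probes detect hallucination signals with above-chance accuracy in larger models, but activation steering along the probe-derived direction fails to correct hallucinations in all 7 models tested.
Output-confidence baselines outperform activation probes on raw detection AUC for every model above 410M parameters, with the gap reaching 0.157 AUC for Pythia-6.9B.
Probe signals are accessible at position zero, before any output tokens are produced, enabling pre-generation flagging.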
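As a rough illustration of what "linear probe" and "detection AUC" mean in these findings, the following is a minimal sketch, not the authors' implementation. It assumes synthetic activations in place of real position-zero hidden states; the names `fit_linear_probe` and `auc`, the dimensions, and the class separation are all hypothetical choices made for this example.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U statistic); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Synthetic stand-in for hidden states (assumption: real activations
# would come from a model's residual stream at position zero).
rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.integers(0, 2, size=n)            # 1 = hallucinated, 0 = not
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + 2.0 * np.outer(y, direction)

train, test = slice(0, 300), slice(300, None)
w, b = fit_linear_probe(X[train], y[train])
probe_auc = auc(X[test] @ w + b, y[test])
print(f"held-out probe AUC: {probe_auc:.3f}")
```

An output-confidence baseline would be scored with the same `auc` function, using a confidence-derived score per example instead of the probe logit.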
Abstract
Activation-based linear probing is widely proposed as a method for both detecting and correcting hallucinations in autoregressive language models. We present an empirical study across seven models spanning 117M to 7B parameters and three architecture families (GPT-2, Pythia, Qwen-2.5) that documents a robust asymmetry: linear probes can detect hallucination signals with above-chance accuracy in larger models, but activation steering along the probe-derived direction fails to correct hallucinations in 7 of 7 models tested. We further find that output-confidence baselines outperform activation probes on raw detection AUC at every model above 410M parameters, with the gap reaching 0.157 AUC for Pythia-6.9B. The probe's distinguishing value is therefore not detection accuracy but temporal positioning: probe signals are accessible at position zero (before any output tokens are produced),…
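For readers unfamiliar with activation steering, the operation can be sketched as a shift of the hidden state along the unit-normalized probe weight vector. This is a minimal sketch under stated assumptions: the function name `steer`, the scale `alpha`, and the random arrays are hypothetical, and the paper's exact steering procedure is not given in this summary.

```python
import numpy as np

def steer(hidden, probe_w, alpha):
    """Shift hidden states by alpha along the unit-normalized probe direction."""
    unit = probe_w / np.linalg.norm(probe_w)
    return hidden - alpha * unit

# Random stand-ins for hidden states and a trained probe (assumption).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))
probe_w = rng.normal(size=8)

steered = steer(hidden, probe_w, alpha=2.0)

# The probe logit of every row moves by exactly -alpha * ||w||.
shift = (steered - hidden) @ probe_w
print(shift)
```

The sketch shows only the mechanical effect on the probe's own score. Moving that score says nothing about whether the generated text changes, which is the gap between detection and correction that the abstract reports.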
