HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders
Boshui Chen, Zhaoxin Fan, Ke Wang, Zhiying Leng, Faguo Wu, Hongwei Zheng, Yifan Sun, Wenjun Wu

TL;DR
HalluSAE is a novel framework that models hallucinations in large language models as phase transitions in latent dynamics, enabling effective detection through sparse autoencoders and energy landscape analysis.
Contribution
The paper introduces a three-stage, phase transition-inspired approach to hallucination detection, addressing the dynamic and mechanistic aspects that prior methods overlook.
Findings
Achieves state-of-the-art hallucination detection on Gemma-2-9B.
Models hallucinations as critical shifts in latent dynamics.
Utilizes sparse autoencoders and energy metrics for localization and attribution.
Abstract
Large Language Models (LLMs) are powerful and widely adopted, but their practical impact is limited by the well-known hallucination phenomenon. While recent hallucination detection methods have made notable progress, we find that most of them overlook the dynamic nature of hallucination and its underlying mechanisms. To address this gap, we propose HalluSAE, a phase transition-inspired framework that models hallucination as a critical shift in the model's latent dynamics. By modeling the generation process as a trajectory through a potential energy landscape, HalluSAE identifies critical transition zones and attributes factual errors to specific high-energy sparse features. Our approach consists of three stages: (1) Potential Energy Empowered Phase Zone Localization via sparse autoencoders and a geometric potential energy metric; (2) Hallucination-related Sparse Feature Attribution using contrastive logit…
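The first stage described in the abstract (sparse-autoencoder encoding followed by an energy-based search for a transition zone) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the geometric potential energy metric, so `step_energy` below is a hypothetical stand-in (squared distance between consecutive sparse codes), and the encoder is a generic ReLU sparse autoencoder with toy weights. All function names and data are assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Generic ReLU sparse-autoencoder encoder: f = max(0, W x + b)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def step_energy(prev: Vector, curr: Vector) -> float:
    """Hypothetical stand-in metric (NOT the paper's definition):
    squared distance between two consecutive sparse codes."""
    return sum((c - p) ** 2 for p, c in zip(prev, curr))


def locate_transition(hidden_states: List[Vector],
                      w_enc: Matrix, b_enc: Vector) -> int:
    """Return the index of the generation step with the largest energy jump."""
    codes = [sae_encode(h, w_enc, b_enc) for h in hidden_states]
    # energies[i] measures the jump from step i to step i + 1
    energies = [step_energy(a, b) for a, b in zip(codes, codes[1:])]
    return max(range(len(energies)), key=energies.__getitem__) + 1


# Toy encoder weights and a toy trajectory of hidden states (illustrative only)
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
HIDDEN = [[1.0, 0.0], [1.0, 0.1], [-2.0, -2.0], [-2.0, -1.9]]

if __name__ == "__main__":
    print(locate_transition(HIDDEN, W_ENC, B_ENC))
```

In this toy trajectory the sparse code changes abruptly between the second and third hidden states, so the sketch flags the third step as the candidate transition zone. A real pipeline would replace the toy weights with a trained sparse autoencoder and the stand-in metric with the paper's own energy definition.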
