HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders

Boshui Chen; Zhaoxin Fan; Ke Wang; Zhiying Leng; Faguo Wu; Hongwei Zheng; Yifan Sun; Wenjun Wu

arXiv:2604.16430 · cs.CL · April 21, 2026

TL;DR

HalluSAE is a novel framework that models hallucinations in large language models as phase transitions in latent dynamics, enabling effective detection through sparse autoencoders and energy landscape analysis.

Contribution

The paper introduces a phase transition-inspired, three-stage approach to hallucination detection, addressing the dynamic and mechanistic aspects of hallucination that prior methods overlook.

Findings

01. Achieves state-of-the-art hallucination detection on Gemma-2-9B.

02. Models hallucinations as critical shifts in latent dynamics.

03. Uses sparse autoencoders and energy metrics for localization and attribution.

Abstract

Large Language Models (LLMs) are powerful and widely adopted, but their practical impact is limited by the well-known hallucination phenomenon. While recent hallucination detection methods have made notable progress, we find that most of them overlook its dynamic nature and underlying mechanisms. To address this gap, we propose HalluSAE, a phase transition-inspired framework that models hallucination as a critical shift in the model's latent dynamics. By modeling the generation process as a trajectory through a potential energy landscape, HalluSAE identifies critical transition zones and attributes factual errors to specific high-energy sparse features. Our approach consists of three stages: (1) Potential Energy Empowered Phase Zone Localization via sparse autoencoders and a geometric potential energy metric; (2) Hallucination-related Sparse Feature Attribution using contrastive logit…
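
For readers unfamiliar with the building block named in the abstract, the following is a minimal sketch of a generic sparse autoencoder applied to per-token activations. It is not the HalluSAE pipeline: the dimensions, random weights, bias value, and the helper names `sae_encode` and `sae_decode` are all assumptions made for illustration, and the paper's potential energy metric and attribution stages are not reproduced here.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: map dense activations to non-negative feature activations."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Linear decoder: reconstruct the dense activations from the features."""
    return f @ W_dec + b_dec

# Hypothetical sizes and random weights; a real SAE would be trained
# on activations from a specific layer of the language model.
rng = np.random.default_rng(0)
d_model, d_features = 8, 32
W_enc = rng.normal(size=(d_model, d_features))
b_enc = -1.0 * np.ones(d_features)  # negative bias zeroes out weakly driven features
W_dec = rng.normal(size=(d_features, d_model))
b_dec = np.zeros(d_model)

# Stand-in for hidden states at 5 token positions along a generation.
x = rng.normal(size=(5, d_model))

features = sae_encode(x, W_enc, b_enc)       # shape (5, 32), entries >= 0
recon = sae_decode(features, W_dec, b_dec)   # shape (5, 8)

# Indices of the 3 most active features at each token position.
top_features = np.argsort(-features, axis=1)[:, :3]
```

In a setup like this, the per-token feature activations are the quantities a downstream analysis would inspect, for example by tracking which features become strongly active at a given point in the generation.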
