Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations
Wen Luo, Guangyue Peng, Wei Li, Shaohang Wei, Feifan Song, Liang Wang, Nan Yang, Xingxing Zhang, Jing Jin, Furu Wei, Houfeng Wang

TL;DR
This paper uncovers two distinct internal pathways in large language models that encode signals of truthfulness, providing insights for improving hallucination detection and model reliability.
Contribution
It identifies and disentangles two mechanisms of truthfulness encoding in LLMs, revealing their properties and potential for enhancing hallucination detection.
Findings
Two pathways are identified: Question-Anchored and Answer-Anchored.
The two pathways are validated and disentangled through attention knockout and token patching.
Both pathways are closely associated with LLM knowledge boundaries, and the model's internal representations are aware of the distinction between them.
Abstract
Despite their impressive capabilities, large language models (LLMs) frequently generate hallucinations. Previous work shows that their internal states encode rich signals of truthfulness, yet the origins and mechanisms of these signals remain unclear. In this paper, we demonstrate that truthfulness cues arise from two distinct information pathways: (1) a Question-Anchored pathway that depends on question-answer information flow, and (2) an Answer-Anchored pathway that derives self-contained evidence from the generated answer itself. First, we validate and disentangle these pathways through attention knockout and token patching. Afterwards, we uncover notable and intriguing properties of these two mechanisms. Further experiments reveal that (1) the two mechanisms are closely associated with LLM knowledge boundaries; and (2) internal representations are aware of their distinctions.…
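For readers unfamiliar with attention knockout, the following is a minimal sketch of the general idea on a toy single-head causal attention layer: the attention edges from answer positions back to question positions are masked out, so that any signal remaining at the answer positions cannot have come directly from the question. It is an illustration under stated assumptions, not the authors' implementation. The function names (`causal_attention`, `knockout_mask`), the toy dimensions, and the use of random vectors in place of real hidden states are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def knockout_mask(n, src_positions, dst_positions):
    # blocked[i, j] = True means query position i may not attend to key position j.
    # Here: every destination (e.g. answer) token is cut off from every source
    # (e.g. question) token. The two position lists are assumed to be disjoint.
    blocked = np.zeros((n, n), dtype=bool)
    blocked[np.ix_(dst_positions, src_positions)] = True
    return blocked


def causal_attention(q, k, v, blocked=None):
    # Toy single-head scaled dot-product attention with a causal mask.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # hide future positions
    if blocked is not None:
        mask = mask | blocked  # additionally hide the knocked-out edges
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical toy sequence: positions 0-2 are the question, 3-5 the answer.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 4))
question, answer = [0, 1, 2], [3, 4, 5]

baseline_out, _ = causal_attention(hidden, hidden, hidden)
blocked = knockout_mask(6, src_positions=question, dst_positions=answer)
knockout_out, knockout_weights = causal_attention(hidden, hidden, hidden, blocked)

# Answer tokens now place zero attention weight on question tokens.
print(knockout_weights[3:, :3])
```

In a real experiment the mask would be applied inside a trained model's attention layers, and the effect would be measured on a downstream truthfulness probe rather than on raw attention outputs; this sketch only shows the masking step itself.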
