Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
Qiyao Liang, Risto Miikkulainen, Ila Fiete

TL;DR
This paper presents a geometric framework for transformer memory failures: conflict is competition between attractor basins in hidden-state space, hallucination is the absence of a basin, and a margin-based method detects both.
Contribution
It introduces a unified geometric account of memory conflicts and hallucinations in language models, validated through synthetic and natural language experiments.
Findings
Attractor basins explain conflict and hallucination in hidden states.
Margin-based detection outperforms entropy-based methods.
Structural geometric properties persist across model scales.
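To make the contrast between entropy-based and margin-based signals concrete, here is a toy sketch in NumPy. It is not the authors' method: the basin centroids, the two-token head matrix `W`, the example hidden states, and the definition of margin (distance to the second-nearest centroid minus distance to the nearest) are all assumptions chosen for illustration, since this page does not give the paper's definitions.

```python
import numpy as np

def softmax_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    z = logits - logits.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def basin_margin(h, centroids):
    """Return (margin, nearest_distance) for hidden state h.

    Hypothetical definition: margin is the gap between the distances to
    the second-nearest and the nearest centroid. A small margin suggests
    two basins are competing; a large nearest distance suggests that no
    basin is close at all.
    """
    d = np.sort(np.linalg.norm(centroids - h, axis=1))
    return float(d[1] - d[0]), float(d[0])

# Assumed toy setup: two learned facts as basin centroids in a 2-D space,
# and a stand-in "frozen head" that maps hidden states to two token logits.
centroids = np.array([[4.0, 0.0], [-4.0, 0.0]])
W = np.array([[5.0, 0.0], [-5.0, 0.0]])

states = {
    "inside basin": np.array([3.9, 0.1]),   # close to the first centroid
    "conflict": np.array([0.5, 0.0]),       # pulled between both centroids
    "no basin": np.array([3.0, 9.0]),       # far from every centroid
}

results = {}
for name, h in states.items():
    entropy = softmax_entropy(W @ h)
    margin, nearest = basin_margin(h, centroids)
    results[name] = (entropy, margin, nearest)
    print(f"{name:13s} entropy={entropy:.3f} margin={margin:.2f} nearest={nearest:.2f}")
```

In this toy setup the head's output entropy stays low for all three states, while the margin shrinks in the conflict case and the nearest-centroid distance grows in the no-basin case. That is the kind of separation a geometric signal could offer, under the assumptions stated above.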
Abstract
Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes: conflict, when PM and WM disagree and interfere, and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a…
