How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Xiangkun Sun, Lingkai Kong, Aoqi Zhang, Liang Zeng, Tonghan Wang

TL;DR
This paper uncovers a specific, causal attention-head mechanism in language models that explains how they can be persuaded to change factual answers, revealing a narrow, monitorable circuit.
Contribution
It identifies a compact, causal attention-head mechanism responsible for persuasion in LLMs, enabling targeted interventions that steer or block persuasion-induced factual errors.
Findings
A small set of mid-layer attention heads almost entirely determines the model's answer.
Persuasion causes a discrete jump from the correct-answer vertex to the persuasion-target vertex, rather than a reduction in confidence.
Interventions on a rank-one evidence-routing feature can steer or block persuasion.
Abstract
Language models can be persuaded to abandon factual knowledge. This vulnerability is central to AI safety, but its internal mechanism remains poorly understood. We uncover a compact causal mechanism for persuasion-induced factual errors. A small set of mid-layer attention heads almost entirely determines the model's answer. These heads write answer options into a low-dimensional polyhedron, with options occupying distinct vertices. Persuasion does not blur belief or merely reduce confidence; it causes a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. We show that decision heads are not reasoning over evidence. Instead, they copy whichever option token their attention selects. Persuasion works by redirecting attention. We isolate a rank-one evidence-routing feature that controls the route. Directly modifying this feature steers the model's choice, and…
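The copy-and-reroute picture in the abstract can be illustrated with a toy sketch. This is not the paper's model or code: the orthonormal keys, one-hot values, query scale, and the names `head_output`, `chosen_option`, and `ROUTE_DIRECTION` are all hypothetical, and a single direction in query space stands in for the rank-one evidence-routing feature. The head performs no reasoning; it copies the value of whichever option its attention selects, so shifting the query along one direction is enough to move the output from one vertex to another.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy setup: three answer options, each with an
# orthonormal key and a one-hot value (one "vertex" per option).
KEYS = np.eye(3)
VALUES = np.eye(3)
CORRECT, TARGET = 0, 2

# Assumed baseline query: points at the correct option's key.
BASE_QUERY = 4.0 * KEYS[CORRECT]

# Assumed stand-in for the routing feature: a single direction
# that trades attention on the correct option for the target.
ROUTE_DIRECTION = KEYS[TARGET] - KEYS[CORRECT]

def head_output(alpha):
    # Shift the query along the routing direction, then copy the
    # attention-weighted mixture of option values.
    query = BASE_QUERY + alpha * ROUTE_DIRECTION
    attention = softmax(KEYS @ query)
    return attention @ VALUES

def chosen_option(alpha):
    # The answer is the option whose vertex the output is closest to.
    return int(np.argmax(head_output(alpha)))

print(chosen_option(0.0))  # 0: attention stays on the correct option
print(chosen_option(4.0))  # 2: attention is rerouted to the target
```

In this toy, setting `alpha` to zero corresponds to blocking the reroute and a large `alpha` to steering toward the target; whether the real feature behaves this cleanly is a matter for the paper's experiments, not this sketch.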
