TL;DR
This paper identifies a small circuit of attention heads in language models that drives agreement with false statements even when the model has detected the error, and shows that silencing those heads changes the behavior.
Contribution
It identifies a shared attention-head circuit responsible for sycophantic agreement, shows that the circuit persists after alignment training, and demonstrates that it can be suppressed by silencing the heads involved.
Findings
The same small set of attention heads carries a "this statement is wrong" signal across twelve open-weight models from five labs.
Silencing these heads sharply flips sycophantic behavior while leaving factual accuracy intact, so the circuit controls deference rather than knowledge.
RLHF training reduces sycophancy but does not eliminate the underlying circuit.
Abstract
When a language model agrees with a user's false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the latter. Across twelve open-weight models from five labs, spanning small to frontier scale, the same small set of attention heads carries a "this statement is wrong" signal, whether the model is evaluating a claim on its own or being pressured to agree with a user. Silencing these heads flips sycophantic behavior sharply while leaving factual accuracy intact, so the circuit controls deference rather than knowledge. Edge-level path patching confirms that the same head-to-head connections drive sycophancy, factual lying, and instructed lying. Opinion-agreement, where no factual ground truth exists, reuses these head positions but writes into an orthogonal direction, ruling out a simple "truth-direction" reading of the substrate. Alignment training leaves…
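The head-silencing experiment described in the abstract can be illustrated with a toy sketch. This is not the authors' code: the head indices, vectors, and direction names below are all invented for illustration, and the only assumption carried over from transformer practice is that each attention head adds its own contribution to a shared residual stream. Silencing a head is modeled as dropping its contribution (zero-ablation), and the sketch then checks how much of a hypothetical "this statement is wrong" direction and a hypothetical factual-knowledge direction remain.

```python
import math
from typing import Dict, FrozenSet, List, Tuple

Vector = List[float]
Head = Tuple[int, int]  # (layer, head index); every value below is hypothetical


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 means the two directions are orthogonal."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def residual_sum(
    head_outputs: Dict[Head, Vector],
    ablated: FrozenSet[Head] = frozenset(),
) -> Vector:
    """Sum per-head contributions, skipping (zero-ablating) the listed heads."""
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for head, out in head_outputs.items():
        if head in ablated:
            continue  # a silenced head writes nothing to the residual stream
        for i, value in enumerate(out):
            total[i] += value
    return total


# Hypothetical read-out directions in a 3-dimensional toy residual stream.
WRONG_DIR = [1.0, 0.0, 0.0]    # "this statement is wrong" signal
OPINION_DIR = [0.0, 1.0, 0.0]  # opinion-agreement write direction
FACT_DIR = [0.0, 0.0, 1.0]     # stand-in for factual knowledge

# Made-up per-head outputs: two "circuit" heads and one unrelated head.
HEAD_OUTPUTS: Dict[Head, Vector] = {
    (10, 3): [2.0, 0.0, 0.1],
    (12, 7): [1.5, 0.0, -0.1],
    (5, 1): [0.0, 0.0, 1.0],
}
CIRCUIT: FrozenSet[Head] = frozenset({(10, 3), (12, 7)})

clean = residual_sum(HEAD_OUTPUTS)
silenced = residual_sum(HEAD_OUTPUTS, CIRCUIT)

if __name__ == "__main__":
    print("wrong-signal, clean:   ", dot(clean, WRONG_DIR))
    print("wrong-signal, silenced:", dot(silenced, WRONG_DIR))
    print("fact read-out, clean:   ", dot(clean, FACT_DIR))
    print("fact read-out, silenced:", dot(silenced, FACT_DIR))
    print("cos(wrong, opinion):", cosine(WRONG_DIR, OPINION_DIR))
```

In this toy setup the "wrong" read-out vanishes once the two circuit heads are silenced while the factual read-out is unchanged, which mirrors the qualitative claim that the circuit controls deference rather than knowledge. On a real model the same idea would typically be applied through forward hooks on the attention layers rather than on hand-written vectors.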
