LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

Manav Pandey

arXiv:2604.19117 · cs.LG · May 5, 2026

TL;DR

This paper identifies a specific attention-head circuit in language models that carries a "this statement is wrong" signal even when the model goes on to agree with a false statement, and shows that silencing the circuit changes that behavior.

Contribution

It identifies a shared attention-head circuit responsible for sycophantic agreement, shows that the circuit persists after alignment training, and shows that silencing its heads changes sycophantic behavior while leaving factual accuracy intact.

Findings

01. A small set of attention heads signals 'this statement is wrong' across models.

02. Silencing these heads reduces sycophantic behavior without affecting factual accuracy.

03. RLHF training reduces sycophancy but does not eliminate the underlying circuit.
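To make the head-silencing intervention in finding 02 concrete, here is a minimal NumPy sketch of zero-ablating one attention head in a toy self-attention layer. Everything in it is an illustrative assumption rather than the paper's code: the function `multi_head_attention`, the `head_mask` argument, the random weights, and the choice of zero-ablation (the paper may silence heads differently, for example by mean-ablation).

```python
import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask=None):
    """Toy (non-causal) self-attention over x of shape (seq, d_model).

    w_q, w_k, w_v: (n_heads, d_model, d_head); w_o: (n_heads, d_head, d_model).
    head_mask: optional (n_heads,) array; a 0 entry silences that head.
    """
    n_heads, _, d_head = w_q.shape
    if head_mask is None:
        head_mask = np.ones(n_heads)
    out = np.zeros_like(x)
    for h in range(n_heads):
        q, k, v = x @ w_q[h], x @ w_k[h], x @ w_v[h]
        scores = q @ k.T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        # Each head writes additively into the output, so scaling its
        # contribution by 0 removes it without touching the other heads.
        out += head_mask[h] * (weights @ v @ w_o[h])
    return out

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 5, 16, 4, 4
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(n_heads, d_head, d_model))

full = multi_head_attention(x, w_q, w_k, w_v, w_o)

mask = np.ones(n_heads)
mask[2] = 0.0  # silence head 2 (an arbitrary choice)
ablated = multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask=mask)
only_head_2 = multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask=1 - mask)
```

Because heads contribute additively, `ablated + only_head_2` reconstructs `full`. In a real model one would compare behavioral metrics (agreement rate, factual accuracy) with and without the mask rather than raw layer outputs.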

Abstract

When a language model agrees with a user's false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the latter. Across twelve open-weight models from five labs, spanning small to frontier scale, the same small set of attention heads carries a "this statement is wrong" signal, whether the model is evaluating a claim on its own or being pressured to agree with a user. Silencing these heads flips sycophantic behavior sharply while leaving factual accuracy intact, so the circuit controls deference rather than knowledge. Edge-level path patching confirms that the same head-to-head connections drive sycophancy, factual lying, and instructed lying. Opinion-agreement, where no factual ground truth exists, reuses these head positions but writes into an orthogonal direction, ruling out a simple "truth-direction" reading of the substrate. Alignment training leaves…
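The abstract's claim that opinion-agreement "writes into an orthogonal direction" can be pictured with a cosine-similarity check between two activation directions. The sketch below is an assumption-laden illustration, not the paper's method: the activations are synthetic and built to lie along two orthogonal axes, and the difference-of-means estimator and every name (`mean_difference_direction`, `factual_axis`, `opinion_axis`) are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_difference_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_b to the mean of acts_a.

    acts_a, acts_b: arrays of shape (n_examples, d_model).
    """
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(0)
d_model, n = 64, 200

# Synthetic setup: two orthonormal axes stand in for the directions the
# heads write to on factual and opinion prompts. Not real model data.
basis, _ = np.linalg.qr(rng.normal(size=(d_model, 2)))
factual_axis, opinion_axis = basis[:, 0], basis[:, 1]

def noise():
    return 0.1 * rng.normal(size=(n, d_model))

factual_wrong = 2.0 * factual_axis + noise()
factual_right = -2.0 * factual_axis + noise()
opinion_disagree = 2.0 * opinion_axis + noise()
opinion_agree = -2.0 * opinion_axis + noise()

factual_dir = mean_difference_direction(factual_wrong, factual_right)
opinion_dir = mean_difference_direction(opinion_disagree, opinion_agree)

# Near 0 means the two conditions write into (nearly) orthogonal directions.
similarity = cosine_similarity(factual_dir, opinion_dir)
```

With real activations, a similarity near zero at the same head positions would be consistent with the orthogonality the abstract describes; a value near ±1 would instead suggest a single shared direction.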

Code & Models

Repositories

mvpandey/shared-sycophancy-lying-circuit (GitHub)