Sycophancy Hides Linearly in the Attention Heads

Rifo Genadi; Munachiso Nwadike; Nurdaulet Mukhituly; Hilal Alquabeh; Tatsuya Hiraoka; Kentaro Inui

arXiv:2601.16644 · cs.CL · January 26, 2026

TL;DR

This paper shows that sycophantic behavior in language models is linearly separable within attention-head activations, and that targeted linear interventions on those heads can mitigate it.

Contribution

The paper demonstrates that sycophancy signals are most effectively manipulated within a sparse subset of attention heads, and it uses linear probes to locate and reduce this behavior.

Findings

01 Sycophancy signals are linearly separable in attention heads.

02 Probes trained on TruthfulQA transfer to other factual benchmarks.

03 Attention heads attend disproportionately to expressions of user doubt.
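The first two findings can be illustrated with a minimal sketch. It is not the authors' code: synthetic Gaussian vectors stand in for real attention-head activations, the "transfer" set is simply a second synthetic sample with more noise rather than a real benchmark, and the names `make_activations`, `train_probe`, and `probe_accuracy` are hypothetical. The probe is plain logistic regression fitted by gradient descent.

```python
import numpy as np

def make_activations(rng, n, direction, noise=1.0):
    """Synthetic stand-in for one head's activations (assumption, not real data).
    Label 1 = the model flips to the user's wrong answer, 0 = it holds firm."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(0.0, noise, size=(n, direction.shape[0]))
    x += np.outer(2 * y - 1, direction)  # classes differ along `direction`
    return x, y

def train_probe(x, y, lr=0.1, steps=500):
    """Fit a linear probe (logistic regression) with batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(label = 1)
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, x, y):
    """Fraction of examples the probe classifies correctly."""
    return float(((x @ w + b > 0).astype(int) == y).mean())

rng = np.random.default_rng(0)
direction = np.zeros(16)
direction[0] = 2.0  # hypothetical "sycophancy direction"

x_train, y_train = make_activations(rng, 400, direction)
x_test, y_test = make_activations(rng, 400, direction)
x_other, y_other = make_activations(rng, 400, direction, noise=1.2)

w, b = train_probe(x_train, y_train)
in_dist_acc = probe_accuracy(w, b, x_test, y_test)
transfer_acc = probe_accuracy(w, b, x_other, y_other)
print(f"held-out accuracy: {in_dist_acc:.2f}, transfer accuracy: {transfer_acc:.2f}")
```

High held-out accuracy here only reflects how the toy data was built; with real activations, probe accuracy per head would be the quantity used to judge where the signal is most separable.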

Abstract

We find that correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified "truthful" directions reveals limited overlap, suggesting that factual accuracy and deference resistance arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads…
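The steering idea in the abstract can be sketched as follows, under stated assumptions. This is not the paper's procedure: the head-selection rule (top-k by probe accuracy), the steering strength `alpha`, the tensor layout `(layers, heads, d_head)`, and the function names `top_k_heads`, `steer`, and `cosine` are all hypothetical, and the activations are random placeholders. A negative `alpha` is assumed to push away from the sycophantic class.

```python
import numpy as np

def top_k_heads(probe_acc, k):
    """Return the k (layer, head) pairs with the highest probe accuracy."""
    flat = np.argsort(probe_acc, axis=None)[::-1][:k]
    return [tuple(int(i) for i in np.unravel_index(f, probe_acc.shape)) for f in flat]

def steer(head_out, directions, selected, alpha):
    """Shift only the selected heads' outputs along their unit probe directions.
    head_out and directions both have shape (layers, heads, d_head)."""
    steered = head_out.copy()
    for layer, head in selected:
        d = directions[layer, head]
        steered[layer, head] += alpha * d / np.linalg.norm(d)
    return steered

def cosine(u, v):
    """Cosine similarity, e.g. for comparing a probe direction with another direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
layers, heads, d_head = 4, 3, 8

# Placeholder probe accuracies: two heads stand out, the rest are at chance.
probe_acc = np.full((layers, heads), 0.5)
probe_acc[1, 2] = 0.9
probe_acc[2, 0] = 0.8

directions = rng.normal(size=(layers, heads, d_head))  # stand-in probe weights
head_out = rng.normal(size=(layers, heads, d_head))    # stand-in head outputs

selected = top_k_heads(probe_acc, k=2)
steered = steer(head_out, directions, selected, alpha=-3.0)
print("steered heads:", selected)
```

Restricting the shift to a few heads mirrors the abstract's observation that steering works best in a sparse subset of heads; `cosine` is the kind of measure one could use to check overlap between two candidate directions.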

Taxonomy

Topics: EEG and Brain-Computer Interfaces · Neural and Behavioral Psychology Studies · Mind wandering and attention