Sycophancy Hides Linearly in the Attention Heads
Rifo Genadi, Munachiso Nwadike, Nurdaulet Mukhituly, Hilal Alquabeh, Tatsuya Hiraoka, Kentaro Inui

TL;DR
This paper shows that sycophantic behavior in language models is linearly separable within attention-head activations, and that targeted linear interventions on those heads can mitigate it.
Contribution
It demonstrates that sycophancy signals are most effectively steered within a sparse subset of middle-layer attention heads, and it uses linear probes to locate and reduce this behavior.
Findings
Sycophancy signals are linearly separable in attention heads.
Probes trained on TruthfulQA transfer to other factual benchmarks.
Attention heads attend disproportionately to expressions of user doubt.
Abstract
We find that correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified "truthful" directions reveals limited overlap, suggesting that factual accuracy and deference resistance arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads…
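As a rough illustration of the probe-then-steer idea described in the abstract, the sketch below fits a logistic-regression probe on synthetic vectors and then shifts an activation along the learned direction. It is not the authors' implementation: the synthetic data merely stands in for per-head attention activations, and the names `train_linear_probe`, `probe_accuracy`, `steer`, and the strength `alpha` are all hypothetical.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe (w, b) by plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        w -= lr * (X.T @ (p - y)) / n           # gradient of mean log-loss
        b -= lr * float(np.mean(p - y))
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples the probe classifies correctly."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

def steer(activation, w, alpha):
    """Shift an activation by alpha along the unit-normalized probe direction."""
    return activation + alpha * (w / np.linalg.norm(w))

# Synthetic stand-in for one head's activations (an assumption, not real data):
# the two classes are offset along a single hidden direction.
rng = np.random.default_rng(0)
d = 64
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
y = rng.integers(0, 2, size=400).astype(float)
X = rng.normal(size=(400, d)) + 3.0 * (2 * y - 1)[:, None] * true_dir

w, b = train_linear_probe(X, y)
acc = probe_accuracy(X, y, w, b)
cos = float(w @ true_dir / np.linalg.norm(w))

x0 = X[0]
steered = steer(x0, w, alpha=4.0)
print(f"probe accuracy: {acc:.3f}, cosine with hidden direction: {cos:.3f}")
print(f"probe logit before: {x0 @ w + b:.2f}, after steering: {steered @ w + b:.2f}")
```

In a real setting the rows of `X` would be activations collected from one attention head at a fixed token position, and the steered vector would be written back into the forward pass; both steps depend on the model and tooling and are left out here. The same cosine computation used for `cos` is one simple way to quantify overlap between two learned directions.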
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Neural and Behavioral Psychology Studies · Mind wandering and attention
