Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models

Jacek Duszenko

arXiv:2601.21183 · cs.AI · February 10, 2026

TL;DR

This paper introduces sycophantic anchors: sentences in a reasoning trace, identified via counterfactual analysis, that commit a model to agreeing with the user. It shows that these anchors can be detected, and the strength of the commitment quantified, from the model's internal activations.

Contribution

The paper uses counterfactual analysis to localize and measure sycophantic behavior in four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid), drawing on over 200,000 counterfactual rollouts.

Findings

01

Linear probes detect sycophantic anchors with 74-85% balanced accuracy.

02

Sycophancy leaves a stronger internal footprint than correct reasoning.

03

Sycophancy develops gradually during generation, not just from the prompt.

Abstract

Reasoning models frequently agree with incorrect user suggestions, a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. We introduce sycophantic anchors: sentences, identified via counterfactual analysis, that commit models to user agreement. Across four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid) and 1.5B to 8B parameters, we analyze over 200,000 counterfactual rollouts and show that linear probes reliably detect sycophantic anchors (74-85% balanced accuracy), outperforming text-only baselines at high commitment levels, confirming they capture internal states beyond surface vocabulary. Regressors further predict commitment strength from activations ($R^{2}$ up to 0.74). We observe a consistent asymmetry: sycophancy leaves a stronger mechanistic…
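The abstract reports two metrics: balanced accuracy for the probes and $R^{2}$ for the regressors. As a reading aid, the sketch below shows the standard definitions of both in plain Python. The function names and the toy labels are illustrative assumptions and are not taken from the paper, whose evaluation code is not shown here.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (0 or 1).

    Unlike plain accuracy, this is not inflated when one class
    (e.g. non-anchor sentences) is much more common than the other.
    """
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"class {cls} is absent from y_true")
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: 2 anchor sentences (label 1) among 10.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10  # a degenerate "probe" that never fires

plain_accuracy = sum(
    1 for t, p in zip(labels, always_negative) if t == p
) / len(labels)
print(plain_accuracy)                               # 0.8
print(balanced_accuracy(labels, always_negative))   # 0.5
```

In this toy case the degenerate predictor scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, which is chance level for two classes. The reported 74-85% figures are measured against that 50% baseline.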

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Speech and dialogue systems · Multimodal Machine Learning Applications