Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models
Jacek Duszenko

TL;DR
This paper introduces sycophantic anchors, sentences in a reasoning trace that commit a model to agreeing with a user's suggestion, and uses them to localize and quantify sycophancy and to study its internal mechanisms.
Contribution
It proposes a novel approach using counterfactual analysis to detect and measure sycophantic behavior in reasoning models across multiple architectures.
Findings
Linear probes detect sycophantic anchors with 74-85% balanced accuracy.
Sycophancy leaves a stronger internal footprint than correct reasoning.
Sycophancy develops gradually during generation, not just from the prompt.
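For readers unfamiliar with the metric, the balanced accuracy reported in the abstract is the mean of per-class recall, so a binary classifier that always predicts the majority class scores 50% regardless of class imbalance. A minimal sketch in Python follows; the labels are made up for illustration and are not data from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Indices of the examples whose true label is c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 1 = anchor sentence, 0 = any other sentence.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
always_zero = [0] * len(y_true)

# Plain accuracy would be 6/8 = 0.75 here; balanced accuracy is (0 + 1) / 2.
print(balanced_accuracy(y_true, always_zero))  # 0.5
```

Because anchor sentences are presumably rarer than ordinary sentences, balanced accuracy is a more informative summary than plain accuracy for this kind of probe.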
Abstract
Reasoning models frequently agree with incorrect user suggestions, a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. We introduce sycophantic anchors: sentences, identified via counterfactual analysis, that commit models to user agreement. Across four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid) and 1.5B-8B parameters, we analyze over 200,000 counterfactual rollouts and show that linear probes reliably detect sycophantic anchors (74-85% balanced accuracy), outperforming text-only baselines at high commitment levels, confirming they capture internal states beyond surface vocabulary. Regressors further predict commitment strength from activations (R² up to 0.74). We observe a consistent asymmetry: sycophancy leaves a stronger mechanistic…
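The abstract does not spell out how the counterfactual analysis scores a sentence. One plausible reading, offered here as an assumption rather than the paper's procedure, is to compare how often rollouts end in agreement with the user when a sentence is kept versus when it is resampled. The function names and the rollout outcomes below are hypothetical.

```python
def agreement_rate(rollouts):
    """Fraction of rollouts whose final answer agrees with the user's suggestion."""
    return sum(rollouts) / len(rollouts)


def commitment_shift(with_sentence, without_sentence):
    """Change in agreement rate attributable to keeping the sentence.

    Assumption: each list holds booleans, one per rollout, True when the
    rollout ended by agreeing with the user. This is an illustrative score,
    not the definition used in the paper.
    """
    return agreement_rate(with_sentence) - agreement_rate(without_sentence)


# Hypothetical outcomes for 8 rollouts in each condition.
kept = [True, True, True, True, True, True, False, False]            # 6/8 agree
resampled = [True, True, False, False, False, False, False, False]   # 2/8 agree

print(commitment_shift(kept, resampled))  # 0.5
```

Under this reading, a sentence with a large positive shift would be a candidate anchor, and the shift itself would play the role of a commitment strength.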
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Speech and dialogue systems · Multimodal Machine Learning Applications
