SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
Joy Bhalla, Kristina Gligorić

TL;DR
This paper introduces SWAY, an unsupervised metric for measuring language model sycophancy, and proposes a counterfactual mitigation strategy that significantly reduces sycophantic behavior without harming responsiveness.
Contribution
The paper develops a novel computational linguistic metric for sycophancy and introduces an effective counterfactual mitigation approach to reduce it in language models.
Findings
Sycophancy increases with epistemic commitment.
Counterfactual chain-of-thought (CoT) mitigation reduces sycophancy to near zero.
Baseline anti-sycophantic instructions can backfire.
Abstract
Large language models exhibit sycophancy: the tendency to shift outputs toward user-expressed stances, regardless of correctness or consistency. While prior work has studied this issue and its impacts, rigorous computational linguistic metrics are needed to identify when models are being sycophantic. Here, we introduce SWAY, an unsupervised computational linguistic measure of sycophancy. We develop a counterfactual prompting mechanism to identify how much a model's agreement shifts under positive versus negative linguistic pressure, isolating framing effects from content. Applying this metric to benchmark 6 models, we find that sycophancy increases with epistemic commitment. Leveraging our metric, we introduce a counterfactual mitigation strategy teaching models to consider what the answer would be if opposite assumptions were suggested. While baseline mitigation instructing to be…
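The counterfactual idea in the abstract, comparing a model's agreement when the same content is framed with positive versus negative pressure, can be illustrated with a small sketch. This is not the paper's definition of SWAY: the prompt templates, the `agreement` scorer (assumed to return the probability, in [0, 1], that the model endorses the claim), and the simple mean-difference score are all hypothetical choices made for illustration.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical framing templates; the paper's actual prompts are not shown here.
POSITIVE = "I believe this claim is true: {claim}\nIs the claim true?"
NEGATIVE = "I believe this claim is false: {claim}\nIs the claim true?"


def framing_shift(claims: Sequence[str],
                  agreement: Callable[[str], float]) -> float:
    """Mean change in endorsement of a claim between the two framings.

    `agreement(prompt)` is an assumed scorer returning the probability
    that the model endorses the claim. A value near 0 means framing has
    little effect; a positive value means the model moves toward
    whichever stance the user expressed.
    """
    shifts = []
    for claim in claims:
        pos = agreement(POSITIVE.format(claim=claim))
        neg = agreement(NEGATIVE.format(claim=claim))
        shifts.append(pos - neg)  # content is fixed, only framing differs
    return mean(shifts)


# Toy stand-ins for a model, for demonstration only.
def steady_model(prompt: str) -> float:
    return 0.5  # ignores the user's stance entirely


def swayed_model(prompt: str) -> float:
    # Endorses the claim far more when the user says it is true.
    return 0.9 if prompt.startswith("I believe this claim is true") else 0.1


claims = ["Water boils at 100 C at sea level.", "The Moon is a planet."]
steady_score = framing_shift(claims, steady_model)
swayed_score = framing_shift(claims, swayed_model)
```

Because each claim appears under both framings, the content cancels out in the difference and only the effect of the user's stated stance remains, which is the sense in which the abstract speaks of isolating framing effects from content.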
