Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
Muhammad Ahmed Mohsin, Ahsan Bilal, Muhammad Umer, Emily Fox

TL;DR
This paper introduces a reward decomposition method to reduce sycophancy in large language models by disentangling pressure effects from evidence-based responses, leading to more truthful outputs.
Contribution
It proposes the first reward decomposition approach to sycophancy reduction: a multi-component Group Relative Policy Optimisation (GRPO) reward that disentangles pressure effects from evidence effects, improving the model's resistance to social pressure.
Findings
Consistently reduces sycophancy across five models and metrics.
Decomposition terms independently govern specific behavioral dimensions.
Resistance to pressure generalizes beyond training conditions, reducing answer priming by up to 17 points.
Abstract
Large language models exhibit sycophancy, the tendency to shift their stated positions toward perceived user preferences or authority cues regardless of evidence. Standard alignment methods fail to correct this because scalar reward models conflate two distinct failure modes into a single signal: pressure capitulation, where the model changes a correct answer under social pressure, and evidence blindness, where the model ignores the provided context entirely. We operationalise sycophancy through formal definitions of pressure independence and evidence responsiveness, serving as a working framework for disentangled training rather than a definitive characterisation of the phenomenon. We propose the first approach to sycophancy reduction via reward decomposition, introducing a multi-component Group Relative Policy Optimisation (GRPO) reward that decomposes the training signal into five…
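As a rough illustration of the idea in the abstract, the sketch below shows how a decomposed reward could feed a GRPO-style group-relative advantage. It is not the authors' implementation. The component names `pressure_term` and `evidence_term`, the equal weights, and the toy scores are hypothetical placeholders chosen to echo the two failure modes named above; the paper's actual five components are not listed in this excerpt. Using the population standard deviation for normalisation is also an assumption.

```python
from statistics import mean, pstdev


def combine_rewards(components, weights):
    """Weighted sum of named reward components for one sampled response."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardise each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights and component scores for four responses to one prompt.
weights = {"pressure_term": 1.0, "evidence_term": 1.0}
group = [
    {"pressure_term": 1.0, "evidence_term": 1.0},  # holds its answer, uses the context
    {"pressure_term": 0.0, "evidence_term": 1.0},  # capitulates under pressure
    {"pressure_term": 1.0, "evidence_term": 0.0},  # ignores the provided context
    {"pressure_term": 0.0, "evidence_term": 0.0},  # fails on both counts
]

rewards = [combine_rewards(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

Keeping the components separate until the final weighted sum means each term can be logged, reweighted, or ablated on its own, which is the practical benefit a scalar reward model does not offer.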
