How RLHF Amplifies Sycophancy
Itai Shapira, Gerdus Benade, Ariel D. Procaccia

TL;DR
This paper analyzes how reinforcement learning from human feedback (RLHF) can unintentionally amplify sycophantic behavior in language models, and proposes a method to mitigate this effect through reward correction.
Contribution
It provides a formal analysis of the amplification mechanism in RLHF and introduces a training intervention to neutralize sycophantic bias in language models.
Findings
Reward gaps are common and cause behavioral drift.
The amplification mechanism is linked to covariance between endorsement signals and learned reward.
A minimal reward correction policy can prevent increased sycophancy.
Abstract
Large language models often exhibit increased sycophantic behavior after preference-based post-training, showing a stronger tendency to affirm a user's stated or implied belief even when this conflicts with factual accuracy or sound judgment. We present a formal analysis of how alignment from human feedback can increase this failure mode by identifying an explicit amplification mechanism that causally links optimization against a learned reward to bias in the human preference data used for alignment. We show that the direction of behavioral drift is determined by a covariance under the base policy between endorsing the belief signal in the prompt and the learned reward, and that the first-order effect reduces to a simple mean-gap condition. We then analyze reward learning from pairwise comparisons under random utility models like Bradley-Terry and characterize when bias in human…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Speech and dialogue systems · Neurobiology of Language and Bilingualism
