Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Arya Shah, Deepali Mishra, Chaklam Silpasuwanchai

TL;DR
This study investigates how the agreeableness of a persona adopted by a language model affects the model's tendency toward sycophancy, and finds a significant positive correlation in most of the models tested.
Contribution
It systematically quantifies the relationship between persona agreeableness and sycophancy in 13 open-weight language models, using a new benchmark of 275 personas and 4,950 sycophancy-eliciting prompts.
Findings
9 of the 13 models show a statistically significant positive correlation between persona agreeableness and sycophancy.
Pearson correlation coefficients reach as high as r = 0.87.
Effect sizes reach Cohen's d = 2.33, well above the conventional 0.8 threshold for a large effect.
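To make the two statistics in the findings concrete, here is a minimal Python sketch of how a Pearson correlation and a Cohen's d could be computed from per-persona scores. The function names, the pooled-standard-deviation form of Cohen's d, and the toy numbers are illustrative assumptions; this listing does not describe the authors' exact procedure, and none of the values below come from the paper.

```python
import math
from statistics import mean, variance


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed form)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical toy data: agreeableness scores and sycophancy rates per persona.
agreeableness = [1.0, 2.0, 3.0, 4.0, 5.0]
sycophancy = [0.10, 0.20, 0.25, 0.40, 0.45]
r = pearson_r(agreeableness, sycophancy)

# Hypothetical split into high- and low-agreeableness personas.
d = cohens_d([0.40, 0.45, 0.50], [0.10, 0.20, 0.25])
```

A value of r near 1 means sycophancy rises almost linearly with agreeableness, while d expresses the gap between two groups in units of their pooled standard deviation.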
Abstract
Large language models increasingly serve as conversational agents that adopt personas and role-play characters at user request. This capability, while valuable, raises concerns about sycophancy: the tendency to provide responses that validate users rather than prioritize factual accuracy. While prior work has established that sycophancy poses risks to AI safety and alignment, the relationship between specific personality traits of adopted personas and the degree of sycophantic behavior remains unexplored. We present a systematic investigation of how persona agreeableness influences sycophancy across 13 small, open-weight language models ranging from 0.6B to 20B parameters. We develop a benchmark comprising 275 personas evaluated on NEO-IPIP agreeableness subscales and expose each persona to 4,950 sycophancy-eliciting prompts spanning 33 topic categories. Our analysis reveals that 9 of…
