TL;DR
This paper introduces ELEPHANT, a benchmark for measuring social sycophancy in LLMs, revealing high levels of face-preservation and inconsistent moral judgments, and explores mitigation strategies.
Contribution
It defines social sycophancy in LLMs, presents a new benchmark for measurement, and evaluates model behaviors and mitigation approaches.
Findings
LLMs preserve user face 45 percentage points more than humans.
LLMs affirm both sides of moral conflicts in 48% of cases.
Model-based steering shows promise in reducing social sycophancy.
Abstract
LLMs are known to exhibit sycophancy: agreeing with and flattering users, even at the cost of correctness. Prior work measures sycophancy only as direct agreement with users' explicitly stated beliefs that can be compared to a ground truth. This fails to capture broader forms of sycophancy such as affirming a user's self-image or other implicit beliefs. To address this gap, we introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face (their desired self-image), and present ELEPHANT, a benchmark for measuring social sycophancy in an LLM. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy: on average, they preserve user's face 45 percentage points more than humans in general advice queries and in queries describing clear user wrongdoing (from Reddit's r/AmITheAsshole). Furthermore, when…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
