TL;DR
This paper shows that off-the-shelf persona vectors, which were never trained on sycophancy data, can reduce model sycophancy by 68-98% as much as Contrastive Activation Addition (CAA), and argues that sycophancy behaves as a persona-level property rather than a single steerable direction.
Contribution
It shows that pre-existing persona vectors can mitigate sycophancy in instruction-tuned models, offering an alternative to steering directions derived from labelled sycophancy data, and finds that the persona vectors are largely independent, geometrically, of the sycophancy direction.
Findings
Steering toward doubt or scrutiny personas achieves 68-98% of CAA's reduction in sycophancy.
Unlike CAA, persona steering maintains accuracy when the user is correct.
Sycophancy is more a persona-level property than a single steerable direction.
Abstract
We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny achieves approximately 68% to 98% of CAA's reduction in sycophancy and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in…
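The abstract describes two operations that a small example may help make concrete: deriving a CAA direction from labelled response pairs, and checking how far a persona vector is geometrically independent of that direction. The sketch below is illustrative only and is not the authors' code. It assumes the CAA direction is the difference of mean activations between sycophantic and honest responses at a single layer, that "independence" is measured by cosine similarity, and that steering adds a scaled vector to a hidden state. All function names, the toy three-dimensional activations, and the scale `alpha` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def caa_direction(sycophantic_acts: Sequence[Sequence[float]],
                  honest_acts: Sequence[Sequence[float]]) -> Vector:
    """Difference of class means (assumed form of the CAA steering direction)."""
    mu_syc = mean_vector(sycophantic_acts)
    mu_honest = mean_vector(honest_acts)
    return [s - h for s, h in zip(mu_syc, mu_honest)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; values near 0 indicate near-orthogonal directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Add alpha * direction to a hidden state; negative alpha steers away."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical): the two classes differ only along axis 0.
sycophantic_acts = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
honest_acts = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
syc_dir = caa_direction(sycophantic_acts, honest_acts)

# A hypothetical persona vector lying along a different axis.
persona_vec = [0.0, 1.0, 0.0]
similarity = cosine(syc_dir, persona_vec)

# Steering away from the sycophancy direction with an assumed scale of -1.0.
steered = steer([0.5, 0.5, 0.5], syc_dir, alpha=-1.0)
```

In this toy setup the persona vector is orthogonal to the CAA direction, so steering along it leaves the sycophancy axis untouched. That mirrors, in miniature, the kind of geometric independence the abstract reports, though real activations would only be approximately orthogonal.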
