Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Ishaan Kelkar; Nebras Alam; Vikram Kakaria; Madhur Panwar; Vasu Sharma; Maheep Chaudhary

arXiv:2605.21006·cs.AI·May 21, 2026

TL;DR

Off-the-shelf persona vectors reduce model sycophancy nearly as well as Contrastive Activation Addition (CAA), the standard targeted method, which suggests that sycophancy is a persona-level property rather than a single steerable direction.

Contribution

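The paper shows that pre-existing persona vectors, which were not trained on sycophancy data, can mitigate sycophancy in instruction-tuned models. This offers an alternative to steering directions derived from labelled data, and the persona vectors turn out to be largely geometrically independent of the sycophancy direction.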

Findings

01

Steering toward doubt or scrutiny personas achieves roughly 68% to 98% of CAA's reduction in sycophancy.

02

Unlike CAA, persona steering maintains accuracy when the user is correct.

03

Sycophancy is more a persona-level property than a single steerable direction.

Abstract

We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny achieves approximately 68% and 98% of CAA's reduction in sycophancy and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in…
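The mechanics described in the abstract can be illustrated with a minimal sketch. It assumes that CAA takes the difference between the mean activations of sycophantic and honest responses, that steering adds a scaled vector to a hidden state, and that geometric independence is measured with cosine similarity. All function names, the three-dimensional activations, and the persona vector are hypothetical toy values chosen for illustration; none of them come from the paper or its repository.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def caa_direction(sycophantic_acts, honest_acts):
    """Assumed CAA recipe: mean(sycophantic) minus mean(honest) activations."""
    mu_s = mean_vector(sycophantic_acts)
    mu_h = mean_vector(honest_acts)
    return [s - h for s, h in zip(mu_s, mu_h)]

def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state; a negative alpha steers away."""
    return [x + alpha * d for x, d in zip(hidden, direction)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy activations (made up): two labelled pairs in a 3-dimensional space.
sycophantic_acts = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
honest_acts = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

syc_dir = caa_direction(sycophantic_acts, honest_acts)

# Hypothetical "doubt" persona vector, constructed to be orthogonal to syc_dir.
persona_vec = [0.0, 1.0, 0.0]

# Steering away from the sycophancy direction versus toward the persona.
hidden = [1.0, 1.0, 1.0]
away_from_sycophancy = steer(hidden, syc_dir, alpha=-0.5)
toward_persona = steer(hidden, persona_vec, alpha=1.0)

similarity = cosine(syc_dir, persona_vec)
```

A cosine similarity near zero is one way two directions can be "largely independent"; here it is exactly zero only because the toy persona vector was built that way. In a real model the same operations would be applied to residual-stream activations at a chosen layer, which this sketch does not attempt.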

Code & Models

Repositories

https://anonymous.4open.science/r/Sycophancy-Steering-9DF0
