Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
Wenkai Li, Fan Yang, Shaunak A. Mehta, Koichi Onoue

TL;DR
Safety evaluations of persona-imbued LLMs are incomplete when they rely on prompt-based methods alone: activation steering exposes different vulnerabilities, and these vary across models.
Contribution
The paper demonstrates that the safety vulnerabilities exposed by prompting and by activation steering differ and depend on the model architecture, which motivates multi-method evaluation.
Findings
Persona danger rankings are consistent across architectures under system prompting.
Activation steering vulnerability diverges sharply and cannot be predicted from prompt rankings.
The prosocial persona paradox shows safety inversions between prompting and activation steering.
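One way to make "rankings are consistent" and "cannot be predicted from prompt rankings" concrete is a rank correlation between per-persona harm scores. The sketch below is only an illustration under stated assumptions: Spearman's rho is a choice made here, not a statistic this summary confirms the authors used, and the score lists and the names `prompt_model_a`, `prompt_model_b`, and `steer_model_a` are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Assumes each input has at least two distinct values; a constant
    input would make the denominator zero.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# HYPOTHETICAL placeholder harm scores for five personas (not paper data).
prompt_model_a = [0.10, 0.20, 0.30, 0.40, 0.50]  # system prompting, model A
prompt_model_b = [0.15, 0.22, 0.35, 0.41, 0.60]  # system prompting, model B
steer_model_a = [0.50, 0.10, 0.40, 0.20, 0.30]   # activation steering, model A

# Agreement of persona rankings across two models under prompting.
rho_prompt = spearman(prompt_model_a, prompt_model_b)
# Agreement between prompting and steering rankings on the same model.
rho_cross = spearman(prompt_model_a, steer_model_a)

print(f"prompt vs prompt: {rho_prompt:.2f}")
print(f"prompt vs steering: {rho_cross:.2f}")
```

In this toy setup the two prompting rankings agree perfectly while the prompting and steering rankings do not, which is the qualitative pattern the findings describe. The numbers themselves carry no evidential weight.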
Abstract
Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures (--), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more vulnerable to activation steering, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the *prosocial persona paradox*: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is…
