Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs

Wenkai Li; Fan Yang; Shaunak A. Mehta; Koichi Onoue

arXiv:2604.11120·cs.AI·April 15, 2026

TL;DR

Safety evaluations of persona-imbued LLMs are incomplete when they rely on prompt-based personas alone: activation steering exposes different, architecture-dependent vulnerabilities.

Contribution

The paper demonstrates that prompting and activation steering expose different, architecture-dependent vulnerability profiles, so evaluating with only one method can miss a model's dominant failure mode.

Findings

01 Persona danger rankings under system prompting are consistent across architectures.

02 Activation-steering vulnerability diverges sharply across models and cannot be predicted from prompt-side rankings.

03 The prosocial persona paradox shows safety inversions between prompting and activation steering.

Abstract

Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures ($\rho = 0.71$–$0.96$), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more vulnerable to activation steering (AS), whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the *prosocial persona paradox*: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is…
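As a rough illustration of how cross-model agreement in persona rankings can be quantified, the sketch below computes a rank correlation between two models' per-persona scores. It assumes that $\rho$ denotes Spearman's rank correlation, which the abstract does not state explicitly. The function names and the score lists are hypothetical and are not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-persona harm scores for two models (illustrative only).
model_a_scores = [0.10, 0.25, 0.40, 0.55, 0.70]
model_b_scores = [0.05, 0.30, 0.20, 0.60, 0.80]

rho = spearman_rho(model_a_scores, model_b_scores)
print(f"rho = {rho:.2f}")
```

A value near 1 means the two models order the personas almost identically, even if their absolute scores differ. The same comparison between one model's prompt-side scores and its activation-steering scores would be one way to check whether the first ranking predicts the second.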
