Persona Jailbreaking in Large Language Models
Jivnesh Sandhan, Fei Cheng, Tushar Sandhan, Yugo Murawaki

TL;DR
This paper uncovers a new vulnerability in large language models where adversarial inputs can manipulate the models' personas, raising concerns for their reliability in sensitive applications.
Contribution
It introduces the task of persona editing and proposes PHISH, a framework that demonstrates how to adversarially steer LLM traits through user inputs in a black-box setting.
Findings
PHISH effectively shifts LLM personas across multiple benchmarks and models.
The attack causes collateral changes in correlated traits and is more effective in multi-turn interactions.
Current guardrails are partially effective but remain vulnerable under sustained attacks.
Abstract
Large Language Models (LLMs) are increasingly deployed in domains such as education, mental health and customer support, where stable and consistent personas are critical for reliability. Yet, existing studies focus on narrative or role-playing tasks and overlook how adversarial conversational history alone can reshape induced personas. Black-box persona manipulation remains unexplored, raising concerns for robustness in realistic interactions. In response, we introduce the task of persona editing, which adversarially steers LLM traits through user-side inputs under a black-box, inference-only setting. To this end, we propose PHISH (Persona Hijacking via Implicit Steering in History), the first framework to expose a new vulnerability in LLM safety that embeds semantically loaded cues into user queries to gradually induce reverse personas. We also define a metric to quantify attack…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsPersona Design and Applications · Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education
