Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
Jiajia Li, Xiaoyu Wen, Zhongtian Ma, Shuyue Hu, Qiaosheng Zhang, Zhen Wang

TL;DR
This paper introduces Persona-Invariant Alignment (PIA), an adversarial self-play framework for safety alignment in large language models that reduces the success of persona-based jailbreaks while maintaining model capabilities.
Contribution
The paper presents a theoretically grounded adversarial self-play approach that combines Persona Lineage Evolution (PLE) on the attack side with Persona-Invariant Consistency Learning (PICL) on the defense side for robust safety alignment.
Findings
PICL significantly reduces the attack success rate of persona-based jailbreaks.
PLE efficiently explores high-risk persona spaces.
The framework maintains model capabilities while improving safety.
Abstract
The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, including potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreaks has primarily focused on iterating attacks, and it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona…
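The abstract is truncated here, so the exact form of the PICL objective is not available on this page. As a rough illustration of what a one-sided (unilateral) KL consistency penalty can look like, the sketch below compares a model's output distribution on a plain request against its distribution on the same request wrapped in a persona. Everything in it is an assumption made for illustration: the function names, the two-way "refuse vs. comply" toy distribution, and the choice to treat the persona-free distribution as the fixed anchor are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def one_sided_consistency_penalty(plain_logits, persona_logits):
    """Hypothetical penalty: KL(anchor || persona), computed in one direction only.

    Assumption: the persona-free distribution acts as a fixed anchor, so the
    penalty measures how far the persona-conditioned distribution has drifted
    from it. In a real training loop the anchor would typically be excluded
    from gradient updates; this toy version only computes the value.
    """
    anchor = softmax(plain_logits)
    persona = softmax(persona_logits)
    return kl_divergence(anchor, persona)


# Toy logits over [refuse, comply] for the same underlying harmful request.
plain = [2.0, 0.0]            # model leans toward refusing
same_under_persona = [2.0, 0.0]
flipped_under_persona = [0.0, 2.0]

no_drift = one_sided_consistency_penalty(plain, same_under_persona)
drift = one_sided_consistency_penalty(plain, flipped_under_persona)
```

In this toy setup the penalty is zero when the persona leaves the decision unchanged and grows when the persona flips the model toward compliance, which is the kind of persona-induced shift a consistency term would discourage.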
