Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

Jiajia Li; Xiaoyu Wen; Zhongtian Ma; Shuyue Hu; Qiaosheng Zhang; Zhen Wang

arXiv:2605.01899·cs.AI·May 5, 2026

TL;DR

This paper introduces Persona-Invariant Alignment (PIA), an adversarial self-play framework for safety alignment of large language models that reduces the success rate of persona-based jailbreaks while preserving model capabilities.

Contribution

The paper presents a theoretically grounded adversarial self-play approach that pairs Persona Lineage Evolution (PLE) on the attack side with Persona-Invariant Consistency Learning (PICL) on the defense side to make safety alignment robust to persona changes.

Findings

1. PICL significantly reduces the attack success rate.

2. PLE efficiently explores high-risk persona spaces.

3. The framework maintains model capabilities while improving safety.

Abstract

The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, even in potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreak has primarily focused on attack iterations, yet it lacks systemic and mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework that achieves co-evolution through Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side. Theoretically, PICL is grounded in the structural separation hypothesis, using a unilateral KL-divergence constraint to enable the structural decoupling of safety decisions from persona…
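The abstract is cut off before it spells out the PICL objective, so the following is only a minimal sketch of what a one-sided KL consistency term could look like, not the authors' implementation. It assumes the "unilateral" constraint means the persona-free response distribution is held fixed as the target and only the persona-conditioned distribution is pulled toward it. The function names `log_softmax` and `one_sided_kl` and the toy two-way refuse/comply logits are hypothetical and do not come from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def one_sided_kl(anchor_logits, persona_logits):
    """KL(anchor || persona).

    ASSUMPTION: the anchor (persona-free) distribution is treated as a
    fixed target, so in a training loop only the persona-conditioned
    side would receive gradients. The paper's exact direction and
    weighting are not given in the abstract.
    """
    log_p = log_softmax(anchor_logits)
    log_q = log_softmax(persona_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy example: scores over two hypothetical decisions, [refuse, comply].
anchor = [2.0, 0.0]    # same harmful request, no persona attached
persona = [0.0, 0.0]   # same request wrapped in a role-play persona

penalty = one_sided_kl(anchor, persona)
print(f"consistency penalty: {penalty:.4f}")
```

The penalty is zero when the persona leaves the decision distribution unchanged and grows as the persona shifts it. Because KL divergence is asymmetric, swapping the two arguments gives a different value, which is why the choice of which side to hold fixed matters.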

Code & Models

Repositories

JiajiaLi-1130/PIA (GitHub)

Models

🤗 XiaoyuWen/PIA (model)

Datasets

XiaoyuWen/PIA-Persona-Dataset (dataset, 39 downloads)
