Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents
Mingyang Liao, Yichen Wan, shuchen wu, Chenxi Miao, Xin Shen, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang

TL;DR
This paper introduces a training-free dual-cycle framework for safety role-playing agents that enhances persona fidelity and jailbreak resistance by synthesizing attacks and distilling safety knowledge, applicable to various large language models.
Contribution
It proposes a novel training-free dual-cycle self-evolution method combining attack synthesis and safety knowledge distillation for safer, more faithful role-playing LLMs.
Findings
Significant improvements in safety and role fidelity over baselines.
Robust generalization to unseen personas and attack prompts.
Effective across multiple proprietary LLMs.
Abstract
LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work mitigates this issue with training-time solutions (e.g., data curation or alignment-oriented regularization). However, these approaches are costly to maintain as personas and attack strategies evolve, can degrade in-character behavior, and are typically infeasible for frontier closed-weight LLMs. We propose a training-free Dual-Cycle Adversarial Self-Evolution framework with two coupled cycles. A Persona-Targeted Attacker Cycle synthesizes progressively stronger jailbreak prompts, while a Role-Playing Defender Cycle distills observed failures into a hierarchical knowledge base of (i) global safety rules, (ii) persona-grounded constraints, and (iii) safe in-character…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsPersona Design and Applications · Ethics and Social Impacts of AI · Advanced Malware Detection Techniques
