Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation

Cristian Morasso; Anisa Halimi; Muhammad Zaid Hameed; Douglas Leith

arXiv:2605.11730 · cs.LG · May 13, 2026

TL;DR

The paper introduces PCAP, a persona-conditioned adversarial prompting method that enhances red-teaming diversity and effectiveness, leading to improved model robustness through automated vulnerability discovery and mitigation.

Contribution

It proposes a novel persona-conditioned adversarial search technique that uncovers diverse attack scenarios and generates data to significantly improve LLM safety and robustness.

Findings

01

On GPT-OSS 120B, attack success rate increased from 57% to 97%.

02

Generated prompts are 2-6 times more diverse.

03

Fine-tuning lightweight adapters on PCAP-generated data improves recall from 0.36 to 0.99 and F1 from 0.53 to 0.96.

Abstract

Automated red-teaming for LLMs often discovers narrow attack slices, missing diverse real-world threats, and yielding insufficient data for safety fine-tuning. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on diverse attacker personas (e.g., doctors, students, malicious actors) and strategy sets to explore realistic attack scenarios. By running parallel persona-conditioned searches, PCAP discovers transferable jailbreaks across different contexts and generates rich defense datasets with automatic metadata tracking. On GPT-OSS 120B, PCAP increases attack success from 57% to 97% while producing 2-6× more diverse prompts covering varied real-world scenarios. Critically, fine-tuning lightweight adapters on PCAP-generated data significantly improves model robustness (recall: 0.36 → 0.99, F1: 0.53 → 0.96)…
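
The abstract describes the overall shape of the method (parallel searches, each conditioned on a persona and a strategy set, with metadata recorded for every attempt) but not its internals. The sketch below is one minimal way that control flow could look, not the authors' implementation. Everything in it is an assumption made for illustration: the strategy labels, the `Record` fields, the attempt budget, and especially `target_model` and `judge`, which are deterministic stubs standing in for a real model under test and a real safety judge. The prompts are inert placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product

# Personas follow the examples named in the abstract; strategy labels are hypothetical.
PERSONAS = ["doctor", "student", "malicious actor"]
STRATEGIES = ["role_play", "hypothetical_framing"]


@dataclass(frozen=True)
class Record:
    """One attempt plus the metadata needed to build a defense dataset later."""
    persona: str
    strategy: str
    attempt: int
    prompt: str
    response: str
    success: bool


def target_model(prompt: str) -> str:
    """Stub for the model under test (assumption): refuses one persona, complies otherwise."""
    return "REFUSED" if "persona=student" in prompt else "COMPLIED"


def judge(response: str) -> bool:
    """Stub judge (assumption): an attempt counts as a success if the model complied."""
    return response == "COMPLIED"


def search(persona: str, strategy: str, goal: str, budget: int = 3) -> list[Record]:
    """Run one persona-conditioned search, stopping at the first success."""
    records = []
    for attempt in range(budget):
        prompt = f"[persona={persona}][strategy={strategy}][attempt={attempt}] {goal}"
        response = target_model(prompt)
        ok = judge(response)
        records.append(Record(persona, strategy, attempt, prompt, response, ok))
        if ok:
            break
    return records


def run_all(goal: str) -> list[Record]:
    """Run every (persona, strategy) search in parallel and flatten the results."""
    pairs = list(product(PERSONAS, STRATEGIES))
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_search = pool.map(lambda ps: search(ps[0], ps[1], goal), pairs)
        return [record for records in per_search for record in records]


def search_success_rate(records: list[Record]) -> float:
    """Fraction of (persona, strategy) searches with at least one successful attempt."""
    attempted = {(r.persona, r.strategy) for r in records}
    succeeded = {(r.persona, r.strategy) for r in records if r.success}
    return len(succeeded) / len(attempted)


if __name__ == "__main__":
    results = run_all("<placeholder goal>")
    print(len(results), search_success_rate(results))
```

Because each `Record` carries its persona and strategy, the collected attempts can be grouped or filtered afterwards, which is the kind of per-attempt bookkeeping the abstract refers to as metadata tracking. A real pipeline would replace the two stubs and would likely generate or mutate prompts between attempts instead of reusing a template.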
