Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation
Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith

TL;DR
The paper introduces PCAP, a persona-conditioned adversarial prompting method that enhances red-teaming diversity and effectiveness, leading to improved model robustness through automated vulnerability discovery and mitigation.
Contribution
It proposes a novel persona-conditioned adversarial search technique that uncovers diverse attack scenarios and generates data to significantly improve LLM safety and robustness.
Findings
Attack success rate on GPT-OSS 120B increased from 57% to 97%.
Generated prompts are 2-6 times more diverse.
Fine-tuning lightweight adapters on PCAP-generated data improves recall from 0.36 to 0.99 and F1 from 0.53 to 0.96.
Abstract
Automated red-teaming for LLMs often discovers narrow attack slices, missing diverse real-world threats and yielding insufficient data for safety fine-tuning. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on diverse attacker personas (e.g., doctors, students, malicious actors) and strategy sets to explore realistic attack scenarios. By running parallel persona-conditioned searches, PCAP discovers transferable jailbreaks across different contexts and generates rich defense datasets with automatic metadata tracking. On GPT-OSS 120B, PCAP increases attack success from 57% to 97% while producing 2-6× more diverse prompts covering varied real-world scenarios. Critically, fine-tuning lightweight adapters on PCAP-generated data significantly improves model robustness (recall: 0.36 → 0.99, F1: 0.53 → 0.96)…
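The abstract describes the search loop only at a high level, so the following is a minimal Python sketch of what parallel persona-conditioned search with metadata tracking could look like. It is not the authors' implementation. Every name here (`Attempt`, `build_prompt`, `stub_target`, `stub_judge`, `run_pcap_sketch`), the prompt template, and the strategy labels are assumptions made for illustration; the target model and the judge are stubs that always refuse, standing in for a real model under test and a real safety classifier.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One search record; persona and strategy are stored as metadata."""
    persona: str
    strategy: str
    prompt: str
    success: bool


def build_prompt(persona: str, strategy: str, goal: str) -> str:
    # Hypothetical template: the paper's actual conditioning format is not given.
    return f"[persona={persona}] [strategy={strategy}] {goal}"


def stub_target(prompt: str) -> str:
    # Placeholder for the model under test; always refuses in this sketch.
    return "REFUSED"


def stub_judge(response: str) -> bool:
    # Placeholder judge: counts anything other than a refusal as a success.
    return response != "REFUSED"


def search_persona(persona, strategies, goal, target=stub_target, judge=stub_judge):
    """Try every strategy for one persona and record each attempt."""
    attempts = []
    for strategy in strategies:
        prompt = build_prompt(persona, strategy, goal)
        response = target(prompt)
        attempts.append(Attempt(persona, strategy, prompt, judge(response)))
    return attempts


def run_pcap_sketch(personas, strategies, goal):
    """Run one search per persona in parallel and flatten the records."""
    with ThreadPoolExecutor() as pool:
        per_persona = pool.map(
            lambda p: search_persona(p, strategies, goal), personas
        )
        return [a for attempts in per_persona for a in attempts]


def attack_success_rate(attempts):
    """Fraction of attempts the judge marked successful (0.0 if empty)."""
    if not attempts:
        return 0.0
    return sum(a.success for a in attempts) / len(attempts)


if __name__ == "__main__":
    records = run_pcap_sketch(
        personas=["doctor", "student"],
        strategies=["strategy_a", "strategy_b"],  # hypothetical labels
        goal="PLACEHOLDER_GOAL",
    )
    print(len(records), attack_success_rate(records))
```

Because each record carries its persona and strategy, the same list could in principle be grouped by persona to compare coverage, or exported as labelled examples for defensive fine-tuning, which is the role the abstract assigns to the generated dataset.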
