Persona-Conditioned Adversarial Prompting (PCAP): Multi-Identity Red-Teaming for Enhanced Adversarial Prompt Discovery
Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith

TL;DR
This paper introduces PCAP, an adversarial prompting method that conditions the search on attacker personas to discover more diverse, transferable jailbreak attacks, raising attack success rate on GPT-OSS 120B from ~58% to ~97%.
Contribution
The paper presents a new persona-conditioned adversarial prompting approach that enhances attack diversity and success in red-teaming large language models.
Findings
ASR on GPT-OSS 120B increased from ~58% to ~97%.
PCAP discovers more diverse and transferable jailbreaks.
Method is orthogonal to existing search algorithms.
Abstract
Existing automated red-teaming pipelines often miss attacks that depend on attacker identity, framing, or multi-turn tactics. This under-coverage underestimates real-world risk. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on attacker personas and strategy cards and runs parallel persona-conditioned beam searches to discover diverse, transferable jailbreaks. PCAP is orthogonal to the underlying search algorithm and substantially increases attack success rate (ASR) and prompt diversity (e.g., ASR on GPT-OSS 120B rises from ~58% to ~97%), improving coverage of attack strategies.
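The control flow described in the abstract (one beam search per persona, with expansions conditioned on that persona) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Persona`, `beam_search`, `pcap_search`, `toy_expand`, and `toy_score` are all hypothetical, and the two toy functions are deterministic placeholders standing in for whatever candidate generator and judge a real pipeline would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Beam = List[Tuple[float, str]]


@dataclass(frozen=True)
class Persona:
    """Hypothetical container for a persona and its strategy card."""
    name: str
    strategy_card: str


def beam_search(
    persona: Persona,
    seed: str,
    expand: Callable[[Persona, str], List[str]],
    score: Callable[[str], float],
    beam_width: int = 2,
    depth: int = 3,
) -> Beam:
    """Run one beam search whose expansions are conditioned on `persona`."""
    beam: Beam = [(score(seed), seed)]
    for _ in range(depth):
        candidates = [
            (score(child), child)
            for _, parent in beam
            for child in expand(persona, parent)
        ]
        if not candidates:
            break
        # Keep the top-scoring candidates; ties break on text for determinism.
        beam = sorted(candidates, key=lambda c: (-c[0], c[1]))[:beam_width]
    return beam


def pcap_search(
    personas: List[Persona],
    seed: str,
    expand: Callable[[Persona, str], List[str]],
    score: Callable[[str], float],
    **kwargs,
) -> Dict[str, Beam]:
    """Run an independent beam search per persona (sequentially in this sketch)."""
    return {p.name: beam_search(p, seed, expand, score, **kwargs) for p in personas}


def toy_expand(persona: Persona, text: str) -> List[str]:
    """Placeholder generator: appends two persona-tagged variants (assumption)."""
    return [f"{text} [{persona.name}:a]", f"{text} [{persona.name}:b]"]


def toy_score(text: str) -> float:
    """Placeholder judge: counts ':a' tags; a real judge would rate responses."""
    return float(text.count(":a"))
```

Because each persona's search is independent of the others, the loop in `pcap_search` could be handed to a worker pool such as `concurrent.futures.ThreadPoolExecutor` to run the searches in parallel. Swapping `beam_search` for a different search routine would leave the persona conditioning untouched, which is one way to read the abstract's claim that the approach is orthogonal to the underlying search algorithm.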
