Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh, David Zhang, Eric Hsin, Li Chen, Ankit Jain, Matt Fredrikson, Akash Bharadwaj

TL;DR
Jailbreak-Zero introduces a policy-based red teaming method for LLMs that generates diverse, high-quality adversarial prompts, achieving superior attack success rates with minimal human effort, thus enhancing safety evaluation.
Contribution
It presents a novel, scalable, policy-based red teaming framework that outperforms existing methods in LLM safety testing by leveraging attack LLMs and fine-tuning with preference data.
Findings
Higher attack success rates against GPT-40 and Claude 3.5.
Produces human-readable, effective adversarial prompts.
Achieves Pareto optimality in policy coverage, diversity, and fidelity.
Abstract
This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-40 and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Topic Modeling
