Best-of-N Jailbreaking
John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma

TL;DR
Best-of-N Jailbreaking is a black-box method that repeatedly perturbs prompts across modalities to successfully elicit harmful responses from AI systems, revealing their vulnerability to simple input variations.
Contribution
Introduces a versatile, modality-agnostic black-box attack method that significantly improves attack success rates by sampling augmented prompts, demonstrating widespread vulnerabilities in AI models.
Findings
Achieves high attack success rates on closed-source models: 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts.
Effectively circumvents state-of-the-art defenses such as circuit breakers.
Extends successfully to vision and audio language models using modality-specific augmentations.
Abstract
We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented…
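The abstract reports attack success rate (ASR) at a fixed sampling budget (10,000 augmented prompts). As a minimal evaluation-side sketch of how such a number can be tallied, the snippet below computes ASR at several budgets from a log of first-success indices. The function name `asr_at_n`, the log format, and the example values are hypothetical illustrations, not taken from the paper or its code.

```python
from typing import Optional, Sequence


def asr_at_n(first_success: Sequence[Optional[int]], n: int) -> float:
    """Fraction of prompts whose first flagged sample (1-based index) is <= n.

    first_success[i] is the index of the first sample for prompt i that a
    classifier flagged, or None if no sample was flagged within the budget.
    """
    if not first_success:
        raise ValueError("need at least one prompt")
    hits = sum(1 for k in first_success if k is not None and k <= n)
    return hits / len(first_success)


# Hypothetical log for five prompts (made-up values, for illustration only).
first_success = [3, None, 120, 7, 4500]

# ASR as a function of the sampling budget N.
curve = {n: asr_at_n(first_success, n) for n in (1, 10, 100, 1000, 10000)}
for n, asr in curve.items():
    print(f"N={n:>5}: ASR={asr:.2f}")
```

Because a prompt counts as broken as soon as any one of its first N samples is flagged, this quantity can only stay flat or rise as N grows.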
Taxonomy
Topics: Digital and Cyber Forensics
