TL;DR
This paper formalizes the jailbreak oracle problem for LLMs, introduces Boa, a system for efficient vulnerability testing, and enables systematic security assessments of language models.
Contribution
It presents Boa, the first system for efficiently solving the jailbreak oracle problem, advancing systematic LLM safety testing.
Findings
Boa enables rigorous security assessments of LLMs.
Systematic comparison of red team attacks is possible.
Model certification under adversarial conditions is facilitated.
Abstract
As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the jailbreak oracle problem: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges, as the search space grows exponentially with response length. We present Boa, the first system designed for efficiently solving the jailbreak oracle problem. Boa employs a two-phase search strategy: (1) breadth-first sampling to identify easily accessible jailbreaks, followed by (2) depth-first priority search guided by fine-grained…
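The oracle question and the two-phase search described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not Boa's implementation: the toy next-token table (standing in for a model plus decoding strategy), the keyword judge, and the function names are all hypothetical. The abstract is truncated before it says what guides the second phase, so this sketch simply orders prefixes by log-probability as a stand-in. Phase 1 is modeled as drawing independent samples. Phase 2 prunes any prefix whose likelihood has already dropped below the threshold, which is safe because extending a prefix can only lower its likelihood.

```python
import heapq
import math
import random

EOS = "<eos>"

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for model + decoding strategy: token -> probability."""
    if len(prefix) >= 3:
        return {EOS: 1.0}
    if not prefix:
        return {"sorry": 0.9, "sure": 0.1}
    if prefix[0] == "sorry":
        return {EOS: 1.0}
    if len(prefix) == 1:
        return {"here": 0.5, "but": 0.5}
    return {"steps": 0.6, EOS: 0.4}

def toy_is_jailbreak(response):
    """Hypothetical judge: flags responses that give steps without hedging."""
    return "steps" in response and "but" not in response

def jailbreak_oracle(next_probs, is_jailbreak, threshold,
                     n_samples=8, max_len=6, seed=0):
    """Return (response, likelihood) if a flagged response has likelihood
    >= threshold, else None."""
    log_tau = math.log(threshold)
    rng = random.Random(seed)

    # Phase 1 (assumed form): draw a few samples to catch easy jailbreaks.
    for _ in range(n_samples):
        prefix, logp = (), 0.0
        while len(prefix) < max_len:
            probs = next_probs(prefix)
            tokens = list(probs)
            tok = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
            logp += math.log(probs[tok])
            if tok == EOS:
                break
            prefix += (tok,)
        if logp >= log_tau and is_jailbreak(prefix):
            return prefix, math.exp(logp)

    # Phase 2 (assumed form): priority search, most likely prefix first.
    heap = [(0.0, ())]  # entries are (negative log-prob, prefix)
    while heap:
        neg_logp, prefix = heapq.heappop(heap)
        for tok, p in next_probs(prefix).items():
            if p <= 0.0:
                continue
            logp = -neg_logp + math.log(p)
            if logp < log_tau:
                continue  # no extension of this prefix can reach the threshold
            if tok == EOS or len(prefix) + 1 >= max_len:
                response = prefix if tok == EOS else prefix + (tok,)
                if is_jailbreak(response):
                    return response, math.exp(logp)
            else:
                heapq.heappush(heap, (-logp, prefix + (tok,)))
    return None
```

Calling `jailbreak_oracle(toy_next_token_probs, toy_is_jailbreak, threshold=0.01)` asks whether the toy model can emit a flagged response with likelihood of at least 0.01. Raising the threshold makes the pruning more aggressive, which is why the search stays tractable even though the space of responses grows exponentially with length.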
