Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem

Shuyi Lin; Anshuman Suri; Alina Oprea; Cheng Tan

arXiv:2506.17299 · cs.CR · April 27, 2026

TL;DR

This paper formalizes the jailbreak oracle problem for LLMs and introduces Boa, a system for solving it efficiently, which enables systematic security assessments of language models.

Contribution

The paper presents Boa, the first system for efficiently solving the jailbreak oracle problem, advancing systematic LLM safety testing.

Findings

01. Boa enables rigorous security assessments of LLMs.

02. Systematic comparison of red team attacks is possible.

03. Model certification under adversarial conditions is facilitated.

Abstract

As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the jailbreak oracle problem: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges, as the search space grows exponentially with response length. We present Boa, the first system designed for efficiently solving the jailbreak oracle problem. Boa employs a two-phase search strategy: (1) breadth-first sampling to identify easily accessible jailbreaks, followed by (2) depth-first priority search guided by fine-grained…
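The oracle question and the two-phase idea from the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not Boa's implementation: the four-token vocabulary, `next_token_probs` (a stand-in for a model plus decoding strategy), the `is_jailbreak` judge, and the reading of "likelihood" as the probability of the full response. Phase 1 is simplified to plain random sampling, and phase 2 is a best-first search ordered by cumulative log-probability, swapped in because the abstract's description of the guidance signal is truncated. Pruning relies on one fact: extending a prefix can never raise its probability, so any prefix at or below the threshold can be discarded.

```python
import heapq
import math
import random

def next_token_probs(prefix):
    """Hypothetical toy stand-in for an LLM plus decoding strategy."""
    last = prefix[-1] if prefix else None
    if last is None:
        return {"refuse": 0.9, "comply": 0.1}
    if last == "comply":
        return {"detail": 0.8, "<eos>": 0.2}
    return {"<eos>": 1.0}  # after "refuse" or "detail"

def is_jailbreak(response):
    """Hypothetical judge: flags a finished response that contains 'detail'."""
    return "detail" in response

def sample_response(rng, max_len):
    """Draw one response token by token and track its log-probability."""
    prefix, logp = [], 0.0
    while len(prefix) < max_len:
        probs = next_token_probs(prefix)
        tokens = list(probs)
        tok = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
        logp += math.log(probs[tok])
        prefix.append(tok)
        if tok == "<eos>":
            break
    return prefix, logp

def jailbreak_oracle(tau, n_samples=5, max_len=8, seed=0):
    """Return (response, probability) for a jailbreak with probability > tau, else None."""
    rng = random.Random(seed)
    log_tau = math.log(tau)

    # Phase 1 (simplified): cheap sampling to catch easily reachable jailbreaks.
    for _ in range(n_samples):
        resp, logp = sample_response(rng, max_len)
        if resp[-1] == "<eos>" and logp > log_tau and is_jailbreak(resp):
            return resp, math.exp(logp)

    # Phase 2 (assumed scoring): best-first search over prefixes,
    # most probable prefix first, pruning anything at or below tau.
    heap = [(0.0, [])]  # entries are (negative log-probability, prefix)
    while heap:
        neg_logp, prefix = heapq.heappop(heap)
        if prefix and prefix[-1] == "<eos>":
            if is_jailbreak(prefix):
                return prefix, math.exp(-neg_logp)
            continue
        if len(prefix) >= max_len:
            continue
        for tok, p in next_token_probs(prefix).items():
            child_logp = -neg_logp + math.log(p)
            if child_logp > log_tau:  # extensions cannot regain probability
                heapq.heappush(heap, (-child_logp, prefix + [tok]))
    return None

found = jailbreak_oracle(tau=0.05)
not_found = jailbreak_oracle(tau=0.5)
```

In this toy setup the only jailbreak response is `comply detail <eos>`, with probability 0.1 × 0.8 × 1.0 = 0.08, so the oracle answers yes for a threshold of 0.05 and no for a threshold of 0.5. A real system would replace the toy distribution with model logits filtered by the decoding strategy and would need a far more selective priority than raw log-probability, since the number of candidate responses grows exponentially with length.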

Code & Models

Repositories

shuyilinn/BOA/tree/mlsys2026ae (GitHub)
