MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

Jialin Song; Xiaodong Liu; Weiwei Yang; Wuyang Chen; Mingqian Feng; Xuekai Zhu; Jianfeng Gao

arXiv:2605.01687 · cs.CL · May 5, 2026

TL;DR

MultiBreak is a comprehensive, multi-turn jailbreak benchmark with over 10,000 prompts that reveals nuanced vulnerabilities in large language models by simulating realistic conversational attacks.

Contribution

It introduces a novel active learning pipeline to generate diverse, high-quality multi-turn adversarial prompts, significantly expanding the scope of safety evaluation benchmarks.

Findings

01

MultiBreak includes 10,389 prompts covering 2,665 harmful intents.

02

MultiBreak achieves an attack success rate up to 54% higher than existing datasets.

03

Diverse attack categories reveal vulnerabilities not seen in single-turn scenarios.

Abstract

We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark for evaluating large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, so they bypass safety-aligned LLMs more easily than single-turn jailbreaks do. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restricts their diversity. To address this gap, we unify a wide range of harmful jailbreak intents and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts, in which a generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. MultiBreak includes 10,389 multi-turn adversarial prompts, spans 2,665 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves up to a 54.0 and…
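The abstract does not say how uncertainty is measured in the refinement step. As a rough illustration of uncertainty-based selection in general, the sketch below ranks candidates by the binary entropy of a judge's estimated success probability and keeps the most ambiguous ones for further refinement. The entropy criterion, the function names `binary_entropy` and `select_uncertain`, and the scores are all assumptions made for illustration, not the authors' method.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) outcome; it peaks at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_uncertain(scores: dict[str, float], k: int) -> list[str]:
    """Return the k candidate IDs whose judged outcome is most uncertain.

    `scores` maps a candidate ID to a hypothetical judge's estimated
    probability that the candidate succeeds (an assumption, not from the paper).
    """
    ranked = sorted(scores, key=lambda cid: binary_entropy(scores[cid]), reverse=True)
    return ranked[:k]


# Hypothetical judge scores for four candidates.
scores = {"c1": 0.97, "c2": 0.52, "c3": 0.08, "c4": 0.35}
picked = select_uncertain(scores, k=2)
print(picked)  # ['c2', 'c4']
```

Candidates scored near 0.5 are the ones a judge is least sure about, so a pipeline of this kind would spend its refinement budget there rather than on clear successes or clear failures.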
