MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
Jialin Song, Xiaodong Liu, Weiwei Yang, Wuyang Chen, Mingqian Feng, Xuekai Zhu, Jianfeng Gao

TL;DR
MultiBreak is a comprehensive, multi-turn jailbreak benchmark with over 10,000 prompts that reveals nuanced vulnerabilities in large language models by simulating realistic conversational attacks.
Contribution
It introduces a novel active learning pipeline to generate diverse, high-quality multi-turn adversarial prompts, significantly expanding the scope of safety evaluation benchmarks.
Findings
MultiBreak includes 10,389 prompts covering 2,665 harmful intents.
MultiBreak achieves up to a 54% higher attack success rate than existing datasets.
Diverse attack categories reveal vulnerabilities not seen in single-turn scenarios.
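As a rough illustration of the metric behind the comparison above, attack success rate (ASR) is commonly computed as the fraction of attack attempts that a judge marks as successful. The sketch below assumes that definition. The summary does not say whether the reported gap is absolute (percentage points) or relative, so the code shows the percentage-point reading as one possibility. The function names and the outcome lists are hypothetical and are not taken from the paper.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attempts judged successful.

    `outcomes` is an iterable of booleans, one per attack attempt
    (True means the judge marked the attempt as a successful jailbreak).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def asr_gap_points(asr_a, asr_b):
    """Absolute gap between two ASR values, in percentage points."""
    return (asr_a - asr_b) * 100


# Hypothetical judge verdicts, made up purely for illustration.
benchmark_a = [True, True, True, False]
benchmark_b = [True, False, False, False]

asr_a = attack_success_rate(benchmark_a)
asr_b = attack_success_rate(benchmark_b)
print(f"ASR A = {asr_a:.2f}, ASR B = {asr_b:.2f}, "
      f"gap = {asr_gap_points(asr_a, asr_b):.1f} points")
```

Under a relative reading, the gap would instead be divided by the baseline ASR, which is why it matters which convention a reported figure uses.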
Abstract
We present MultiBreak, a scalable and diverse multi-turn jailbreak benchmark to evaluate large language model (LLM) safety. Multi-turn jailbreaks mimic natural conversational settings, which makes it easier for them to bypass safety-aligned LLMs than it is for single-turn jailbreaks. Existing multi-turn benchmarks are limited in size or rely heavily on templates, which restrict their diversity. To address this gap, we unify a wide range of harmful jailbreak intents and introduce an active learning pipeline for expanding high-quality multi-turn adversarial prompts, in which a generator is iteratively fine-tuned to produce stronger attack candidates, guided by uncertainty-based refinement. MultiBreak includes 10,389 multi-turn adversarial prompts, spans 2,665 distinct harmful intents, and covers the most diverse set of topics to date. Empirical evaluation shows that our benchmark achieves up to a 54.0 and…
