MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

Xinkai Zhang; Zhipeng Wei; Huanli Gong; Jing Ting Zheng; Yuchen Zhang; Yue Dong; N. Benjamin Erichson

arXiv:2605.11002 · cs.CR · May 13, 2026

TL;DR

MT-JailBench is a modular benchmarking framework designed to evaluate multi-turn jailbreak attacks on large language models, enabling fair comparison and analysis of attack components.

Contribution

MT-JailBench introduces a component-based evaluation framework for multi-turn jailbreaks that makes it possible to isolate how much each attack strategy and component contributes to the outcome.

Findings

1. Resource budgets and evaluation functions significantly affect attack rankings.

2. Prompt generation is the most influential component on attack success.

3. A recomposed attack using top components outperforms individual source attacks.

Abstract

Multi-turn jailbreaks exploit the ability of large language models to accumulate and act on conversational context. Instead of stating a harmful request directly, an attacker can gradually steer the conversation toward an unsafe answer. Recent methods demonstrate this risk, but they are usually evaluated as black-box pipelines with different budgets, judges, retry rules, and strategy generation procedures. As a result, it is often unclear whether reported gains reflect stronger attack mechanisms or different experimental conditions. We introduce MT-JailBench, a modular evaluation framework for benchmarking multi-turn jailbreaks under fixed conditions. MT-JailBench implements each attack as five interacting modules: evaluation function, attack strategy, prompt generation, prompt refinement, and flow control. This design enables fair comparison across attack methods and component-wise…
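
The five-module decomposition named in the abstract can be pictured as a small evaluation harness. The sketch below is only an illustration under stated assumptions: the class `Attack`, the function `run`, every field name, and the toy components are hypothetical and are not taken from the MT-JailBench code. The components are harmless string functions, chosen only to show how one shared loop with a fixed turn budget can drive interchangeable modules.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (prompt, response)


@dataclass
class Attack:
    """Hypothetical container for the five modules named in the abstract."""
    strategy: Callable[[str], str]              # goal -> high-level plan
    generate: Callable[[str, List[Turn]], str]  # plan, history -> next prompt
    refine: Callable[[str, str], str]           # prompt, last response -> revised prompt
    evaluate: Callable[[str, str], float]       # goal, response -> score in [0, 1]
    should_stop: Callable[[float, int], bool]   # flow control: best score, turns used


def run(attack: Attack, target: Callable[[str], str], goal: str,
        max_turns: int) -> Tuple[float, int]:
    """Run one conversation under a fixed turn budget shared by all methods."""
    plan = attack.strategy(goal)
    history: List[Turn] = []
    best = 0.0
    for turn in range(max_turns):
        prompt = attack.generate(plan, history)
        if history:  # refinement needs a previous response to react to
            prompt = attack.refine(prompt, history[-1][1])
        response = target(prompt)
        history.append((prompt, response))
        best = max(best, attack.evaluate(goal, response))
        if attack.should_stop(best, turn + 1):
            break
    return best, len(history)


# Toy stand-ins (assumptions for illustration only, no real model or judge).
toy = Attack(
    strategy=lambda goal: f"plan:{goal}",
    generate=lambda plan, hist: f"{plan}#{len(hist)}",
    refine=lambda prompt, last: prompt + "!",
    evaluate=lambda goal, resp: 1.0 if resp.endswith("!") else 0.0,
    should_stop=lambda score, turns: score >= 1.0,
)
toy_target = lambda prompt: prompt.upper()

print(run(toy, toy_target, "demo", max_turns=5))
print(run(toy, toy_target, "demo", max_turns=1))
```

In this toy setup the same components score differently under a one-turn budget than under a five-turn budget, which loosely mirrors the listed finding that resource budgets affect attack rankings. Swapping a single field of `Attack` while holding the rest fixed is one way to read the component-wise comparison the abstract describes.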
