MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks
Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng, Yuchen Zhang, Yue Dong, N. Benjamin Erichson

TL;DR
MT-JailBench is a modular benchmarking framework designed to evaluate multi-turn jailbreak attacks on large language models, enabling fair comparison and analysis of attack components.
Contribution
The paper introduces a flexible, component-based evaluation framework for multi-turn jailbreaks that clarifies how individual attack strategies and components affect results.
Findings
Resource budgets and evaluation functions significantly affect attack rankings.
Prompt generation is the component with the greatest influence on attack success.
A recomposed attack using top components outperforms individual source attacks.
Abstract
Multi-turn jailbreaks exploit the ability of large language models to accumulate and act on conversational context. Instead of stating a harmful request directly, an attacker can gradually steer the conversation toward an unsafe answer. Recent methods demonstrate this risk, but they are usually evaluated as black-box pipelines with different budgets, judges, retry rules, and strategy generation procedures. As a result, it is often unclear whether reported gains reflect stronger attack mechanisms or different experimental conditions. We introduce MT-JailBench, a modular evaluation framework for benchmarking multi-turn jailbreaks under fixed conditions. MT-JailBench implements each attack as five interacting modules: evaluation function, attack strategy, prompt generation, prompt refinement, and flow control. This design enables fair comparison across attack methods and component-wise…
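As a rough illustration of the five-module decomposition named in the abstract, the sketch below wires an evaluation function, attack strategy, prompt generation, prompt refinement, and flow control into one loop under a fixed query budget. Every name here (`Attack`, `run_episode`, `max_queries`, `threshold`) and every toy component is a hypothetical placeholder, not MT-JailBench's actual API; the abstract does not specify how the real framework connects its modules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (prompt, response)


@dataclass
class Attack:
    """Hypothetical container for the five modules named in the abstract."""
    strategy: Callable[[str], str]              # goal -> high-level plan
    generate: Callable[[str, List[Turn]], str]  # plan, history -> next prompt
    refine: Callable[[str, str], str]           # prompt, response -> revised prompt
    evaluate: Callable[[str, str], float]       # goal, response -> score in [0, 1]
    should_retry: Callable[[float], bool]       # flow control: retry this turn?


def run_episode(attack: Attack, target: Callable[[str], str], goal: str,
                max_queries: int, threshold: float = 0.5) -> Dict[str, object]:
    """Run one conversation; the harness, not the attack, fixes the budget."""
    plan = attack.strategy(goal)
    history: List[Turn] = []
    queries = 0
    best = 0.0
    while queries < max_queries:
        prompt = attack.generate(plan, history)
        response = target(prompt)
        queries += 1
        score = attack.evaluate(goal, response)
        # A retry consumes budget too, so retry rules cannot hide extra queries.
        if attack.should_retry(score) and queries < max_queries:
            prompt = attack.refine(prompt, response)
            response = target(prompt)
            queries += 1
            score = attack.evaluate(goal, response)
        history.append((prompt, response))
        best = max(best, score)
        if best >= threshold:
            break
    return {"success": best >= threshold, "queries": queries, "best_score": best}


# Toy, content-free stand-ins so the sketch runs without any model.
demo_attack = Attack(
    strategy=lambda goal: f"plan for {goal}",
    generate=lambda plan, history: f"turn {len(history) + 1} under {plan}",
    refine=lambda prompt, response: prompt + " (revised)",
    evaluate=lambda goal, response: 1.0 if "turn 3" in response else 0.0,
    should_retry=lambda score: False,
)


def demo_target(prompt: str) -> str:
    """Stand-in for the model under test: echoes the prompt back."""
    return f"echo: {prompt}"


if __name__ == "__main__":
    print(run_episode(demo_attack, demo_target, "demo goal", max_queries=5))
    print(run_episode(demo_attack, demo_target, "demo goal", max_queries=2))
```

Because each module is a plain callable, one component can be swapped while the others and the budget stay fixed, which is the kind of component-wise comparison the abstract describes. In this toy setup the same attack succeeds with a budget of five queries and fails with two, a small illustration of why budgets need to be held constant when comparing methods.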
