Multi-Turn Jailbreaks Are Simpler Than They Seem
Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz

TL;DR
This paper empirically analyzes automated multi-turn jailbreak attacks on state-of-the-art LLMs and finds that they are simpler than they appear: their effectiveness is roughly comparable to resampling single-turn attacks several times, which has implications for AI safety.
Contribution
It shows that multi-turn jailbreaks are approximately equivalent to repeatedly resampling single-turn attacks, that attack success is correlated among similar models, and that higher reasoning effort can increase attack success rates.
Findings
Multi-turn jailbreaks are roughly equivalent to resampling single-turn attacks.
Attack success is correlated among similar models, which makes newly released models easier to jailbreak.
Higher reasoning effort can increase attack success rates.
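To make the resampling comparison concrete, here is a minimal sketch of the baseline it implies. It is not taken from the paper: it assumes each single-turn attempt succeeds independently with the same probability, and the function name `resampled_success_rate` and its parameters are hypothetical. Under that assumption, the chance that at least one of k attempts succeeds is 1 - (1 - p)^k.

```python
def resampled_success_rate(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` tries succeeds.

    Assumption (not from the paper): every attempt is independent and
    succeeds with the same probability `p_single`.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # All attempts fail with probability (1 - p)^k; take the complement.
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # Illustrative values only; these are not measurements from the paper.
    for k in (1, 2, 5, 10):
        print(k, resampled_success_rate(0.5, k))
```

Under this simplified model, an evaluation that allows a multi-turn attack k turns would be compared against a single-turn attack given k independent samples, rather than against a single attempt.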
Abstract
While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized for single-turn protection. This work presents an empirical analysis of automated multi-turn jailbreak attacks across state-of-the-art models including GPT-4, Claude, and Gemini variants, using the StrongREJECT benchmark. Our findings challenge the perceived sophistication of multi-turn attacks: when accounting for the attacker's ability to learn from how models refuse harmful requests, multi-turn jailbreaking approaches are approximately equivalent to simply resampling single-turn attacks multiple times. Moreover, attack success is correlated among similar models, making it easier to jailbreak newly released ones. Additionally, for reasoning models,…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Topic Modeling
