Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks
Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Jason Zhang, Julius, Broomfield, Sara Pieri, Reihaneh Iranmanesh, Reihaneh Rabbany, Kellin Pelrine

TL;DR
This paper introduces a dataset to study vulnerabilities in large language models against jailbreak attacks in both single and multi-turn formats, revealing that defenses effective in one setting may not work in the other.
Contribution
It provides a new dataset for evaluating jailbreak vulnerabilities in LLMs across different input structures, highlighting the need for comprehensive defenses.
Findings
Jailbreak success varies between single and multi-turn inputs.
Filter guardrails perform differently depending on input structure.
Studying both input formats is crucial for robust LLM defenses.
Abstract
Large language models (LLMs) are improving at an exceptional rate. However, these models are still susceptible to jailbreak attacks, which are becoming increasingly dangerous as models become increasingly powerful. In this work, we introduce a dataset of jailbreaks where each example can be input in both a single or a multi-turn format. We show that while equivalent in content, they are not equivalent in jailbreak success: defending against one structure does not guarantee defense against the other. Similarly, LLM-based filter guardrails also perform differently depending on not just the input content but the input structure. Thus, vulnerabilities of frontier models should be studied in both single and multi-turn settings; this dataset provides a tool to do so.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning
