Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak   Attacks

Tom Gibbs; Ethan Kosak-Hine; George Ingebretsen; Jason Zhang; Julius; Broomfield; Sara Pieri; Reihaneh Iranmanesh; Reihaneh Rabbany; Kellin Pelrine

arXiv:2409.00137·cs.CR·September 4, 2024

Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks

Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Jason Zhang, Julius, Broomfield, Sara Pieri, Reihaneh Iranmanesh, Reihaneh Rabbany, Kellin Pelrine

PDF

Open Access 2 Datasets

TL;DR

This paper introduces a dataset to study vulnerabilities in large language models against jailbreak attacks in both single and multi-turn formats, revealing that defenses effective in one setting may not work in the other.

Contribution

It provides a new dataset for evaluating jailbreak vulnerabilities in LLMs across different input structures, highlighting the need for comprehensive defenses.

Findings

01

Jailbreak success varies between single and multi-turn inputs.

02

Filter guardrails perform differently depending on input structure.

03

Studying both input formats is crucial for robust LLM defenses.

Abstract

Large language models (LLMs) are improving at an exceptional rate. However, these models are still susceptible to jailbreak attacks, which are becoming increasingly dangerous as models become increasingly powerful. In this work, we introduce a dataset of jailbreaks where each example can be input in both a single or a multi-turn format. We show that while equivalent in content, they are not equivalent in jailbreak success: defending against one structure does not guarantee defense against the other. Similarly, LLM-based filter guardrails also perform differently depending on not just the input content but the input structure. Thus, vulnerabilities of frontier models should be studied in both single and multi-turn settings; this dataset provides a tool to do so.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning