A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks

Blake Bullwinkel; Mark Russinovich; Ahmed Salem; Santiago Zanella-Beguelin; Daniel Jones; Giorgio Severi; Eugenia Kim; Keegan Hines; Amanda Minnich; Yonatan Zunger; Ram Shankar Siva Kumar

arXiv:2507.02956 · cs.CR · July 8, 2025

TL;DR

This paper studies how the Crescendo multi-turn jailbreak bypasses safety defenses at the level of intermediate model representations. It finds that, turn after turn, Crescendo prompts tend to keep model outputs in a "benign" region of representation space.

Contribution

The paper presents a representation-level analysis of multi-turn jailbreaks, helps explain why existing single-turn defenses fail against them, and suggests the need for new mitigation strategies.

Findings

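01 Crescendo prompts tend to keep model outputs in a "benign" region of representation space at each turn.

02 Safety-aligned models often represent Crescendo responses as more benign than harmful, especially as the number of conversation turns increases.

03 Single-turn defenses such as circuit breakers are generally ineffective against multi-turn attacks.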

Abstract

Recent research has demonstrated that state-of-the-art LLMs and defenses remain susceptible to multi-turn jailbreak attacks. These attacks require only closed-box model access and are often easy to perform manually, posing a significant threat to the safe and secure deployment of LLM-based systems. We study the effectiveness of the Crescendo multi-turn jailbreak at the level of intermediate model representations and find that safety-aligned LMs often represent Crescendo responses as more benign than harmful, especially as the number of conversation turns increases. Our analysis indicates that at each turn, Crescendo prompts tend to keep model outputs in a "benign" region of representation space, effectively tricking the model into fulfilling harmful requests. Further, our results help explain why single-turn jailbreak defenses like circuit breakers are generally ineffective against…
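To make the idea of a "benign" versus "harmful" region of representation space concrete, here is a minimal sketch of a difference-of-means linear probe. It is not the paper's method: the probe, the function names, and the synthetic Gaussian vectors standing in for hidden states are all assumptions chosen for illustration. In practice the vectors would be intermediate activations extracted from a model for labeled benign and harmful responses.

```python
import numpy as np


def fit_mean_difference_probe(benign, harmful):
    """Fit a difference-of-means probe (hypothetical helper, not from the paper).

    Returns a unit direction pointing from the benign mean toward the harmful
    mean, plus a threshold at the midpoint of the two projected means.
    """
    mu_benign = benign.mean(axis=0)
    mu_harmful = harmful.mean(axis=0)
    direction = mu_harmful - mu_benign
    direction = direction / np.linalg.norm(direction)
    threshold = 0.5 * (mu_benign + mu_harmful) @ direction
    return direction, threshold


def harmfulness_score(reps, direction, threshold):
    """Signed distance from the midpoint: > 0 harmful side, < 0 benign side."""
    return reps @ direction - threshold


# Synthetic stand-ins for hidden states (assumption: real activations would
# come from an intermediate layer of the model under study).
rng = np.random.default_rng(0)
dim = 16
benign = rng.normal(loc=-1.0, scale=0.5, size=(200, dim))
harmful = rng.normal(loc=1.0, scale=0.5, size=(200, dim))

direction, threshold = fit_mean_difference_probe(benign, harmful)

# One vector per conversation turn (again synthetic), scored turn by turn.
turn_reps = rng.normal(loc=-0.5, scale=0.5, size=(5, dim))
turn_scores = harmfulness_score(turn_reps, direction, threshold)
turn_labels = np.where(turn_scores > 0, "harmful", "benign")
```

Tracking `turn_scores` across a conversation is one way to ask the question the abstract raises: whether the responses stay on the benign side of such a boundary as the number of turns grows.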
