RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

Yifan Jiang; Kriti Aggarwal; Tanmay Laud; Kashif Munir; Jay Pujara; Subhabrata Mukherjee

arXiv:2409.17458 · cs.CR · June 10, 2025

TL;DR

This paper introduces RED QUEEN ATTACK, a multi-turn jailbreak method exposing vulnerabilities in large language models, and proposes RED QUEEN GUARD to significantly improve their safety against such covert attacks.

Contribution

It presents a novel multi-turn jailbreak approach and a mitigation strategy, addressing the gap in current single-turn attack methods and enhancing LLM security.

Findings

01. All tested LLMs are vulnerable to RED QUEEN ATTACK.

02. Larger models are more susceptible to multi-turn jailbreaks.

03. The proposed mitigation reduces the attack success rate to below 1%.

Abstract

The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. To bridge this gap, we first propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns…

Code & Models

Repositories

kriti-hippo/red_queen
PyTorch · Official

Datasets

YifanJ/Red_Queen
dataset · 48 dl

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Artificial Intelligence in Law