Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models

Zonghao Ying; Deyue Zhang; Zonglei Jing; Yisong Xiao; Quanchen Zou; Aishan Liu; Siyuan Liang; Xiangzheng Zhang; Xianglong Liu; Dacheng Tao

arXiv:2502.11054 · cs.CL · March 12, 2025

Open Access · 1 Repo

TL;DR

This paper introduces Reasoning-Augmented Conversation (RACE), a multi-turn jailbreak framework that reformulates harmful queries as benign reasoning tasks and uses iterative reasoning to bypass safety alignment in large language models, with reported attack success rates of 82% and 92% against OpenAI o1 and DeepSeek R1.

Contribution

The paper presents a new multi-turn jailbreak approach leveraging reasoning capabilities, systematic problem translation, and feedback modules to improve attack success rates against LLMs.

Findings

01 Increases attack success rates (ASRs) by up to 96%.

02 Attains ASRs of 82% and 92% against OpenAI o1 and DeepSeek R1, respectively.

03 Demonstrates effectiveness in complex conversational scenarios.

Abstract

Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation, a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to…

Code & Models

Repositories

ny1024/race (PyTorch, official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Deception detection and forensic psychology