Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, Dacheng Tao

TL;DR
This paper introduces Reasoning-Augmented Conversation (RACE), a multi-turn jailbreak framework that recasts harmful queries as benign reasoning tasks and uses iterative reasoning to bypass safety alignment in large language models, reaching attack success rates of 82% against OpenAI o1 and 92% against DeepSeek R1.
Contribution
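The paper presents a multi-turn jailbreak approach that exploits LLMs' reasoning capabilities. An attack state machine models problem translation and iterative reasoning, and gain-guided exploration, self-play, and rejection feedback modules build on it to raise attack success rates against LLMs.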
Findings
Increases attack success rates by up to 96%.
Attains success rates of 82% and 92% against OpenAI o1 and DeepSeek R1, respectively.
Demonstrates effectiveness in complex conversational scenarios.
Abstract
Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation, a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Deception detection and forensic psychology
