A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
Yang Yao, Xuan Tong, Ruofan Wang, Yixu Wang, Lujundong Li, Liang Liu, Yan Teng, Yingchun Wang

TL;DR
This paper introduces Mousetrap, an attack framework that exploits the reasoning process of large reasoning models (LRMs) to bypass their safety measures with high success rates, exposing vulnerabilities inherent in that reasoning process.
Contribution
The paper presents what the authors describe as the first jailbreak attack targeting LRMs. It uses a Chaos Machine to transform attack prompts through diverse one-to-one mappings that are embedded into the reasoning chain, and it reports significantly improved attack success rates.
Findings
Mousetrap achieves attack success rates of up to 98% across various LRMs.
Success rates remain high on safety-focused models such as Claude-Sonnet.
The attack is reported to be effective on safety benchmarks and against toxic content filters.
Abstract
Large Reasoning Models (LRMs) have significantly advanced beyond traditional Large Language Models (LLMs) with their exceptional logical reasoning capabilities, yet these improvements introduce heightened safety risks. When subjected to jailbreak attacks, their ability to generate more targeted and organized content can lead to greater harm. Although some studies claim that reasoning enables safer LRMs against existing LLM attacks, they overlook the inherent flaws within the reasoning process itself. To address this gap, we propose the first jailbreak attack targeting LRMs, exploiting their unique vulnerabilities stemming from the advanced reasoning capabilities. Specifically, we introduce a Chaos Machine, a novel component to transform attack prompts with diverse one-to-one mappings. The chaos mappings iteratively generated by the machine are embedded into the reasoning chain, which…
Taxonomy
Topics: Artificial Intelligence in Games · Computability, Logic, AI Algorithms
