Large Reasoning Models Are Autonomous Jailbreak Agents
Thilo Hagendorff, Erik Derner, Nuria Oliver

TL;DR
Large reasoning models can autonomously and effectively jailbreak other AI models, posing significant safety challenges and highlighting the need for improved alignment and safety measures.
Contribution
This paper demonstrates that large reasoning models can jailbreak other AI models autonomously. This makes jailbreaking scalable and accessible to non-experts, and it constitutes a novel threat to AI safety.
Findings
Overall attack success rate of 97.14% across all attacker-target model combinations
LRMs can systematically erode safety guardrails of other models
Autonomous jailbreaking by LRMs is scalable and requires no supervision
Abstract
Jailbreaking -- bypassing built-in safety mechanisms in AI models -- has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of…
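To make the scale of this setup concrete, the following is a minimal sketch of how such an evaluation grid and an aggregate attack success rate might be tallied. The model and item counts come from the abstract. Everything else is an assumption made for illustration: the helper names `grid_size` and `item_level_asr` are hypothetical, the grid assumes one conversation per attacker, target, and item, and the "item counts as jailbroken if at least one attacker-target pair succeeds" definition is only one plausible reading, since the abstract does not define the metric. The toy outcomes are invented and are not results from the paper.

```python
from itertools import product

# Counts taken from the abstract.
N_ATTACKERS = 4  # DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B
N_TARGETS = 9
N_ITEMS = 70


def grid_size(n_attackers: int, n_targets: int, n_items: int) -> int:
    """Conversations needed if every attacker meets every target once per item.

    One conversation per combination is an assumption, not a detail from the abstract.
    """
    return n_attackers * n_targets * n_items


def item_level_asr(outcomes: dict[tuple[str, str, int], bool], n_items: int) -> float:
    """Fraction of items jailbroken by at least one attacker-target pair.

    This aggregation rule is an assumed definition chosen for illustration.
    """
    broken = {item for (_, _, item), success in outcomes.items() if success}
    return len(broken) / n_items


# Invented toy outcomes: 2 attackers, 2 targets, 4 items. Not data from the paper.
toy_attackers = ["attacker_a", "attacker_b"]
toy_targets = ["target_x", "target_y"]
toy_successes = {
    ("attacker_a", "target_x", 0),
    ("attacker_b", "target_y", 1),
    ("attacker_a", "target_y", 2),
}
toy_outcomes = {
    key: key in toy_successes
    for key in product(toy_attackers, toy_targets, range(4))
}

print(grid_size(N_ATTACKERS, N_TARGETS, N_ITEMS))  # 2520
print(item_level_asr(toy_outcomes, n_items=4))     # 0.75
```

Under the one-conversation-per-combination assumption, the full grid comes to 4 × 9 × 70 = 2520 multi-turn conversations. In the toy data, three of the four items are broken by at least one pair, giving 0.75.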
