Large Reasoning Models Are Autonomous Jailbreak Agents

Thilo Hagendorff; Erik Derner; Nuria Oliver

arXiv:2508.04039·cs.CL·February 10, 2026


TL;DR

Large reasoning models can autonomously and effectively jailbreak other AI models, posing significant safety challenges and highlighting the need for improved alignment and safety measures.

Contribution

This paper demonstrates that large reasoning models can carry out jailbreaks autonomously, which makes the attack scalable and accessible to non-experts and constitutes a novel threat to AI safety.

Findings

01. The LRM adversaries achieved an overall attack success rate of 97.14% across all model combinations.

02. LRMs can systematically erode the safety guardrails of other models.

03. Autonomous jailbreaking by LRMs is scalable and requires no human supervision after the initial system prompt.

Abstract

Jailbreaking -- bypassing built-in safety mechanisms in AI models -- has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of…
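To make the size of the experimental grid described in the abstract concrete, the following is a minimal sketch of the bookkeeping only. It assumes that every attacker-target pair is run once on every benchmark item and that each conversation receives a single success/failure label. Neither assumption is confirmed by the excerpt above, and the names `targets`, `items`, `experiment_grid`, and `attack_success_rate` are hypothetical. The level at which the paper aggregates its success rate may differ from the simple per-conversation fraction shown here.

```python
from itertools import product

# Attacker models as listed in the abstract.
ATTACKERS = ["DeepSeek-R1", "Gemini 2.5 Flash", "Grok 3 Mini", "Qwen3 235B"]

# The abstract gives only counts for targets and benchmark items,
# so the identifiers generated below are hypothetical placeholders.
NUM_TARGETS = 9
NUM_ITEMS = 70

targets = [f"target_{i}" for i in range(1, NUM_TARGETS + 1)]
items = [f"item_{i}" for i in range(1, NUM_ITEMS + 1)]


def experiment_grid(attackers, targets, items):
    """Return every (attacker, target, item) triple in the full grid.

    Assumption: each attacker-target pair is evaluated on each item once.
    """
    return list(product(attackers, targets, items))


def attack_success_rate(outcomes):
    """Fraction of recorded conversations labeled as successful.

    `outcomes` maps (attacker, target, item) -> bool. How a conversation is
    judged successful is not specified in the excerpt; this is an assumption.
    """
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return sum(outcomes.values()) / len(outcomes)


grid = experiment_grid(ATTACKERS, targets, items)
print(len(ATTACKERS) * NUM_TARGETS)  # 36 attacker-target pairs
print(len(grid))                     # 2520 conversations under the assumption above

# Made-up labels for illustration only; these are not results from the paper.
toy_outcomes = {
    grid[0]: True,
    grid[1]: True,
    grid[2]: False,
    grid[3]: True,
}
print(attack_success_rate(toy_outcomes))  # 0.75
```

Keeping the outcome table keyed by the full (attacker, target, item) triple means the same data can later be grouped per attacker, per target, or per item without re-running anything.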
