AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models
Jiacheng Liang, Tanqiu Jiang, Yuhui Wang, Rongyi Zhu, Fenglong Ma, Ting Wang

TL;DR
AutoRAN is an automated framework that hijacks the safety reasoning of large reasoning models (LRMs) by simulating and exploiting their reasoning patterns, exposing weaknesses in their safety guardrails.
Contribution
It pioneers an execution simulation paradigm for hijacking safety reasoning in large reasoning models (LRMs), reaching a success rate approaching 100% against state-of-the-art LRMs and their reasoning-based defenses.
Findings
AutoRAN achieves a success rate approaching 100% in hijacking safety reasoning, within one or a few turns.
It effectively bypasses reasoning-based safety defenses in large reasoning models (LRMs).
The approach exposes the vulnerability of reasoning traces as an attack surface.
Abstract
This paper presents AutoRAN, the first framework to automate the hijacking of internal safety reasoning in large reasoning models (LRMs). At its core, AutoRAN pioneers an execution simulation paradigm that leverages a weaker but less-aligned model to simulate execution reasoning for initial hijacking attempts and to iteratively refine attacks by exploiting reasoning patterns leaked through the target LRM's refusals. This approach steers the target model to bypass its own safety guardrails and elaborate on harmful instructions. We evaluate AutoRAN against state-of-the-art LRMs, including GPT-o3/o4-mini and Gemini-2.5-Flash, across multiple benchmarks (AdvBench, HarmBench, and StrongReject). Results show that AutoRAN achieves a success rate approaching 100% within one or a few turns, effectively neutralizing reasoning-based defenses even when evaluated by robustly aligned external models. This…
