When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao; Chunkang Zhang; Junxiang Wang; Xinyan Guan; Boxi Cao; Yaojie Lu; Hongyu Lin; Xianpei Han; Le Sun

arXiv:2510.21285 · cs.AI · April 27, 2026

TL;DR

This paper identifies a new safety failure mode in large reasoning models called Self-Jailbreak, where models initially recognize harm but override safety judgments during reasoning, and proposes a targeted training method to mitigate it.

Contribution

The paper uncovers Self-Jailbreak as a novel safety failure in LRMs and introduces Chain-of-Guardrail (CoG), a step-level training framework to address it while preserving reasoning ability.

Findings

01 CoG effectively reduces Self-Jailbreak incidents.

02 CoG maintains strong reasoning performance.

03 Experiments show an improved balance between safety and reasoning.

Abstract

Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. This phenomenon reveals that LRMs are capable of recognizing harm, and that safety failures primarily arise during the reasoning steps. Motivated by this finding, we propose Chain-of-Guardrail (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via…
