Internalizing Safety Understanding in Large Reasoning Models via Verification

Yi Zhang; Yuxin Chen; Leheng Sheng; Dongcheng Zhang; Chaochao Lu; Xiang Wang; An Zhang

arXiv:2605.08930·cs.AI·May 12, 2026

TL;DR

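This paper introduces SInternal (Safety Internal), a framework that trains large reasoning models to verify the safety of their own generated answers. The authors report improved robustness to jailbreaks and a stronger starting point for subsequent reinforcement learning than traditional alignment methods.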

Contribution

The paper proposes a safety internalization approach that trains models exclusively on safety verification tasks, with the aim of strengthening their intrinsic safety understanding and their robustness against jailbreaks.

Findings

1. Models trained with SInternal are better at verifying the safety of their own responses.

2. SInternal improves robustness against out-of-domain jailbreaks.

3. SInternal provides a superior initialization for subsequent reinforcement learning.

Abstract

While explicit Chain-of-Thought (CoT) empowers large reasoning models (LRMs), it enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong…
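As a rough illustration of the verification-only training idea described in the abstract, the sketch below formats one supervised example in which the model is asked to judge the safety of a response and the target is an expert reasoning trajectory followed by a verdict. Everything here is an assumption made for illustration: the `VerificationExample` fields, the `TEMPLATE` wording, the `safe`/`unsafe` label set, and the `to_sft_pair` helper are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class VerificationExample:
    """One safety verification example (hypothetical schema, not from the paper)."""
    prompt: str            # the user request shown to the model
    response: str          # the model's own generated answer to that request
    expert_reasoning: str  # expert critique trajectory used as supervision
    verdict: str           # assumed label set: "safe" or "unsafe"


# Hypothetical instruction template; the paper's actual wording is not given here.
TEMPLATE = (
    "Request:\n{prompt}\n\n"
    "Response:\n{response}\n\n"
    "Is the response safe? Reason step by step, then answer 'safe' or 'unsafe'."
)


def to_sft_pair(ex: VerificationExample) -> dict:
    """Turn a verification example into an input/target pair for fine-tuning."""
    if ex.verdict not in ("safe", "unsafe"):
        raise ValueError(f"unexpected verdict: {ex.verdict!r}")
    source = TEMPLATE.format(prompt=ex.prompt, response=ex.response)
    # The target is the critique followed by the final verdict, so the model
    # learns to reason about the response before committing to a judgment.
    target = f"{ex.expert_reasoning}\nVerdict: {ex.verdict}"
    return {"input": source, "target": target}


example = VerificationExample(
    prompt="How do I reset my router?",
    response="Hold the reset button for ten seconds.",
    expert_reasoning="The response gives ordinary device instructions.",
    verdict="safe",
)
pair = to_sft_pair(example)
```

In this sketch the supervision signal is the critique plus verdict rather than a refusal or a rewritten answer, which is one plausible reading of training "exclusively on safety verification tasks".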

Code & Models

Repositories

AlphaLab-USTC/SInternal (GitHub)
