TL;DR
This paper introduces SInternal, a framework that trains large reasoning models to verify the safety of their own responses, improving robustness to jailbreaks and providing a stronger foundation for alignment than methods that rely on externally enforced compliance.
Contribution
The paper proposes a safety internalization approach: models are trained on safety verification tasks in which they critique their own generated answers, which strengthens their intrinsic safety understanding and their robustness against jailbreaks.
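As a rough illustration of what a safety verification training example could look like, the sketch below packs a request, the model's own answer, an expert reasoning trajectory, and a verdict into a supervised input/target pair. The class name, field names, prompt wording, and the "safe"/"unsafe" verdict format are all hypothetical choices made for this sketch; the paper's actual data format is not given here.

```python
from dataclasses import dataclass


@dataclass
class VerificationExample:
    """One hypothetical safety verification sample (field names are assumptions)."""
    prompt: str            # the user request shown to the model
    response: str          # the model's own generated answer to be critiqued
    expert_reasoning: str  # expert reasoning trajectory explaining the judgment
    is_safe: bool          # final safety verdict for the response


def to_sft_pair(example: VerificationExample) -> tuple[str, str]:
    """Format a sample as (input text, target text) for supervised fine-tuning.

    The template below is an assumed layout, not the paper's.
    """
    source = (
        "Assess whether the response below is safe.\n"
        f"Request: {example.prompt}\n"
        f"Response: {example.response}\n"
    )
    verdict = "safe" if example.is_safe else "unsafe"
    # The target is the critique followed by an explicit verdict line.
    target = f"{example.expert_reasoning}\nVerdict: {verdict}"
    return source, target


example = VerificationExample(
    prompt="How do I pick a lock?",
    response="Here are step-by-step instructions...",
    expert_reasoning="The answer gives actionable bypass steps without context.",
    is_safe=False,
)
source_text, target_text = to_sft_pair(example)
```

In this layout the model is only ever asked to judge a finished response, which mirrors the idea of training on verification rather than on prompt-level refusal.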
Findings
Models trained with SInternal better verify their responses' safety.
SInternal improves robustness against out-of-domain jailbreaks.
Models trained with SInternal serve as a superior initialization for subsequent reinforcement learning.
Abstract
While explicit Chain-of-Thought (CoT) empowers large reasoning models (LRMs), it enables the generation of riskier final answers. Current alignment paradigms primarily rely on externally enforced compliance, optimizing models to detect malicious prompts rather than evaluating the safety of their own outputs. We argue that this approach remains largely behavioral: our empirical analysis reveals that ostensibly aligned models lack intrinsic safety understanding, often failing to verify their own response safety and remaining vulnerable to adversarial jailbreaks. To address this fundamental limitation, we propose Safety Internal (SInternal), a framework that internalizes safety specifications by training LRMs exclusively on safety verification tasks to critique their own generated answers using expert reasoning trajectories. We demonstrate that learning to verify induces a strong…
