THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Seanie Lee; Sangwoo Park; Yumin Choi; Gyeongman Kim; Minki Kang; Jihun Yun; Dongmin Park; Jongho Park; Sung Ju Hwang

arXiv:2601.23143·cs.AI·May 14, 2026

TL;DR

ThinkSafe is a self-generated alignment framework that restores safety in reasoning models without external teachers, preserving reasoning performance at lower computational cost.

Contribution

The paper introduces a self-generated safety alignment method grounded in a KL-projection analysis: the student's own safety-filtered distribution is the KL-optimal target, so safety improves without an external teacher and with less compute.
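The KL-projection claim can be illustrated with a short derivation. This is a sketch in notation of our own choosing, not the paper's: $\pi$ is the student's distribution over responses, $S$ is the set of safe responses with $\pi(S) > 0$, and $\pi_S$ is $\pi$ conditioned on $S$. We also assume the projection minimizes $\mathrm{KL}(q \,\|\, \pi)$ over distributions $q$ supported on $S$; the paper may state the objective differently.

```latex
\[
\pi_S(y) = \frac{\pi(y)\,\mathbf{1}[y \in S]}{\pi(S)}
\]
For any $q$ supported on $S$:
\begin{align*}
\mathrm{KL}(q \,\|\, \pi)
  &= \sum_{y \in S} q(y) \log \frac{q(y)}{\pi(y)} \\
  &= \sum_{y \in S} q(y) \log \frac{q(y)}{\pi_S(y)\,\pi(S)} \\
  &= \mathrm{KL}(q \,\|\, \pi_S) - \log \pi(S).
\end{align*}
```

Since $\mathrm{KL}(q \,\|\, \pi_S) \ge 0$ with equality only when $q = \pi_S$, the minimum value $-\log \pi(S)$ is attained uniquely by the safety-filtered distribution. Under these assumptions, any other target on $S$, such as an external teacher, pays the extra term $\mathrm{KL}(q \,\|\, \pi_S)$.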

Findings

01. Significantly improves safety while preserving reasoning ability.

02. Achieves stronger safety than existing methods with comparable reasoning performance.

03. Requires roughly an order of magnitude less compute.

Abstract

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We formalize safety realignment as a KL projection onto the safe simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty. Guided by this analysis, we propose ThinkSafe, a self-generated alignment framework that restores safety without external teachers. Our key insight is that while compliance suppresses safety…
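The abstract's projection claim can be checked numerically on a toy example. The sketch below is illustrative only and is not the paper's code: the five-response distributions, the safe mask, and the helper names `kl` and `safety_filtered` are all hypothetical. It also assumes the projection minimizes KL(q || student) over distributions q supported on the safe set.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as lists (0 * log 0 = 0)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def safety_filtered(p, safe):
    """Condition p on the safe set: zero out unsafe mass and renormalize."""
    mass = sum(pi for pi, s in zip(p, safe) if s)
    return [pi / mass if s else 0.0 for pi, s in zip(p, safe)], mass

# Hypothetical toy distributions over five candidate responses.
student = [0.40, 0.25, 0.20, 0.10, 0.05]
safe = [False, True, True, False, True]   # assumed safety labels
teacher = [0.0, 0.10, 0.30, 0.0, 0.60]    # an external target, safe-only

student_safe, safe_mass = safety_filtered(student, safe)

kl_self = kl(student_safe, student)    # equals -log(safe_mass)
kl_teacher = kl(teacher, student)      # cost of moving to the teacher
excess = kl(teacher, student_safe)     # extra cost over the self target

print(f"safe mass under student : {safe_mass:.4f}")
print(f"KL(self-filtered||student): {kl_self:.4f}")
print(f"KL(teacher||student)      : {kl_teacher:.4f}")
print(f"excess KL(teacher||self)  : {excess:.4f}")
```

In this toy setting the teacher's cost splits into the self-filtered cost plus a non-negative excess term, which matches the abstract's statement that an external teacher incurs an extra KL penalty relative to the student's own safety-filtered distribution.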

Code & Models

Repositories

seanie12/ThinkSafe (GitHub)
