THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
Seanie Lee, Sangwoo Park, Yumin Choi, Gyeongman Kim, Minki Kang, Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang

TL;DR
ThinkSafe is a self-generated alignment framework that restores the safety of reasoning models without an external teacher, preserving reasoning performance at lower computational cost.
Contribution
It introduces a self-generated safety alignment method based on KL projection, improving safety without external teachers and reducing compute requirements.
Findings
Significantly improves safety while preserving reasoning ability.
Achieves superior safety and comparable reasoning to existing methods.
Requires roughly an order of magnitude less compute.
Abstract
Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We formalize safety realignment as a KL projection onto the safe simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty. Guided by this analysis, we propose ThinkSafe, a self-generated alignment framework that restores safety without external teachers. Our key insight is that while compliance suppresses safety…
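The KL-projection claim in the abstract can be illustrated with a toy numerical sketch. This is not the authors' implementation: it assumes the projection minimizes KL(q || p) over distributions q supported only on safe outcomes, and the four-outcome distributions, the `safe` mask, and the function names are hypothetical. Under that assumption, the student's own distribution restricted to the safe set and renormalized attains the minimum, and any other safe distribution (here standing in for an external teacher) pays an additional KL term measured against that self-generated target.

```python
import math


def kl(q, p):
    """KL(q || p) for two discrete distributions given as equal-length lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def project_onto_safe(p, safe):
    """Zero out unsafe outcomes and renormalize; also return the safe mass."""
    mass = sum(pi for pi, s in zip(p, safe) if s)
    return [pi / mass if s else 0.0 for pi, s in zip(p, safe)], mass


# Hypothetical student distribution over four candidate responses.
student = [0.5, 0.2, 0.2, 0.1]
# Hypothetical safety labels for those responses.
safe = [False, True, True, False]

# Self-generated target: the student's own safety-filtered distribution.
self_target, safe_mass = project_onto_safe(student, safe)

# Hypothetical external teacher: also safe, but distributed differently.
teacher = [0.0, 0.9, 0.1, 0.0]

kl_self = kl(self_target, student)     # equals -log(safe_mass)
kl_teacher = kl(teacher, student)      # cost of moving the student to the teacher
excess = kl(teacher, self_target)      # extra cost relative to the self target

print(f"KL(self target || student) = {kl_self:.4f}")
print(f"KL(teacher || student)     = {kl_teacher:.4f}")
print(f"excess KL of teacher       = {excess:.4f}")
```

In this toy setting the teacher's cost decomposes exactly as `kl_teacher = kl_self + excess`, so the excess term is zero only when the teacher coincides with the student's safety-filtered distribution.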
Code & Models
- 🤗 Seanie-lee/ThinkSafe-Qwen3-0.6B (model, 7 downloads)
- 🤗 Seanie-lee/ThinkSafe-Qwen3-1.7B (model, 17 downloads)
- 🤗 Seanie-lee/ThinkSafe-Qwen3-4B (model, 25 downloads)
- 🤗 Seanie-lee/ThinkSafe-Qwen3-8B (model, 61 downloads)
- 🤗 Seanie-lee/ThinkSafe-R1-Distill-1.5B (model, 17 downloads)
- 🤗 Seanie-lee/ThinkSafe-R1-Distill-7B (model, 23 downloads)
- 🤗 Seanie-lee/ThinkSafe-R1-Distill-8B (model, 19 downloads)
- Seanie-lee/ThinkSafe-R1-Distill-1.5B (dataset, 34 downloads)
- Seanie-lee/ThinkSafe-R1-Distill-7B (dataset, 23 downloads)
- Seanie-lee/ThinkSafe-R1-Distill-8B (dataset, 28 downloads)
- Seanie-lee/ThinkSafe-Qwen3-0.6B-WildGuard (dataset, 17 downloads)
- Seanie-lee/ThinkSafe-Qwen3-1.7B-WildGuard (dataset, 20 downloads)
- Seanie-lee/ThinkSafe-Qwen3-4B-WildGuard (dataset, 22 downloads)
- Seanie-lee/ThinkSafe-Qwen3-8B-WildGuard (dataset, 23 downloads)
- Seanie-lee/ThinkSafe-Qwen3-0.6B (dataset, 27 downloads)
