Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
Jianan Chen, Zhifang Zhang, Shuo He, Linan Yue, Lei Feng, Minling Zhang

TL;DR
This paper introduces a safety alignment method that encourages large reasoning models to make safety decisions before chain-of-thought generation, improving safety without sacrificing reasoning performance.
Contribution
The paper proposes an approach to enhance safety in LRMs by integrating safety decision signals prior to CoT generation, addressing the safety degradation observed once CoT is enabled.
Findings
Significantly improves the safety capabilities of LRMs.
Maintains reasoning performance while enhancing safety.
Shows that prompting safety decisions before CoT generation is effective.
Abstract
Large reasoning models (LRMs) have achieved remarkable performance via chain-of-thought (CoT), but recent studies have shown that these enhanced reasoning capabilities come at the expense of significantly degraded safety capabilities. In this paper, we reveal that LRMs' safety degradation occurs only after CoT is enabled, and that this degradation is not observed when CoT is disabled. This observation motivates us to encourage LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before CoT generation starts. Specifically, we first use a BERT-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs' safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated to…
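The abstract is cut off before it gives the training objective, so the following is only an illustrative sketch of what "auxiliary supervision" from a safety decision signal could look like. It assumes the signal is a probability that the safe model would treat the prompt as safe to answer, that the LRM exposes a comparable pre-CoT probability, and that the two are tied together with a weighted binary cross-entropy term. The function names, the `aux_weight` parameter, and the choice of cross-entropy are all hypothetical and are not taken from the paper.

```python
import math


def binary_cross_entropy(pred_prob, target_prob, eps=1e-12):
    """Cross-entropy between a predicted probability and a (soft) target."""
    # Clamp to avoid log(0) for saturated predictions.
    p = min(max(pred_prob, eps), 1.0 - eps)
    return -(target_prob * math.log(p) + (1.0 - target_prob) * math.log(1.0 - p))


def combined_alignment_loss(lm_loss, pre_cot_safe_prob, teacher_safe_signal,
                            aux_weight=0.5):
    """Hypothetical objective: alignment loss plus an auxiliary safety term.

    lm_loss             -- the usual safety-alignment loss (assumed given)
    pre_cot_safe_prob   -- LRM's pre-CoT probability that answering is safe
                           (assumed readable from the model; hypothetical)
    teacher_safe_signal -- safety decision signal extracted by the classifier
                           from the safe model, in [0, 1] (hypothetical form)
    aux_weight          -- weight of the auxiliary term (hypothetical)
    """
    aux = binary_cross_entropy(pre_cot_safe_prob, teacher_safe_signal)
    return lm_loss + aux_weight * aux


# The auxiliary term is small when the LRM's pre-CoT decision agrees with
# the safe model's signal and large when it disagrees.
loss_agree = combined_alignment_loss(1.0, 0.9, 1.0)
loss_disagree = combined_alignment_loss(1.0, 0.1, 1.0)
```

In a real training loop the same idea would be expressed with differentiable tensors so that the auxiliary term contributes gradients; the scalar version here only shows how the two terms combine.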
