SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran

TL;DR
This paper systematically evaluates the safety of large reasoning models with long chain-of-thought outputs, introduces new safety metrics, analyzes decoding strategies, and proposes SafeChain, a safety training dataset that improves model safety without sacrificing reasoning performance.
Contribution
The paper introduces SafeChain, the first safety training dataset in chain-of-thought style, and demonstrates that it enhances model safety while maintaining reasoning capabilities.
Findings
LRMs are less safe than their reasoning capabilities suggest.
Decoding strategies like ZeroThink, LessThink, and MoreThink can improve safety.
The SafeChain training dataset enhances safety without harming reasoning performance.
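The decoding strategies named in the findings are not described on this page. The sketch below assumes they work by constraining the model's thinking segment: ZeroThink prefills an empty thought, LessThink prefills a short one, and MoreThink replaces a premature end-of-thinking tag with a continuation cue. The `<think>`/`</think>` tags, the function names, the short-thought wording, and the "Wait" cue are all illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: every name and string here is an assumption for illustration.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def thinking_prefix(strategy: str,
                    short_thought: str = "I can answer this without much thinking.") -> str:
    """Return the text to prefill at the start of the model's response."""
    if strategy == "zerothink":
        # Empty thought: the model goes straight to its final answer.
        return f"{THINK_OPEN}{THINK_CLOSE}\n\n"
    if strategy == "lessthink":
        # Short, fixed thought: the reasoning segment is closed immediately.
        return f"{THINK_OPEN}{short_thought}{THINK_CLOSE}\n\n"
    if strategy == "morethink":
        # Thought left open: the model reasons freely, see extend_thinking().
        return THINK_OPEN
    raise ValueError(f"unknown strategy: {strategy!r}")


def extend_thinking(partial: str, cue: str = "Wait") -> str:
    """MoreThink step: if the thought was just closed, reopen it with a cue."""
    stripped = partial.rstrip()
    if stripped.endswith(THINK_CLOSE):
        reopened = stripped[: -len(THINK_CLOSE)].rstrip()
        return f"{reopened}\n{cue}"
    return partial


if __name__ == "__main__":
    for name in ("zerothink", "lessthink", "morethink"):
        print(name, repr(thinking_prefix(name)))
    print(repr(extend_thinking("<think>First idea.</think>")))
```

In practice the prefix would be appended after the chat template's assistant header so that generation continues from it. How many times MoreThink reopens the thought is a budget the caller would have to choose.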
Abstract
Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
