SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

Fengqing Jiang; Zhangchen Xu; Yuetai Li; Luyao Niu; Zhen Xiang; Bo Li; Bill Yuchen Lin; Radha Poovendran

arXiv:2502.12025·cs.AI·February 18, 2025

TL;DR

This paper systematically evaluates the safety of large reasoning models with long chain-of-thought outputs, introduces new safety metrics, analyzes decoding strategies, and proposes SafeChain, a safety training dataset that improves model safety without sacrificing reasoning performance.

Contribution

The paper introduces SafeChain, which it describes as the first safety training dataset written in chain-of-thought style, and shows that training on it improves model safety while preserving reasoning capabilities.

Findings

01. Large reasoning models (LRMs) are less safe than their reasoning capabilities suggest.

02. Decoding strategies such as ZeroThink, LessThink, and MoreThink can improve safety.

03. The SafeChain training dataset enhances safety without harming reasoning performance.
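
The decoding strategies named in the findings are not defined on this page. The sketch below shows one plausible reading, under stated assumptions: ZeroThink prefills an empty thinking segment, LessThink prefills a short fixed thought, and MoreThink removes an early end-of-thinking tag and appends a continuation cue. The `<think>`/`</think>` tags, the function names, and the default strings are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of prefill-based control over the thinking segment.
# Tag strings and helper names are assumptions made for illustration only.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def build_prefill(strategy: str,
                  short_thought: str = "I can answer this without thinking much.") -> str:
    """Return text to place at the start of the model's response."""
    if strategy == "zerothink":
        # Assumed reading: force an empty thinking segment.
        return THINK_OPEN + THINK_CLOSE
    if strategy == "lessthink":
        # Assumed reading: force a short, fixed thinking segment.
        return THINK_OPEN + short_thought + THINK_CLOSE
    if strategy == "morethink":
        # Assumed reading: leave the segment open; it is extended at decode time.
        return THINK_OPEN
    raise ValueError(f"unknown strategy: {strategy}")


def extend_thinking(partial: str, cue: str = "Wait") -> str:
    """MoreThink-style step: if the model closed its thought, reopen it with a cue."""
    if partial.endswith(THINK_CLOSE):
        return partial[: -len(THINK_CLOSE)] + cue
    return partial
```

In use, the string from `build_prefill` would be appended to the prompt before generation, and `extend_thinking` would be applied to partial output until some thinking budget is reached; both the budget and the generation loop are left out because the page does not specify them.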

Abstract

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques