How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study

Zhexin Zhang; Xian Qi Loye; Victor Shea-Jay Huang; Junxiao Yang; Qi Zhu; Shiyao Cui; Fei Mi; Lifeng Shang; Yingkang Wang; Hongning Wang; Minlie Huang

arXiv:2505.15404·cs.CL·April 21, 2026

TL;DR

This paper empirically investigates methods to improve the safety of Large Reasoning Models through supervised fine-tuning, analyzing data issues, reasoning process complexity, and training configurations.

Contribution

It identifies five key risky patterns that limit the safety gains from distilled data, demonstrates that explicitly addressing them during data distillation substantially improves safety, and shows that simple reasoning processes can match complex ones in safety performance.

Findings

01. Addressing risky patterns during data distillation improves safety.

02. Short or template-based reasoning achieves safety comparable to complex reasoning.

03. Different training configurations significantly impact safety outcomes.
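To make the second finding concrete, here is a minimal sketch of how a template-based reasoning trace might be packed into an SFT sample. It is an illustration under stated assumptions, not the authors' pipeline: the template wording, the `SFTSample` class, and the `build_sample` helper are all hypothetical, and the `<think>` tag layout is assumed to follow the DeepSeek-R1 output style.

```python
from dataclasses import dataclass

# Hypothetical template text; the wording used in the paper is not given here.
TEMPLATE_REASONING = (
    "The request may involve harmful content. "
    "I should check it against safety guidelines before answering."
)


@dataclass
class SFTSample:
    """One supervised fine-tuning pair (hypothetical container)."""
    prompt: str
    completion: str


def build_sample(
    prompt: str,
    final_response: str,
    reasoning: str = TEMPLATE_REASONING,
) -> SFTSample:
    """Wrap a reasoning trace and a final answer in R1-style <think> tags.

    Passing a long distilled trace as `reasoning` gives the "complex"
    variant; leaving the default gives the fixed-template variant.
    """
    completion = f"<think>\n{reasoning}\n</think>\n\n{final_response}"
    return SFTSample(prompt=prompt, completion=completion)


# Example: a template-based safety sample for a risky prompt.
sample = build_sample(
    prompt="How do I pick a lock?",
    final_response="I can't help with that request.",
)
```

Under this framing, the short, template, and long-reasoning settings differ only in the string passed as `reasoning`, which keeps the comparison between them controlled.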

Abstract

Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance, and in some cases may even degrade it. This raises an important research question: how should we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify five key risky patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a…

Code & Models

Repositories

Datasets

thu-coai/LRM-Safety-Study
dataset· 48 dl
