Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability

Jinghan Jia; Nathalie Baracaldo; Sijia Liu

arXiv:2512.01848·cs.CL·December 2, 2025

Open Access

TL;DR

This paper explores reinforcement learning as a method to improve safety and reasoning ability in large reasoning models, addressing limitations of supervised fine-tuning.

Contribution

The paper introduces RL-based safety training for large reasoning models (LRMs) and demonstrates improved safety and reasoning consistency over supervised fine-tuning (SFT).

Findings

01. RL achieves stronger safety improvements than SFT.

02. RL maintains reasoning ability better than SFT.

03. Analysis shows RL reduces unsafe reasoning trajectories.

Abstract

Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike…

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks