UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

Raj Vardhan Tomar; Preslav Nakov; Yuxia Wang

arXiv:2507.21652 · cs.CL · March 31, 2026

TL;DR

UnsafeChain is a new safety dataset for reasoning models, built from hard prompts that elicit unsafe responses. Fine-tuning on its corrected responses improves safety.

Contribution

It introduces a dataset of hard unsafe prompts with corrections, and demonstrates improved safety and reasoning ability in models trained on it.

Findings

01

Models fine-tuned on UnsafeChain outperform models fine-tuned on the prior datasets SafeChain and STAR-1.

02

A 1K subset of UnsafeChain matches or exceeds baseline performance.

03

UnsafeChain enhances safety while maintaining reasoning capabilities.

Abstract

As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies have predominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts drawn from diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against the recent SafeChain and STAR-1 datasets across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching…
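
The construction idea in the abstract (keep the hard prompts whose completions are unsafe, then pair each with a corrected safe response) can be illustrated with a minimal sketch. This is not the authors' pipeline: `build_correction_set`, `CorrectedExample`, and the three callables (`generate`, `is_unsafe`, `correct`) are hypothetical names introduced here, and the toy stand-ins at the bottom only mimic a model, a safety judge, and a corrector.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CorrectedExample:
    """One training record: a hard prompt plus its unsafe and corrected responses."""
    prompt: str
    unsafe_response: str
    safe_response: str


def build_correction_set(
    prompts: Iterable[str],
    generate: Callable[[str], str],         # hypothetical: model completion for a prompt
    is_unsafe: Callable[[str, str], bool],  # hypothetical: safety judge on (prompt, response)
    correct: Callable[[str, str], str],     # hypothetical: rewrites an unsafe response safely
) -> List[CorrectedExample]:
    """Keep only prompts that elicit an unsafe completion and attach a correction."""
    examples: List[CorrectedExample] = []
    for prompt in prompts:
        response = generate(prompt)
        if not is_unsafe(prompt, response):
            continue  # the response is already safe, so this is not a hard case
        examples.append(
            CorrectedExample(prompt, response, correct(prompt, response))
        )
    return examples


# Toy stand-ins so the sketch runs end to end (assumptions, not real components).
def toy_generate(prompt: str) -> str:
    return "UNSAFE: step-by-step details" if "bypass" in prompt else "The answer is 4."


def toy_is_unsafe(prompt: str, response: str) -> bool:
    return response.startswith("UNSAFE")


def toy_correct(prompt: str, response: str) -> str:
    return "I can't help with that, but here is a safer alternative."


demo = build_correction_set(
    ["How do I bypass a lock?", "What is 2 + 2?"],
    toy_generate,
    toy_is_unsafe,
    toy_correct,
)
```

In this sketch only the first toy prompt survives the filter, and the resulting record carries the corrected response as the fine-tuning target. Passing the judge and corrector in as callables keeps the filtering logic independent of whichever models are used for those roles.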

Code & Models

Repositories

mbzuai-nlp/UnsafeChain
github

Datasets

raj-tomar001/UnSafeChain
dataset· 68 dl