ReasoningShield: Safety Detection over Reasoning Traces of Large Reasoning Models

Changyi Li; Jiayi Wang; Xudong Pan; Geng Hong; Min Yang

arXiv:2505.17244 · cs.CL · October 16, 2025

TL;DR

ReasoningShield is a new framework designed to detect safety risks in the reasoning traces of large reasoning models, addressing hidden hazards in intermediate steps that existing tools often miss.

Contribution

It formalizes the task of CoT moderation, creates a comprehensive benchmark, and proposes a two-stage training strategy to improve safety detection in reasoning traces.

Findings

1. Outperforms existing moderation tools by 35.6% in safety detection accuracy.
2. Achieves 15.8% improvement over GPT-4o on safety benchmarks.
3. Generalizes well across different reasoning tasks and unseen scenarios.

Abstract

Large Reasoning Models (LRMs) leverage transparent reasoning traces, known as Chain-of-Thoughts (CoTs), to break down complex problems into intermediate steps and derive final answers. However, these reasoning traces introduce unique safety challenges: harmful content can be embedded in intermediate steps even when final answers appear benign. Existing moderation tools, designed to handle generated answers, struggle to effectively detect hidden risks within CoTs. To address these challenges, we introduce ReasoningShield, a lightweight yet robust framework for moderating CoTs in LRMs. Our key contributions include: (1) formalizing the task of CoT moderation with a multi-level taxonomy of 10 risk categories across 3 safety levels, (2) creating the first CoT moderation benchmark which contains 9.2K pairs of queries and reasoning traces, including a 7K-sample training set annotated via a…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This paper is well-written and clearly presents the dataset description, construction process, and experiments.
2. To improve dataset quality and label reliability, the authors effectively employ methods such as multi-model annotation and human–AI collaboration.
3. Developing a small model that performs competitively in CoT moderation and maintains consistent performance across reasoning traces from various models makes it highly practical and useful.
4. The model demonstrates robust gener

Weaknesses

The dataset and its construction process introduced by the authors are quite impressive; however, there are several issues to point out regarding the experimental setup and analysis of the results.
1. In the pilot study for Table 1, how is the label distribution of the 800-sample dataset? Are the labels well-balanced? Since the F1 score depends on the label distribution, it is important to ensure balance in the dataset.
2. Compared to other models, the F1CoT metric of Claude-Sonnet-3.7 is rema
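
Reviewer 01's point that F1 depends on the label distribution can be made concrete with a small calculation. The sketch below assumes a hypothetical detector with a fixed 90% true-positive rate and 10% false-positive rate; these rates, and the function name `f1_at_prevalence`, are illustrative and not taken from the paper. Only the sample count of 800 comes from the review text.

```python
def f1_at_prevalence(tpr, fpr, prevalence, n=800):
    """F1 of the positive (unsafe) class for a detector with fixed
    true-positive and false-positive rates, as the share of positive
    labels in an n-sample evaluation set varies."""
    positives = prevalence * n          # samples labelled unsafe
    negatives = n - positives           # samples labelled safe
    tp = tpr * positives                # unsafe samples correctly flagged
    fn = positives - tp                 # unsafe samples missed
    fp = fpr * negatives                # safe samples wrongly flagged
    return 2 * tp / (2 * tp + fp + fn)


if __name__ == "__main__":
    # Hypothetical detector: same error rates, different label balance.
    for prevalence in (0.1, 0.3, 0.5):
        score = f1_at_prevalence(tpr=0.9, fpr=0.1, prevalence=prevalence)
        print(f"unsafe share {prevalence:.0%}: F1 = {score:.3f}")
```

With the same detector, F1 rises from roughly 0.64 at a 10% unsafe share to 0.90 on a balanced set, which is why a reported F1 is hard to interpret without knowing the label distribution.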

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper convincingly motivates why answer-only moderation fails, and why stepwise CoT analysis is required; several mainstream LRMs are shown to degrade on CoT moderation vs. answer moderation.
2. ReasoningShield-3B reaches ≈91.8% F1, beating LlamaGuard-4 by ~35% and GPT-4o by ~15%; a 1B variant also performs competitively. The two-stage training on small backbones is easy to replicate and practical for deployment. This efficient and accurate detection tool has brought great application

Weaknesses

1. Although the dataset mixes sources and shows OOD gains, more naturally occurring product logs or red-teaming traces (with appropriate approvals) would better validate real-world robustness beyond curated constructs.
2. Results are strong aggregate-wise, but more category-wise error analysis would guide deployment boundaries.
3. While the paper demonstrates significant practical value, it lacks theoretical analysis and higher-level methodologies, making it difficult to further generalize t

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The problem motivation is timely. The paper clearly demonstrates that releasing CoT increases safety risks and that existing safety monitors fail to capture covert malicious reasoning.
2. A reasonably sized dataset with structured safety labels is provided, which may benefit future research on LRM safety and CoT oversight.
3. The approach is simple, inference-efficient, and improves smaller model performance, which is important for realistic deployment scenarios.
4. Empirical results cover se

Weaknesses

1. Innovation is relatively incremental. The method primarily applies standard SFT + DPO pipelines to a new dataset rather than proposing new architectures or fundamentally novel mechanisms for CoT safety.
2. Dataset construction and annotation quality are insufficiently detailed. The paper lacks breakdowns on annotator consistency, error rates, and quality control procedures.
3. Limited exploration of scaling behavior. The study focuses on smaller models; results on larger safety monitors (e.g.

Taxonomy

Topics: Adversarial Robustness in Machine Learning