ReasoningShield: Safety Detection over Reasoning Traces of Large Reasoning Models
Changyi Li, Jiayi Wang, Xudong Pan, Geng Hong, Min Yang

TL;DR
ReasoningShield is a framework for detecting safety risks in the reasoning traces of large reasoning models. It targets hazards hidden in intermediate steps, which existing moderation tools often miss.
Contribution
The paper formalizes the task of CoT moderation, builds a CoT moderation benchmark of 9.2K query and reasoning-trace pairs, and proposes a two-stage training strategy for detecting safety risks in reasoning traces.
Findings
Outperforms existing moderation tools (LlamaGuard-4, per the reviews) by 35.6% in CoT safety detection F1.
Achieves 15.8% improvement over GPT-4o on safety benchmarks.
Generalizes well across different reasoning tasks and unseen scenarios.
Abstract
Large Reasoning Models (LRMs) leverage transparent reasoning traces, known as Chain-of-Thoughts (CoTs), to break down complex problems into intermediate steps and derive final answers. However, these reasoning traces introduce unique safety challenges: harmful content can be embedded in intermediate steps even when final answers appear benign. Existing moderation tools, designed to handle generated answers, struggle to effectively detect hidden risks within CoTs. To address these challenges, we introduce ReasoningShield, a lightweight yet robust framework for moderating CoTs in LRMs. Our key contributions include: (1) formalizing the task of CoT moderation with a multi-level taxonomy of 10 risk categories across 3 safety levels, (2) creating the first CoT moderation benchmark which contains 9.2K pairs of queries and reasoning traces, including a 7K-sample training set annotated via a…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This paper is well-written and clearly presents the dataset description, construction process, and experiments. 2. To improve dataset quality and label reliability, the authors effectively employ methods such as multi-model annotation and human–AI collaboration. 3. Developing a small model that performs competitively in CoT moderation and maintains consistent performance across reasoning traces from various models makes it highly practical and useful. 4. The model demonstrates robust gener
The dataset and its construction process introduced by the authors are quite impressive; however, there are several issues to point out regarding the experimental setup and analysis of the results. 1. In the pilot study for Table 1, what is the label distribution of the 800-sample dataset? Are the labels well-balanced? Since the F1 score depends on the label distribution, it is important to ensure balance in the dataset. 2. Compared to other models, the F1CoT metric of Claude-Sonnet-3.7 is rema
1. The paper convincingly motivates why answer-only moderation fails, and why stepwise CoT analysis is required; several mainstream LRMs are shown to degrade on CoT moderation vs. answer moderation. 2. ReasoningShield-3B reaches ≈91.8% F1, beating LlamaGuard-4 by ~35% and GPT-4o by ~15%; a 1B variant also performs competitively. The two-stage training on small backbones is easy to replicate and practical for deployment. This efficient and accurate detection tool has brought great application
1. Although the dataset mixes sources and shows OOD gains, more naturally occurring product logs or red-teaming traces (with appropriate approvals) would better validate real-world robustness beyond curated constructs. 2. Results are strong aggregate-wise, but more category-wise error analysis would guide deployment boundaries. 3. While the paper demonstrates significant practical value, it lacks theoretical analysis and higher-level methodologies, making it difficult to further generalize t
1. The problem motivation is timely. The paper clearly demonstrates that releasing CoT increases safety risks and that existing safety monitors fail to capture covert malicious reasoning. 2. A reasonably sized dataset with structured safety labels is provided, which may benefit future research on LRM safety and CoT oversight. 3. The approach is simple, inference-efficient, and improves smaller model performance, which is important for realistic deployment scenarios. 4. Empirical results cover se
1. Innovation is relatively incremental. The method primarily applies standard SFT + DPO pipelines to a new dataset rather than proposing new architectures or fundamentally novel mechanisms for CoT safety. 2. Dataset construction and annotation quality are insufficiently detailed. The paper lacks breakdowns on annotator consistency, error rates, and quality control procedures. 3. Limited exploration of scaling behavior. The study focuses on smaller models; results on larger safety monitors (e.g.
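One reviewer's point above, that the F1 score depends on the label distribution, can be illustrated with a minimal sketch. The function name `f1_at_prevalence` and the detector rates used here (90% true-positive rate, 10% false-positive rate) are hypothetical values chosen for illustration; they are not taken from the paper or its pilot study.

```python
def f1_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Expected F1 of a binary detector with fixed TPR and FPR.

    `prevalence` is the share of positive (unsafe) samples in the
    evaluation set. All inputs are illustrative assumptions.
    """
    tp = tpr * prevalence                 # unsafe samples correctly flagged
    fp = fpr * (1.0 - prevalence)         # safe samples wrongly flagged
    fn = (1.0 - tpr) * prevalence         # unsafe samples missed
    denom = 2.0 * tp + fp + fn
    return 0.0 if denom == 0 else 2.0 * tp / denom


# Same hypothetical detector, two different label distributions.
balanced = f1_at_prevalence(tpr=0.9, fpr=0.1, prevalence=0.5)
skewed = f1_at_prevalence(tpr=0.9, fpr=0.1, prevalence=0.1)

print(f"F1 with 50% unsafe samples: {balanced:.3f}")
print(f"F1 with 10% unsafe samples: {skewed:.3f}")
```

With the detector held fixed, F1 is lower on the skewed set because false positives from the larger safe class reduce precision. This is why F1 comparisons are easier to interpret when the label distribution of the evaluation set is reported.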
Taxonomy
Topics: Adversarial Robustness in Machine Learning
