When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

Ishita Kakkar; Enze Zhang; Rheeya Uppaal; Junjie Hu

arXiv:2604.19001·cs.CL·April 22, 2026


TL;DR

HarmThoughts is a new benchmark for step-wise safety evaluation of the reasoning traces produced by large reasoning models, focusing on how harm propagates across multi-step reasoning.

Contribution

The paper introduces HarmThoughts, a sentence-level dataset built on a taxonomy of 16 harmful reasoning behaviors in four functional groups, for analyzing how harm emerges in reasoning traces and for supporting safety monitoring and detection.

Findings

1. Existing detectors struggle with nuanced harm detection in reasoning traces.

2. Harm propagation patterns reveal common behavioral trajectories and drift points.

3. Fine-grained behavior detection remains a significant challenge for current safety systems.

Abstract

Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When a model is jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces, a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. HarmThoughts is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset…
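To make the idea of sentence-level labels concrete, here is a minimal Python sketch of how a labeled reasoning trace might be represented and scanned for the point where it first drifts into a harmful behavior. Everything in it is an assumption for illustration: the `LabeledSentence` record, the label strings, and the helper functions are hypothetical and are not the dataset's actual schema or the authors' tooling. The four labels are only the example behaviors named in the abstract, not the full 16-behavior taxonomy.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical label names, taken from the four example behaviors in the abstract.
# The paper's full taxonomy (16 behaviors, four groups) is NOT reproduced here.
HARMFUL_BEHAVIORS = {
    "suppressing_refusal",
    "rationalizing_compliance",
    "decomposing_harmful_task",
    "concealing_risk",
}


@dataclass
class LabeledSentence:
    """One sentence of a reasoning trace with an optional behavior label (assumed schema)."""
    text: str
    behavior: Optional[str]  # None means no harmful behavior was annotated


def first_drift_index(trace: List[LabeledSentence]) -> Optional[int]:
    """Return the index of the first sentence with a harmful label, or None."""
    for i, sentence in enumerate(trace):
        if sentence.behavior in HARMFUL_BEHAVIORS:
            return i
    return None


def behavior_trajectory(trace: List[LabeledSentence]) -> List[str]:
    """List labels in order, skipping unlabeled sentences and collapsing repeats."""
    trajectory: List[str] = []
    for sentence in trace:
        label = sentence.behavior
        if label is not None and (not trajectory or trajectory[-1] != label):
            trajectory.append(label)
    return trajectory


# Placeholder trace: the sentence texts are stand-ins, not real data.
trace = [
    LabeledSentence("[sentence 0]", None),
    LabeledSentence("[sentence 1]", "suppressing_refusal"),
    LabeledSentence("[sentence 2]", "suppressing_refusal"),
    LabeledSentence("[sentence 3]", "rationalizing_compliance"),
    LabeledSentence("[sentence 4]", None),
    LabeledSentence("[sentence 5]", "decomposing_harmful_task"),
]

drift = first_drift_index(trace)
trajectory = behavior_trajectory(trace)
print(drift)
print(trajectory)
```

A step-wise monitor could use the drift index to decide where to intervene, while the collapsed trajectory gives a compact view of how behaviors follow one another within a trace.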


Code & Models

Repositories

https://huggingface.co/datasets/ishitakakkar-10/HarmThoughts
