TL;DR
HarmThoughts is a new benchmark for step-wise safety evaluation of the reasoning traces produced by large reasoning models (LRMs), focusing on how harm propagates across multi-step reasoning rather than only on the final output.
Contribution
The paper introduces HarmThoughts, a sentence-level annotated dataset built on a taxonomy of 16 harmful reasoning behaviors in four functional groups, for analyzing how harm emerges in reasoning traces and for supporting safety monitoring and detection.
Findings
Existing detectors struggle with nuanced harm detection in reasoning traces.
Harm propagation patterns reveal common behavioral trajectories and drift points.
Fine-grained behavior detection remains a significant challenge for current safety systems.
Abstract
Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When a model is jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces, a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. HarmThoughts is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset…
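To make the idea of sentence-level labels and drift points concrete, here is a minimal sketch of how a labeled reasoning trace could be represented and scanned. Everything in it is an assumption made for illustration: the `LabeledSentence` type, the label strings (borrowed from the four behaviors the abstract names), the `"benign"` label, and both helper functions are hypothetical, not the benchmark's actual schema or tooling.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical label names, taken from the behaviors the abstract mentions.
# The benchmark's real label set (16 behaviors, four groups) is not shown here.
HARMFUL_BEHAVIORS = {
    "suppress_refusal",
    "rationalize_compliance",
    "decompose_harmful_task",
    "conceal_risk",
}


@dataclass
class LabeledSentence:
    text: str
    behavior: str  # hypothetical label; "benign" means no harmful behavior


def first_drift_index(trace: List[LabeledSentence]) -> Optional[int]:
    """Return the index of the first sentence with a harmful label, or None."""
    for i, sentence in enumerate(trace):
        if sentence.behavior in HARMFUL_BEHAVIORS:
            return i
    return None


def behavior_trajectory(trace: List[LabeledSentence]) -> List[str]:
    """Collapse consecutive repeated labels into a sequence of behaviors."""
    trajectory: List[str] = []
    for sentence in trace:
        if not trajectory or trajectory[-1] != sentence.behavior:
            trajectory.append(sentence.behavior)
    return trajectory


# Illustrative trace with placeholder text (not drawn from the dataset).
example_trace = [
    LabeledSentence("The request looks risky.", "benign"),
    LabeledSentence("I would normally decline.", "benign"),
    LabeledSentence("But the user says it is for research.", "rationalize_compliance"),
    LabeledSentence("So declining is unnecessary here.", "suppress_refusal"),
    LabeledSentence("I will split the task into parts.", "decompose_harmful_task"),
]

drift = first_drift_index(example_trace)
trajectory = behavior_trajectory(example_trace)
```

In this toy trace the drift point is the third sentence (index 2), where the first harmful label appears. A step-wise monitor could intervene at that index instead of waiting for the final output.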
