Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li, Ruixuan Huang, Zhenlan Ji, Pingchuan Ma, Shuai Wang

TL;DR
This paper introduces a framework for real-time monitoring of reasoning safety in large language models, addressing vulnerabilities in their reasoning processes beyond content safety.
Contribution
The paper formalizes reasoning safety, proposes a taxonomy of unsafe reasoning behaviors, and develops a zero-shot monitor that detects unsafe reasoning steps with high accuracy.
Findings
The taxonomy identifies nine unsafe reasoning behaviors.
The monitor achieves up to 87.11% step-level localization accuracy.
The monitor maintains a low false-positive rate and adds negligible latency.
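To make "step-level localization accuracy" concrete, here is a minimal Python sketch of one plausible way such a metric could be computed. It assumes exact-match scoring: a chain counts as correctly localized only when the first step the monitor flags is the annotated unsafe step. This summary does not give the paper's definition, so the function name `step_localization_accuracy`, the scoring rule, and the toy data are all hypothetical.

```python
from typing import Optional, Sequence


def step_localization_accuracy(
    predicted: Sequence[Optional[int]],
    annotated: Sequence[int],
) -> float:
    """Fraction of unsafe chains whose flagged step matches the annotated step.

    predicted[i] is the index of the first step the monitor flagged in chain i,
    or None if it flagged nothing. annotated[i] is the index of the labeled
    unsafe step. Exact-match scoring is an assumption, not the paper's
    definition.
    """
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must have the same length")
    if not annotated:
        raise ValueError("need at least one annotated chain")
    # A chain is a hit only if the monitor flagged something and the index matches.
    hits = sum(1 for p, a in zip(predicted, annotated) if p is not None and p == a)
    return hits / len(annotated)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    predicted = [2, None, 5, 1]
    annotated = [2, 3, 4, 1]
    print(step_localization_accuracy(predicted, annotated))
```

Under this reading, a missed detection (`None`) and a detection at the wrong step both count as errors. A looser variant could accept a flag within a small window around the annotated step.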
Abstract
Large language models increasingly rely on explicit chain-of-thought reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work focuses predominantly on content safety (i.e., detecting harmful, biased, or factually incorrect outputs), while treating the underlying reasoning chain as an opaque intermediate artifact. We argue that reasoning safety constitutes a fundamental security dimension orthogonal to content safety: the requirement that a model's reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. In this paper, we formalize reasoning safety and introduce a systematic taxonomy of nine unsafe reasoning behaviors. We then conduct a large-scale prevalence study, annotating over 4,000 reasoning chains across benign benchmarks and four state-of-the-art reasoning…
