ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Yuquan Wang; Mi Zhang; Yining Wang; Geng Hong; Mi Wen; Xiaoyu You; Min Yang

arXiv:2508.04204·cs.CL·May 7, 2026

TL;DR

ReasoningGuard is an inference-time safety method for large reasoning models that injects safety reflections during reasoning to prevent harmful content, outperforming existing safeguards with minimal extra cost.

Contribution

ReasoningGuard introduces an inference-time safety mechanism that uses the model's internal attention to identify key points in the reasoning path and steer the model towards harmless outputs.

Findings

01 Effectively mitigates four types of jailbreak attacks.

02 Outperforms nine existing safety safeguards.

03 Maintains minimal additional inference cost.

Abstract

Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Current defense methods, however, depend on costly fine-tuning and additional expert knowledge, which limits their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs. It injects timely safety aha moments during the reasoning process to guide the model towards harmless yet helpful reasoning. Our approach leverages the internal attention mechanisms of the LRM to accurately identify key points in the reasoning path, triggering safety-oriented reflections. To safeguard both the subsequent reasoning steps and the final answers, we implement a scaling sampling strategy during decoding to select the optimal reasoning path. With…
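
The three stages named in the abstract (locating a key point via attention, injecting a safety reflection, and sampling several continuations to pick one) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how attention is turned into a trigger or how paths are ranked, so the max-attention rule, the reflection text, and the names `pick_injection_step`, `inject_reflection`, `select_best_path`, `sample_continuation`, and `score` are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical reflection text; the abstract does not give the actual wording.
SAFETY_REFLECTION = "Wait, I should check whether continuing this is safe."


def pick_injection_step(attention_scores: Sequence[float]) -> int:
    """Return the index of the reasoning step with the highest score.

    Assumption: one attention-derived score per reasoning step, with the
    maximum marking the key point. The real criterion may differ.
    """
    if not attention_scores:
        raise ValueError("attention_scores must not be empty")
    return max(range(len(attention_scores)), key=lambda i: attention_scores[i])


def inject_reflection(
    steps: Sequence[str], index: int, reflection: str = SAFETY_REFLECTION
) -> List[str]:
    """Keep steps up to and including `index`, then append the reflection."""
    return list(steps[: index + 1]) + [reflection]


def select_best_path(
    prefix: Sequence[str],
    sample_continuation: Callable[[Sequence[str]], Sequence[str]],
    score: Callable[[Sequence[str]], float],
    num_samples: int = 4,
) -> List[str]:
    """Sample several continuations of `prefix` and keep the best-scoring path.

    `sample_continuation` stands in for the model's decoder and `score` for
    whatever ranks candidate paths; both are hypothetical placeholders.
    """
    candidates = [
        list(prefix) + list(sample_continuation(prefix))
        for _ in range(num_samples)
    ]
    return max(candidates, key=score)
```

In this sketch, a caller would compute one score per reasoning step, truncate the reasoning at the chosen step, append the reflection, and then pass the resulting prefix to `select_best_path` with a decoder and a scorer of their own.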
