How Does the Thinking Step Influence Model Safety? An Entropy-based Safety Reminder for LRMs
Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han

TL;DR
This paper introduces SafeRemind, a decoding-time method that injects safe-reminding phrases into LRMs' thinking steps using entropy triggers, significantly improving safety without retraining.
Contribution
It proposes SafeRemind, an entropy-based safety intervention that improves LRM safety at decoding time by dynamically inserting safe-reminding phrases into thinking steps, without any parameter updates.
Findings
Up to 45.5% safety improvement across benchmarks
Effective safety enhancement without harming reasoning utility
Evaluated on five different LRMs across six benchmarks
Abstract
Large Reasoning Models (LRMs) achieve remarkable success through explicit thinking steps, yet the thinking steps introduce a novel risk by potentially amplifying unsafe behaviors. Despite this vulnerability, conventional defense mechanisms remain ineffective as they overlook the unique reasoning dynamics of LRMs. In this work, we find that the emergence of safe-reminding phrases within thinking steps plays a pivotal role in ensuring LRM safety. Motivated by this finding, we propose SafeRemind, a decoding-time defense method that dynamically injects safe-reminding phrases into thinking steps. By leveraging entropy triggers to intervene at decision-locking points, SafeRemind redirects potentially harmful trajectories toward safer outcomes without requiring any parameter updates. Extensive evaluations across five LRMs and six benchmarks demonstrate that SafeRemind substantially enhances…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
