How Does the Thinking Step Influence Model Safety? An Entropy-based Safety Reminder for LRMs

Su-Hyeon Kim; Hyundong Jin; Yejin Lee; Yo-Sub Han

arXiv:2601.03662·cs.AI·January 8, 2026

TL;DR

This paper introduces SafeRemind, a decoding-time method that injects safe-reminding phrases into LRMs' thinking steps using entropy triggers, significantly improving safety without retraining.

Contribution

The paper proposes SafeRemind, an entropy-triggered, decoding-time intervention that improves LRM safety by dynamically inserting safe-reminding phrases into thinking steps, with no parameter updates.

Findings

01 Up to 45.5% safety improvement across benchmarks

02 Effective safety enhancement without harming reasoning utility

03 Applicable to five different LRMs

Abstract

Large Reasoning Models (LRMs) achieve remarkable success through explicit thinking steps, yet the thinking steps introduce a novel risk by potentially amplifying unsafe behaviors. Despite this vulnerability, conventional defense mechanisms remain ineffective as they overlook the unique reasoning dynamics of LRMs. In this work, we find that the emergence of safe-reminding phrases within thinking steps plays a pivotal role in ensuring LRM safety. Motivated by this finding, we propose SafeRemind, a decoding-time defense method that dynamically injects safe-reminding phrases into thinking steps. By leveraging entropy triggers to intervene at decision-locking points, SafeRemind redirects potentially harmful trajectories toward safer outcomes without requiring any parameter updates. Extensive evaluations across five LRMs and six benchmarks demonstrate that SafeRemind substantially enhances…
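
The mechanism described in the abstract (watch the entropy of the next-token distribution during decoding, and inject a safe-reminding phrase into the thinking step when a trigger fires) can be illustrated with a toy sketch. This is not the authors' implementation. The trigger rule (entropy above a fixed threshold), the threshold value, the reminder wording, the one-injection limit, and the names `token_entropy`, `decode_with_reminder`, and `toy_model` are all assumptions made for illustration, and the paper's actual criterion for "decision-locking points" may differ.

```python
import math
from typing import Callable, Dict, List


def token_entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def decode_with_reminder(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    reminder: List[str],
    threshold: float,
    max_steps: int = 20,
    max_injections: int = 1,
) -> List[str]:
    """Greedy decoding with a hypothetical entropy-triggered reminder.

    Assumption: the trigger fires when entropy exceeds `threshold`.
    No model parameters are touched; only the decoded context changes.
    """
    tokens = list(prompt)
    injections = 0
    for _ in range(max_steps):
        probs = next_token_probs(tokens)
        if injections < max_injections and token_entropy(probs) > threshold:
            # Inject the reminder instead of sampling, then re-query the model.
            tokens.extend(reminder)
            injections += 1
            continue
        token = max(probs, key=probs.get)  # greedy choice
        tokens.append(token)
        if token == "</think>":
            break
    return tokens


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Hand-written stand-in for an LRM (purely illustrative)."""
    last = tokens[-1]
    if last == "<think>":
        return {"step": 0.9, "hmm": 0.1}
    if last == "step":
        # Undecided point: the model is torn between complying and refusing.
        return {"comply": 0.5, "refuse": 0.5}
    if last == "safe?":
        # After the reminder, the toy model leans strongly toward refusing.
        return {"refuse": 0.95, "comply": 0.05}
    return {"</think>": 1.0}


REMINDER = ["wait,", "is", "this", "safe?"]  # hypothetical reminder phrase

guarded = decode_with_reminder(toy_model, ["<think>"], REMINDER, threshold=0.6)
baseline = decode_with_reminder(
    toy_model, ["<think>"], REMINDER, threshold=0.6, max_injections=0
)
print(" ".join(guarded))
print(" ".join(baseline))
```

In this toy setup the baseline run continues straight through the undecided step, while the guarded run inserts the reminder at that step and the subsequent tokens are conditioned on it. With a real model, `next_token_probs` would be replaced by a softmax over the model's logits, and the threshold would need to be tuned empirically.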

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI