Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Soumya Suvra Ghosal; Souradip Chakraborty; Vaibhav Singh; Furong Huang; Dinesh Manocha; Amrit Singh Bedi

arXiv:2602.11096·cs.CL·February 12, 2026

TL;DR

SafeThink is an inference-time defense for reasoning models that steers only the first few reasoning steps toward safety, reducing jailbreak attack success rates by 30-60% while keeping reasoning accuracy stable.

Contribution

We introduce SafeThink, a novel safety recovery approach that uses early steering steps to improve safety without compromising reasoning performance.

Findings

01. Safety recovery often requires only 1-3 early steering steps.

02. SafeThink reduces jailbreak success rates by 30-60%.

03. Reasoning performance remains stable after safety interventions.

Abstract

Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision:…
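The monitor-then-steer loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `steered_generation`, `next_step`, `safety_score`, `threshold`, and `steer_window` are hypothetical names, the model and the safety reward model are replaced by toy stand-in functions, and limiting interventions to the first three steps is an assumption drawn from the "1-3 early steering steps" finding. Only the prefix text "Wait, think safely" comes from the abstract.

```python
from typing import Callable, List

# Corrective prefix quoted in the abstract.
STEERING_PREFIX = "Wait, think safely"


def steered_generation(
    next_step: Callable[[List[str]], str],      # hypothetical: produces the next reasoning step
    safety_score: Callable[[List[str]], float], # hypothetical: stand-in for a safety reward model
    threshold: float,                           # assumed satisficing threshold
    max_steps: int,
    steer_window: int = 3,                      # assumption: steer only within the first few steps
) -> List[str]:
    """Build a reasoning trace, injecting the prefix only when safety falls below threshold."""
    trace: List[str] = []
    for step_index in range(max_steps):
        candidate = next_step(trace)
        # Satisficing check: intervene only if the threshold is violated, not to maximize safety.
        violated = safety_score(trace + [candidate]) < threshold
        if step_index < steer_window and violated:
            # Discard the candidate and regenerate the step conditioned on the prefix.
            candidate = STEERING_PREFIX + ". " + next_step(trace + [STEERING_PREFIX])
        trace.append(candidate)
    return trace


# Toy stand-ins so the sketch runs without a real model.
def toy_next_step(trace: List[str]) -> str:
    if trace and trace[-1] == STEERING_PREFIX:
        return "I should decline and explain why."
    if any(STEERING_PREFIX in step for step in trace):
        return "safe continuation"
    return "unsafe plan"


def toy_safety_score(trace: List[str]) -> float:
    return 0.0 if "unsafe" in trace[-1] else 1.0


if __name__ == "__main__":
    for step in steered_generation(toy_next_step, toy_safety_score, threshold=0.5, max_steps=3):
        print(step)
```

In this toy run the first step violates the threshold, so it is regenerated after the prefix; later steps pass the check and are left untouched. The sketch does not re-score the regenerated step, which a real system would likely need to do.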

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI