SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Kaiwen Zhou; Xuandong Zhao; Gaowen Liu; Jayanth Srinivasa; Aosong Feng; Dawn Song; Xin Eric Wang

arXiv:2505.16186·cs.AI·November 18, 2025

TL;DR

SafeKey enhances large reasoning models' safety by activating safety reasoning at critical moments, significantly reducing harmful outputs and improving safety generalization without sacrificing core capabilities.

Contribution

The paper introduces SafeKey, a novel approach with dual objectives that better activate safety reasoning in LRMs, improving safety against unseen jailbreaks and harmful prompts.

Findings

01 Reduces harmfulness rate by 9.6% across benchmarks

02 Improves safety generalization to unseen attacks

03 Reshapes internal attention and representations for safety

Abstract

Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, which leads to remarkable improvements on complex tasks. However, they pose serious safety risks when faced with harmful queries and adversarial attacks. Although supervised fine-tuning (SFT), the recent mainstream safety effort on LRMs, improves safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After a thorough investigation of LRMs' generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the "key sentence", which follows the model's query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, which includes two complementary objectives to better activate the safety aha moment in the key sentence: (1) a…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Safety Systems Engineering in Autonomy · Topic Modeling

Methods: Softmax · Attention Is All You Need