SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, Xin Eric Wang

TL;DR
SafeKey enhances large reasoning models' safety by activating safety reasoning at critical moments, significantly reducing harmful outputs and improving safety generalization without sacrificing core capabilities.
Contribution
The paper introduces SafeKey, a novel approach with dual objectives that better activate safety reasoning in LRMs, improving safety against unseen jailbreaks and harmful prompts.
Findings
Lowers the average harmfulness rate by 9.6% across safety benchmarks
Improves safety generalization to unseen attacks
Reshapes internal attention and representations for safety
Abstract
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose serious safety risks when faced with harmful queries and adversarial attacks. While the recent mainstream safety effort on LRMs, supervised fine-tuning (SFT), improves safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After a thorough investigation of LRMs' generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the "key sentence", which follows the model's query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, which includes two complementary objectives to better activate the safety aha moment in the key sentence: (1) a…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Safety Systems Engineering in Autonomy · Topic Modeling
Methods: Softmax · Attention Is All You Need
