R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge
Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park

TL;DR
This paper introduces R1-Act, a post-training method that activates safety knowledge in large reasoning models (LRMs), significantly improving safety with minimal training data and computational resources.
Contribution
It presents a novel, efficient post-training approach to activate safety knowledge in LRMs, enhancing safety without compromising reasoning ability.
Findings
R1-Act delivers larger safety improvements than prior alignment methods while preserving reasoning performance.
It requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU.
The approach is robust across multiple LRM backbones and model sizes.
Abstract
Although large reasoning models (LRMs) have demonstrated impressive capabilities on complex tasks, recent studies reveal that these models frequently fulfill harmful user instructions, raising significant safety concerns. In this paper, we investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. Based on this insight, we propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods. Notably, it requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU. Extensive experiments across multiple LRM backbones and sizes demonstrate the robustness, scalability,…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Safety Systems Engineering in Autonomy · Explainable Artificial Intelligence (XAI)
