RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

Jingnan Zheng; Xiangtian Ji; Yijun Lu; Chenhang Cui; Weixiang Zhao; Gelei Deng; Zhenkai Liang; An Zhang; Tat-Seng Chua

arXiv:2506.07736·cs.AI·October 27, 2025

TL;DR

RSafe is a novel adaptive safeguard for LLMs that uses guided reasoning and reinforcement learning to improve safety and generalize protection against unseen or adversarial risks, surpassing traditional guard models.

Contribution

The paper introduces RSafe, a two-stage reasoning and reinforcement learning framework that enhances LLM safety by internalizing safety principles and adapting to new threats.

Findings

01 RSafe outperforms existing guard models in safety robustness.

02 It generalizes safety protection to unseen and adversarial scenarios.

03 RSafe can be tailored to specific safety policies during inference.

Abstract

Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts, posing significant risks to users and society. To safeguard against the risk of policy-violating content, system-level moderation via external guard models, designed to monitor LLM inputs and outputs and block potentially harmful content, has emerged as a prevalent mitigation strategy. Existing approaches to training guard models rely heavily on extensive human-curated datasets and struggle with out-of-distribution threats, such as emerging harmful categories or jailbreak attacks. To address these limitations, we propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies. RSafe operates in two stages: 1) guided reasoning, where it analyzes safety risks of input content…
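The system-level moderation pattern the abstract describes, in which an external guard checks both the LLM's input and its output against a specified policy, can be sketched as follows. This is a minimal illustration, not RSafe's implementation: `Verdict`, `keyword_guard`, and `moderated_call` are hypothetical names, and the keyword matcher is a toy stand-in for a real guard model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Verdict:
    safe: bool
    reason: str


def keyword_guard(text: str, policy: Sequence[str]) -> Verdict:
    """Toy stand-in for a guard model (assumption): flags any policy term in the text."""
    lowered = text.lower()
    for term in policy:
        if term.lower() in lowered:
            return Verdict(False, f"matches policy term: {term!r}")
    return Verdict(True, "no policy term matched")


def moderated_call(
    prompt: str,
    llm: Callable[[str], str],
    guard: Callable[[str, Sequence[str]], Verdict],
    policy: Sequence[str],
    refusal: str = "[blocked by safeguard]",
) -> str:
    """Run the guard on the input, call the LLM, then run the guard on the output."""
    if not guard(prompt, policy).safe:
        return refusal  # input moderation: the LLM is never called
    response = llm(prompt)
    if not guard(response, policy).safe:
        return refusal  # output moderation: the response is withheld
    return response
```

Because the policy is passed in at call time, the same wrapper can enforce a different policy per deployment without changing the guard or the LLM.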

Code & Models

Repositories

sophiezheng998/rsafe (PyTorch, official)

Taxonomy

Methods: ALIGN