TL;DR
RSafe is a novel adaptive safeguard for LLMs that uses guided reasoning and reinforcement learning to improve safety and generalize protection against unseen or adversarial risks, surpassing traditional guard models.
Contribution
The paper introduces RSafe, a two-stage reasoning and reinforcement learning framework that enhances LLM safety by internalizing safety principles and adapting to new threats.
Findings
RSafe outperforms existing guard models in safety robustness.
It generalizes safety protection to unseen and adversarial scenarios.
RSafe can be tailored to specific safety policies during inference.
Abstract
Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts, posing significant risks to users and society. To safeguard against the risk of policy-violating content, system-level moderation via external guard models, which are designed to monitor LLM inputs and outputs and block potentially harmful content, has emerged as a prevalent mitigation strategy. Existing approaches to training guard models rely heavily on extensive human-curated datasets and struggle with out-of-distribution threats, such as emerging harmful categories or jailbreak attacks. To address these limitations, we propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies. RSafe operates in two stages: 1) guided reasoning, where it analyzes safety risks of input content…
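To make the idea of policy-scoped guided reasoning concrete, the following is a minimal sketch of how a guard model might be prompted with a user-specified safety policy and how its verdict might be read back. It is not RSafe's implementation: the function names `build_guard_prompt` and `parse_verdict`, the `<think>`/`<answer>` tag format, and the example policy categories are all hypothetical choices made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Verdict:
    reasoning: str  # the model's free-text safety analysis
    is_safe: bool   # final decision extracted from the answer tag


def build_guard_prompt(policy: list[str], content: str) -> str:
    """Compose a prompt that asks a guard model to reason over the
    given policy categories before giving a verdict.
    The prompt wording and tag format are assumptions, not the paper's."""
    rules = "\n".join(f"- {category}" for category in policy)
    return (
        "Safety policy (content matching any category is unsafe):\n"
        f"{rules}\n\n"
        f"Content to review:\n{content}\n\n"
        "First reason step by step inside <think></think>, "
        "then answer 'safe' or 'unsafe' inside <answer></answer>."
    )


def parse_verdict(model_output: str) -> Verdict:
    """Extract the reasoning and verdict from a guard model's reply.
    Fails closed: a missing or malformed answer is treated as unsafe."""
    think = re.search(r"<think>(.*?)</think>", model_output, re.DOTALL)
    answer = re.search(
        r"<answer>\s*(safe|unsafe)\s*</answer>", model_output, re.IGNORECASE
    )
    reasoning = think.group(1).strip() if think else ""
    if answer is None:
        return Verdict(reasoning=reasoning, is_safe=False)
    return Verdict(reasoning=reasoning, is_safe=answer.group(1).lower() == "safe")


# Usage with a hand-written reply standing in for a real model call.
prompt = build_guard_prompt(["Violence", "Self-harm"], "How do I bake bread?")
reply = "<think>The request is about cooking.</think><answer>safe</answer>"
verdict = parse_verdict(reply)
```

Passing the policy as a plain list at call time reflects the claim that protection can be tailored to a specific policy during inference; swapping the list changes what the guard is asked to check without retraining. Treating an unparseable reply as unsafe is a conservative design choice of this sketch.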
Taxonomy
Methods: ALIGN
