RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning
Zeming Wei, Qiaosheng Zhang, Xia Hu, Xingcheng Xu

TL;DR
This paper introduces RAPO, a framework that enhances large reasoning models' safety by adaptively identifying and mitigating risks in their reasoning process, effectively defending against complex jailbreak attacks.
Contribution
It proposes a novel risk-aware preference optimization method that improves the generalization of safe reasoning in large models against diverse attack prompts.
Findings
RAPO improves safety generalization across various attack prompts.
The framework maintains the utility of large reasoning models.
Experimental results demonstrate robustness against complex jailbreak attacks.
Abstract
Large Reasoning Models (LRMs) have achieved tremendous success with their chain-of-thought (CoT) reasoning, yet also face safety issues similar to those of basic language models. In particular, while algorithms are designed to guide them to deliberately refuse harmful prompts with safe reasoning, this process often fails to generalize against diverse and complex jailbreak attacks. In this work, we attribute these failures to the generalization of the safe reasoning process, particularly their insufficiency against complex attack prompts. We provide both theoretical and empirical evidence to show the necessity of a more sufficient safe reasoning process to defend against advanced attack prompts. Building on this insight, we propose a Risk-Aware Preference Optimization (RAPO) framework that enables LRM to adaptively identify and address the safety risks with appropriate granularity in its…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
