Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Xiyue Peng; Hengquan Guo; Jiawei Zhang; Dongqing Zou; Ziyu Shao; Honghao Wei; Xin Liu

arXiv:2410.19933·cs.LG·February 28, 2025

TL;DR

This paper introduces Rectified Policy Optimization (RePO), a novel method that enforces safety constraints on every prompt in reinforcement learning with human feedback, improving safety alignment of large language models.

Contribution

RePO replaces expected safety constraints with prompt-wise safety constraints using rectified policy gradients, addressing safety trade-offs in LLM alignment.

Findings

01. RePO outperforms baseline safety methods.

02. RePO enhances safety across nearly all prompts.

03. RePO significantly improves LLM safety alignment.

Abstract

Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while framing safety as a constraint within a constrained Markov Decision Process (CMDP) framework. This paper identifies a potential issue with the widely adopted expected safety constraints for LLM safety alignment, termed "safety compensation": the constraints are satisfied in expectation, but individual prompts may trade off safety against one another, so that some responses are overly restrictive while others remain unsafe. To address this issue, we propose Rectified Policy Optimization (RePO), which replaces the expected safety constraint with critical safety constraints imposed on every prompt. At the core of RePO is a policy update mechanism driven…
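The "safety compensation" effect can be illustrated with a small numeric sketch. This is not the paper's implementation: the per-prompt cost values, the zero threshold, and the function names `expected_violation` and `rectified_violation` are all hypothetical, and "rectified" is assumed here to mean clipping each prompt's violation at zero before averaging.

```python
def expected_violation(costs, threshold=0.0):
    """Violation of an expected constraint: average first, then clip at zero."""
    mean_cost = sum(costs) / len(costs)
    return max(0.0, mean_cost - threshold)


def rectified_violation(costs, threshold=0.0):
    """Assumed prompt-wise variant: clip each prompt's violation at zero, then average."""
    return sum(max(0.0, c - threshold) for c in costs) / len(costs)


# Hypothetical per-prompt safety costs (positive = unsafe, negative = safe).
costs = [-2.0, -1.0, 1.0, 1.5]

# Very safe prompts offset unsafe ones, so the expected constraint looks satisfied.
print(expected_violation(costs))   # 0.0

# Clipping per prompt removes that offset, so the two unsafe prompts still register.
print(rectified_violation(costs))  # 0.625
```

In this toy setting the average cost is negative even though half of the prompts are unsafe, which is the trade-off the abstract describes; the clipped version stays positive until every prompt meets the threshold.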

Code & Models

Repositories

pxywatermoon/repo (PyTorch, official)

Videos

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization (SlidesLive)

Taxonomy

Topics: Reinforcement Learning in Robotics