Safe RLHF: Safe Reinforcement Learning from Human Feedback

Josef Dai; Xuehai Pan; Ruiyang Sun; Jiaming Ji; Xinbo Xu; Mickel Liu; Yizhou Wang; Yaodong Yang

arXiv:2310.12773·cs.AI·October 20, 2023·20 cites

TL;DR

This paper introduces Safe RLHF, a novel reinforcement learning algorithm that improves large language models by balancing helpfulness and harmlessness through separate reward and cost models, formalized as a constrained optimization problem.

Contribution

Safe RLHF explicitly decouples helpfulness and harmlessness preferences, enabling more effective training of aligned language models with improved safety and performance.
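
As a rough illustration of how two separately annotated preference dimensions could feed two separate models, the sketch below uses a standard Bradley-Terry pairwise loss. This is an assumption made for illustration: the summary here does not state the paper's exact loss functions, and the function name and the example scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Assumed here for illustration; not confirmed as the paper's exact loss.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The same response pair can carry two independent labels (hypothetical scores):
# one for which response is more helpful, one for which is more harmful.
reward_loss = pairwise_preference_loss(score_preferred=1.2, score_rejected=0.3)
cost_loss = pairwise_preference_loss(score_preferred=0.8, score_rejected=-0.5)
print(reward_loss, cost_loss)
```

Keeping the two losses separate means that a response judged more helpful but also more harmful produces no conflicting signal inside a single model.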

Findings

01. Safe RLHF effectively reduces harmful responses in LLMs.

02. Fine-tuning Alpaca-7B with Safe RLHF enhances helpfulness and safety.

03. The method outperforms existing value-aligned algorithms in experiments.

Abstract

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance…
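
The constrained formulation in the abstract (maximize reward subject to a cost constraint, solved with a Lagrangian method) can be sketched in a generic form. The code below is a minimal sketch of a textbook Lagrangian dual update, not the paper's implementation: the function names, the `cost_limit` threshold, and the learning rate are all assumptions introduced for illustration.

```python
def penalized_reward(reward: float, cost: float, lam: float) -> float:
    """Per-sample training signal for the policy: reward minus weighted cost.

    `lam` is the Lagrange multiplier (hypothetical name), kept >= 0.
    """
    return reward - lam * cost


def update_multiplier(lam: float, mean_cost: float, cost_limit: float, lr: float) -> float:
    """One projected gradient-ascent step on the multiplier.

    The multiplier grows when the average cost exceeds the limit and
    shrinks (never below zero) when the constraint is satisfied.
    """
    return max(0.0, lam + lr * (mean_cost - cost_limit))


# Toy loop with made-up batch statistics: cost starts above the limit.
lam = 0.0
for mean_cost in [0.8, 0.6, 0.3, -0.1, -0.4]:
    lam = update_multiplier(lam, mean_cost, cost_limit=0.0, lr=0.5)
    print(f"mean_cost={mean_cost:+.1f} -> lambda={lam:.2f}")
```

Because the multiplier is updated from the observed cost rather than fixed in advance, the weight on harmlessness rises while the policy violates the constraint and relaxes once it is satisfied, which is the dynamic balancing the abstract describes.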

Peer Reviews

Decision·ICLR 2024 spotlight

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 4

Strengths

With safety being an important aspect in LLMs, this paper tackles an important question -- how to do value alignment under both the safety and usefulness axes. The paper is well written and explains the methodology involved clearly. Even if the techniques to accommodate safety costs into RLHF are simple and straightforward, the paper does a good job in explaining the motivation behind the choices and conducts careful ablations to demonstrate the motivations behind these choices. The evaluation m

Weaknesses

One thing that I feel the paper could do a better job of is to incorporate more safe RLHF baselines. For example, Constitutional AI [1] tackles a very similar problem balancing helpfulness and harmlessness. The only ablations that I can see are fixed lambda (the reward shaping approach) and the approach used in Sparrow. I would have loved to see one or two more safe RLHF approaches that do not need to toe the line of conventional RLHF exactly. The improvement achieved in the RLHF stag

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. Separating rewards and costs is an excellent idea that probably resolves the optimization contradiction in RLHF for LLMs. 2. The paper provides concrete experimental results demonstrating the effectiveness of Safe RLHF in enhancing model performance and reducing harmful responses.

Weaknesses

Minor suggestions in Questions.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

1. This paper is well-written and easy to follow. The authors present a well-defined methodology, including a clear description of the Safe RLHF pipeline, preference annotation process, and training algorithms for reward and cost models. 2. Given the societal impact of LLMs, ensuring their safety and usefulness is of utmost importance. Safe RLHF presents a significant contribution by effectively aligning human values with model behavior, addressing an essential concern in AI research.

Weaknesses

1. The technical contributions seem incremental to me. The decoupling of rewards into rewards and costs is a standard formulation in CMDP, and the Lagrangian methods with RL are not new at all. 2. Another concern in this paper is that I don't think there is an appropriate cost threshold and cost-reward trade-off in the LLM alignment settings. I think safety is clearly a priority when compared with preferences. So that being said, how do you define how much safety LLMs are to trade off the prefe

Code & Models

Repositories

pku-alignment/safe-rlhf
PyTorch · Official

Videos

Safe RLHF: Safe Reinforcement Learning from Human Feedback · SlidesLive

Taxonomy

Topics: Software Engineering Research · Explainable Artificial Intelligence (XAI) · Topic Modeling