Stabilizing RLHF through Advantage Model and Selective Rehearsal

Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, and Dong Yu

arXiv:2309.10202·cs.CL·September 20, 2023·2 cites

Open Access

TL;DR

This paper introduces two novel techniques, Advantage Model and Selective Rehearsal, to enhance the stability and performance of RLHF training for large language models, effectively reducing reward hacking and catastrophic forgetting.

Contribution

The paper presents innovative methods to stabilize RLHF training by modeling advantage scores and strategically selecting rehearsal data, improving reward alignment and training robustness.

Findings

1. Increased stability in RLHF training.

2. Higher reward scores and win rates.

3. Reduced reward hacking and catastrophic forgetting.

Abstract

Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting. In this technical report, we propose two innovations to stabilize RLHF training: 1) Advantage Model, which directly models the advantage score (i.e., the extra reward relative to the expected reward) and regulates score distributions across tasks to prevent reward hacking; and 2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsal. Our experimental analysis on public and proprietary datasets reveals that the proposed methods not only increase stability in RLHF training but also achieve higher reward scores and win rates.
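
To make the abstract's definition of an advantage score concrete, the following is a minimal sketch, not the paper's method. The paper trains a model to output advantage scores directly; here the expected reward is simply approximated by the empirical mean reward per task, which is an assumption made for illustration. The function name `advantage_scores` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def advantage_scores(samples):
    """Return reward minus the mean reward of the same task, per sample.

    `samples` is a list of (task, reward) pairs. Using the per-task
    empirical mean as the "expected reward" is an illustrative assumption,
    not the learned Advantage Model described in the paper.
    """
    rewards_by_task = defaultdict(list)
    for task, reward in samples:
        rewards_by_task[task].append(reward)

    # Baseline: one expected-reward estimate per task.
    baseline = {task: mean(rs) for task, rs in rewards_by_task.items()}

    # Advantage: how much better or worse a response is than the baseline.
    return [reward - baseline[task] for task, reward in samples]


# Hypothetical raw rewards on very different scales for two tasks.
samples = [("qa", 1.0), ("qa", 3.0), ("code", 10.0), ("code", 14.0)]
print(advantage_scores(samples))  # [-1.0, 1.0, -2.0, 2.0]
```

Subtracting a per-task baseline centers the scores of every task around zero, so a task with uniformly large raw rewards no longer dominates. This gives an intuition for why regulating score distributions across tasks may help against reward hacking.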

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: Entropy Regularization · Proximal Policy Optimization