Stabilizing RLHF through Advantage Model and Selective Rehearsal
Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, and Dong Yu

TL;DR
This paper introduces two novel techniques, Advantage Model and Selective Rehearsal, to enhance the stability and performance of RLHF training for large language models, effectively reducing reward hacking and catastrophic forgetting.
Contribution
The paper stabilizes RLHF training with two methods: an Advantage Model that models advantage scores rather than raw rewards, and Selective Rehearsal, which strategically selects rehearsal data. Together they improve reward alignment and training robustness.
Findings
Increased stability in RLHF training.
Higher reward scores and win rates.
Reduced reward hacking and catastrophic forgetting.
Abstract
Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting. In this technical report, we propose two innovations to stabilize RLHF training: (1) the Advantage Model, which directly models the advantage score, i.e., the extra reward relative to the expected reward, and regulates score distributions across tasks to prevent reward hacking; and (2) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsing. Our experimental analysis on public and proprietary datasets reveals that the proposed methods not only increase stability in RLHF training but also achieve higher reward scores and win rates.
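The abstract's definition of the advantage score (extra reward relative to the expected reward) can be illustrated with a minimal sketch. This is not the paper's method: the paper trains a model to predict the advantage, whereas the sketch below assumes the expected reward can be approximated by the empirical mean reward of each task. The function name `advantage_scores` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def advantage_scores(samples):
    """Return reward minus the mean reward of the same task, per sample.

    `samples` is a list of (task, reward) pairs. The per-task mean is an
    assumed stand-in for the "expected reward" named in the abstract.
    """
    rewards_by_task = defaultdict(list)
    for task, reward in samples:
        rewards_by_task[task].append(reward)

    # Empirical baseline per task (assumption: approximates expected reward).
    baseline = {task: mean(rs) for task, rs in rewards_by_task.items()}

    # Advantage = reward - expected reward for that task.
    return [reward - baseline[task] for task, reward in samples]


# Hypothetical raw rewards: the two tasks sit on very different scales.
samples = [("qa", 2.0), ("qa", 4.0), ("code", 10.0), ("code", 14.0)]
print(advantage_scores(samples))
```

Under this assumption, the scores for every task are centered on zero, so a task whose raw rewards happen to be large no longer dominates the others. Subtracting a mean does not equalize the spread of scores across tasks; the abstract's regulation of score distributions is not reproduced here.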
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: Entropy Regularization · Proximal Policy Optimization
