Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

Rafael Rafailov; Yaswanth Chittepu; Ryan Park; Harshit Sikchi; Joey Hejna; Bradley Knox; Chelsea Finn; Scott Niekum

arXiv:2406.02900 · cs.LG · November 6, 2024 · 3 cites

TL;DR

This paper investigates reward over-optimization in Direct Alignment Algorithms (DAAs). Through extensive empirical analysis, it shows that DAAs degrade in much the same way as classical RLHF methods, especially at higher KL budgets.

Contribution

The paper formally defines and analyzes reward over-optimization in DAAs, a previously underexplored problem, and shows that it occurs across training regimes and model scales.

Findings

01. DAAs exhibit performance deterioration at higher KL budgets.

02. Reward over-optimization occurs even before completing a single dataset epoch.

03. Degradation patterns in DAAs are similar to those in classical RLHF methods.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained to represent human preferences, which is in turn used by an online reinforcement learning (RL) algorithm to optimize the LLM. A prominent issue with such methods is reward over-optimization or reward hacking, where performance as measured by the learned proxy reward model increases, but true quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs) like Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline by circumventing the reward modeling phase. However, although DAAs do not use a separate proxy reward model, they still commonly deteriorate from over-optimization. While the so-called reward…
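To make concrete how a DAA can skip the separate reward model, here is a minimal sketch of the standard Direct Preference Optimization objective for a single preference pair. It is an illustration under stated assumptions, not code from the paper: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical, and the inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names and the default beta are illustrative assumptions.
    Inputs are sequence log-probabilities (floats).
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward


# Policy equal to the reference: zero margin, loss is log(2).
loss_at_init, _, _ = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy that raised the chosen log-prob and lowered the rejected one.
loss_after, r_chosen, r_rejected = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The implicit rewards here are log-ratios against the reference model, so pushing the loss lower also moves the policy further from the reference. That distance is what a KL budget measures, and it is the axis along which the findings above report deterioration.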

Taxonomy

Topics: Advanced Control Systems Optimization · Simulation Techniques and Applications · Statistical and Computational Modeling