REBEL: Reinforcement Learning via Regressing Relative Rewards

Zhaolin Gao; Jonathan D. Chang; Wenhao Zhan; Owen Oertell; Gokul Swamy; Kianté Brantley; Thorsten Joachims; J. Andrew Bagnell; Jason D. Lee; Wen Sun

arXiv:2404.16767 · cs.LG · December 11, 2024 · 2 cites

TL;DR

REBEL is a minimalist reinforcement learning algorithm that reduces policy optimization to regressing relative rewards. It matches or surpasses PPO's performance while being simpler to implement and more computationally efficient.

Contribution

The paper introduces REBEL, a lightweight RL algorithm that reduces policy optimization to regressing relative rewards, and supports it with theoretical guarantees and empirical results on language and image tasks.

Findings

01 REBEL matches or exceeds PPO and DPO performance.

02 REBEL is simpler to implement and more computationally efficient.

03 REBEL performs well in fine-tuning large language models.

Abstract

While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which…
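
The regression described in the abstract can be sketched in a few lines. The abstract is cut off before it states the objective, so the form below is an assumption based on the published REBEL objective: a squared error between the scaled difference of policy log-ratios for two completions and the difference of their rewards. The function name `rebel_loss`, its argument names, and the step-size parameter `eta` are hypothetical, and the inputs are plain lists of sequence-level log-probabilities and scalar rewards rather than model outputs.

```python
from typing import Sequence


def rebel_loss(
    logp_new_a: Sequence[float],  # log pi_theta(y | x) for the first completion
    logp_old_a: Sequence[float],  # log pi_t(y | x) under the previous policy
    logp_new_b: Sequence[float],  # log pi_theta(y' | x) for the second completion
    logp_old_b: Sequence[float],  # log pi_t(y' | x) under the previous policy
    reward_a: Sequence[float],    # r(x, y)
    reward_b: Sequence[float],    # r(x, y')
    eta: float = 1.0,             # assumed step-size parameter scaling the log-ratios
) -> float:
    """Mean squared error between the scaled log-ratio gap and the reward gap."""
    n = len(reward_a)
    total = 0.0
    for i in range(n):
        # Difference of log-probability ratios between the two completions.
        log_ratio_gap = (logp_new_a[i] - logp_old_a[i]) - (logp_new_b[i] - logp_old_b[i])
        # Relative reward: how much better the first completion is than the second.
        reward_gap = reward_a[i] - reward_b[i]
        total += (log_ratio_gap / eta - reward_gap) ** 2
    return total / n


# Unchanged policy, reward gap of 1: the regression target is missed by exactly 1.
print(rebel_loss([-1.0], [-1.0], [-2.0], [-2.0], [1.0], [0.0]))
```

In a real fine-tuning loop the log-probabilities would come from the current and previous policy models and the loss would be minimized by gradient descent; this sketch only evaluates the objective on fixed numbers.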

Code & Models

Videos

REBEL: Reinforcement Learning via Regressing Relative Rewards · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization