REBEL: Reinforcement Learning via Regressing Relative Rewards
Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun

TL;DR
REBEL is a minimalist reinforcement learning algorithm that simplifies policy optimization by regressing relative rewards, matching or surpassing PPO's performance while being easier to implement and more efficient.
Contribution
The paper introduces REBEL, a lightweight RL algorithm that reduces policy optimization to relative reward regression, with strong theoretical guarantees and practical success in language and image tasks.
Findings
- REBEL matches or exceeds PPO and DPO performance.
- REBEL is simpler to implement and more computationally efficient.
- REBEL performs well in fine-tuning large language models.
Abstract
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping), and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt in terms of the policy, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which…
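The regression described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the squared-error form of the objective: for two completions to the same prompt, the difference between their log-probability ratios (new policy over previous policy), scaled by 1/eta, is regressed onto the difference between their rewards. The function name `rebel_loss`, its argument names, and the use of plain Python floats for sequence-level log-probabilities are hypothetical choices made for this example, not the authors' code.

```python
from typing import Sequence


def rebel_loss(
    logp_new_a: Sequence[float],
    logp_old_a: Sequence[float],
    logp_new_b: Sequence[float],
    logp_old_b: Sequence[float],
    reward_a: Sequence[float],
    reward_b: Sequence[float],
    eta: float = 1.0,
) -> float:
    """Mean squared error between the scaled log-ratio gap and the reward gap.

    Each argument holds one value per prompt: the summed log-probability of
    completion a or b under the new or previous policy, or its scalar reward.
    `eta` is an assumed step-size-like scaling parameter.
    """
    n = len(reward_a)
    if n == 0:
        raise ValueError("need at least one pair of completions")

    total = 0.0
    for lna, loa, lnb, lob, ra, rb in zip(
        logp_new_a, logp_old_a, logp_new_b, logp_old_b, reward_a, reward_b
    ):
        # How much more the new policy up-weights completion a than b,
        # relative to the previous policy.
        log_ratio_gap = (lna - loa) - (lnb - lob)
        predicted_gap = log_ratio_gap / eta
        # Regression target: the relative reward of a over b.
        target_gap = ra - rb
        total += (predicted_gap - target_gap) ** 2
    return total / n
```

In this sketch no value network or clipping term appears; the only learned quantity is the policy's log-probability, which is consistent with the abstract's description of a lightweight implementation. In practice the log-probabilities would come from a differentiable model so the loss could be minimized by gradient descent.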
Code & Models
- 🤗 Cornell-AGI/REBEL-OpenChat-3.5 · model · 14 dl · ♡ 1
- 🤗 Cornell-AGI/REBEL-Llama-3 · model · 2 dl · ♡ 1
- 🤗 Cornell-AGI/REBEL-Llama-3-epoch_2 · model · 8 dl · ♡ 3
- 🤗 Cornell-AGI/REBEL-Llama-3-Armo-iter_1 · model · 2 dl · ♡ 1
- 🤗 Cornell-AGI/REBEL-Llama-3-Armo-iter_2 · model · ♡ 1
- 🤗 Cornell-AGI/REBEL-Llama-3-Armo-iter_3 · model · 1 dl · ♡ 2
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization
