ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, Zhi-Quan Luo

TL;DR
ReMax is a simplified and efficient reinforcement learning method for aligning large language models, reducing complexity and resource usage while achieving state-of-the-art results.
Contribution
ReMax introduces a new RLHF approach based on REINFORCE that is simpler, more memory-efficient, and outperforms PPO in training large language models.
Findings
ReMax reduces GPU memory usage by 46% compared to PPO.
ReMax achieves a 94.78% win rate on the AlpacaEval leaderboard.
ReMax sets a new SOTA for open-source 7B models on MT-bench.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is key to aligning Large Language Models (LLMs), typically paired with the Proximal Policy Optimization (PPO) algorithm. While PPO is a powerful method designed for general reinforcement learning tasks, it is overly sophisticated for LLMs, leading to laborious hyper-parameter tuning and significant computation burdens. To make RLHF efficient, we present ReMax, which leverages 3 properties of RLHF: fast simulation, deterministic transitions, and trajectory-level rewards. These properties are not exploited in PPO, making it less suitable for RLHF. Building on the renowned REINFORCE algorithm, ReMax does not require training an additional value model as in PPO and is further enhanced with a new variance reduction technique. ReMax offers several benefits over PPO: it is simpler to implement, eliminates more than 4 hyper-parameters in PPO,…
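The abstract names REINFORCE plus a variance reduction technique but does not spell the technique out in the excerpt above. The following is a minimal sketch, not the authors' implementation. It assumes the baseline is the reward of the greedy (argmax) response, which is how ReMax is commonly described. The tabular per-position policy, `toy_reward`, `greedy_baseline_gradient`, and `ascent_step` are hypothetical names introduced for illustration only. The sketch uses no value model, and the reward is scored once per full response.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_response(logits_per_step, rng):
    """Sample one token per position (a cheap simulated rollout)."""
    return [
        rng.choices(range(len(logits)), weights=softmax(logits))[0]
        for logits in logits_per_step
    ]


def greedy_response(logits_per_step):
    """Pick the highest-probability token at every position."""
    return [
        max(range(len(logits)), key=logits.__getitem__)
        for logits in logits_per_step
    ]


def toy_reward(response, target=2):
    """Hypothetical trajectory-level reward: one score for the whole response."""
    return float(sum(1 for token in response if token == target))


def greedy_baseline_gradient(logits_per_step, sampled, reward_fn=toy_reward):
    """REINFORCE gradient w.r.t. the logits, using the greedy response's
    reward as the baseline (an assumption, see the lead-in)."""
    advantage = reward_fn(sampled) - reward_fn(greedy_response(logits_per_step))
    grads = []
    for logits, token in zip(logits_per_step, sampled):
        probs = softmax(logits)
        # d log softmax(logits)[token] / d logits = onehot(token) - probs
        grads.append([
            advantage * ((1.0 if i == token else 0.0) - p)
            for i, p in enumerate(probs)
        ])
    return advantage, grads


def ascent_step(logits_per_step, rng, lr=0.1):
    """One stochastic gradient-ascent update from a single sampled response."""
    sampled = sample_response(logits_per_step, rng)
    _, grads = greedy_baseline_gradient(logits_per_step, sampled)
    return [
        [x + lr * g for x, g in zip(logits, grad_row)]
        for logits, grad_row in zip(logits_per_step, grads)
    ]
```

Calling `ascent_step(logits, random.Random(0))` repeatedly would perform training on this toy problem. When the sampled response scores the same as the greedy one, the advantage is zero and that sample contributes no update.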
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: ZeRO · Entropy Regularization · Proximal Policy Optimization · REINFORCE
