ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models

Ziniu Li; Tian Xu; Yushun Zhang; Zhihang Lin; Yang Yu; Ruoyu Sun; Zhi-Quan Luo

arXiv:2310.10505·cs.LG·May 17, 2024·6 cites

TL;DR

ReMax is a simplified and efficient reinforcement learning method for aligning large language models, reducing complexity and resource usage while achieving state-of-the-art results.

Contribution

ReMax introduces a REINFORCE-based RLHF approach that is simpler and more memory-efficient than PPO and outperforms it when training large language models.

Findings

01. ReMax reduces GPU memory usage by 46% compared to PPO.

02. ReMax achieves a 94.78% win rate on the AlpacaEval leaderboard.

03. ReMax sets a new SOTA for open-source 7B models on MT-bench.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is key to aligning Large Language Models (LLMs), typically paired with the Proximal Policy Optimization (PPO) algorithm. While PPO is a powerful method designed for general reinforcement learning tasks, it is overly sophisticated for LLMs, leading to laborious hyper-parameter tuning and significant computation burdens. To make RLHF efficient, we present ReMax, which leverages 3 properties of RLHF: fast simulation, deterministic transitions, and trajectory-level rewards. These properties are not exploited in PPO, making it less suitable for RLHF. Building on the renowned REINFORCE algorithm, ReMax does not require training an additional value model as in PPO and is further enhanced with a new variance reduction technique. ReMax offers several benefits over PPO: it is simpler to implement, eliminates more than 4 hyper-parameters in PPO,…
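
The abstract names a variance reduction technique without spelling it out. As we understand it from the paper, ReMax subtracts the reward of the greedy (argmax) response from the reward of the sampled response, so no value model is needed. The following is a minimal toy sketch of that idea, assuming a tabular softmax policy with independent logits per step and a trajectory-level reward. The names `remax_style_gradient`, `reward_fn`, and `logits_per_step` are hypothetical and do not come from the authors' code.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_response(logits_per_step, rng):
    """Sample one token index per step from the softmax policy."""
    return [
        rng.choices(range(len(logits)), weights=softmax(logits))[0]
        for logits in logits_per_step
    ]


def greedy_response(logits_per_step):
    """Pick the highest-logit token at every step (first index wins ties)."""
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in logits_per_step]


def remax_style_gradient(logits_per_step, reward_fn, rng):
    """REINFORCE gradient estimate with a greedy-response baseline.

    Assumption: reward_fn scores a whole response (trajectory-level reward).
    Returns (gradient w.r.t. each step's logits, baseline-subtracted reward).
    """
    sampled = sample_response(logits_per_step, rng)
    greedy = greedy_response(logits_per_step)
    # Baseline-subtracted reward: sampled response vs. greedy response.
    advantage = reward_fn(sampled) - reward_fn(greedy)
    grads = []
    for logits, token in zip(logits_per_step, sampled):
        probs = softmax(logits)
        # d log pi(token) / d logits = onehot(token) - probs
        grads.append([
            advantage * ((1.0 if i == token else 0.0) - p)
            for i, p in enumerate(probs)
        ])
    return grads, advantage


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    # Toy reward: number of times token 0 appears in the response.
    grads, adv = remax_style_gradient(
        toy_logits, lambda toks: float(toks.count(0)), rng
    )
    print(adv, grads)
```

In this sketch the baseline costs one extra greedy generation per prompt rather than a trained value model, which is where the property of fast simulation mentioned in the abstract comes in. A real implementation would backpropagate through a language model's log-probabilities instead of using hand-written logit gradients.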

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: ZeRO · Entropy Regularization · Proximal Policy Optimization · REINFORCE