Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker

TL;DR
This paper demonstrates that simpler REINFORCE-style optimization methods outperform PPO and other recent approaches in RLHF for LLMs, offering a more efficient and effective way to align models with human preferences.
Contribution
The paper shows that many components of PPO are unnecessary for RLHF and that simpler REINFORCE variants can achieve better performance with lower computational cost.
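For orientation, the textbook policy-gradient estimator that "REINFORCE-style" refers to can be written as below. This is the standard form with a baseline, treating a whole completion y as one action for prompt x; the symbols (\pi_\theta for the policy, R for the reward, b for the baseline, \mathcal{D} for the prompt distribution) are notation assumed here and are not quoted from the paper.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \bigl( R(x, y) - b \bigr)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]
```

Any baseline b that does not depend on the sampled y leaves the estimator unbiased and only changes its variance, which is why the choice of baseline is the main design decision in such variants.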
Findings
REINFORCE variants outperform PPO in RLHF tasks
Simpler methods reduce computational costs significantly
Alignment with human preferences is improved using basic RL techniques
Abstract
AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly…
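As a rough illustration of what a REINFORCE-style update can look like in practice, the sketch below computes a leave-one-out baseline over several sampled completions for the same prompt and forms the corresponding surrogate loss. It is a generic sketch, not the authors' implementation: the function names, the use of plain Python lists, and the choice of a leave-one-out baseline are all assumptions made for illustration.

```python
from typing import List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each sample against the mean reward of the *other* samples.

    `rewards` holds one scalar reward per completion sampled for one prompt
    (hypothetical input; a reward model would normally produce these).
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the average of the remaining k - 1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]


def reinforce_surrogate_loss(logprobs: List[float], advantages: List[float]) -> float:
    """Surrogate loss: minus the mean of advantage * sequence log-probability.

    In a real training loop `logprobs` would be differentiable tensors and the
    advantages would be treated as constants; here both are plain floats.
    """
    if len(logprobs) != len(advantages):
        raise ValueError("logprobs and advantages must have the same length")
    return -sum(a * lp for a, lp in zip(advantages, logprobs)) / len(logprobs)


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0]      # toy rewards for three completions
    logprobs = [-1.0, -2.0, -3.0]  # toy sequence log-probabilities
    adv = leave_one_out_advantages(rewards)
    print(adv)                                      # [-1.5, 0.0, 1.5]
    print(reinforce_surrogate_loss(logprobs, adv))  # 1.0
```

Compared with PPO, a sketch like this needs no learned value network and no clipping of the probability ratio, which is the kind of simplification the abstract argues for.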
Code & Models
- 🤗 blakenp/Qwen2.5-1.5B-Policy (model · 1 dl)
- 🤗 blakenp/Qwen2.5-1.5B-Policy2 (model · 4 dl)
- 🤗 SaminSkyfall/rloo (model · 6 dl)
- 🤗 leobianco/npov_PERL_google_S200898_eps10000_lr2e-5_kl1e-4_2507031331 (model)
- 🤗 Prathyusha101/qwen2-0.5b-rl00 (model)
- 🤗 Prathyusha101/qwen2-0.5b-REINFORCE-no-baseline-kl-disabled (model · 1 dl)
- 🤗 leobianco/bosch_PERL_google_S130104_eps10000_lr2e-5_kl1e-4_2510291047 (model)
- 🤗 leobianco/bosch_PERL_google_S051179_eps10000_lr2e-5_kl1e-4_2510310721 (model)
- 🤗 leobianco/bosch_PERL_google_S130104_eps10000_lr2e-5_kl1e-4_2510310722 (model)
- 🤗 leobianco/bosch_PERL_google_S200898_eps10000_lr2e-5_kl1e-4_2511051107 (model)
Taxonomy
Topics: Statistical and Computational Modeling · Artificial Intelligence in Law · Imbalanced Data Classification Techniques
Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization
