Reinforcement Learning for LLM Post-Training: A Survey
Zhichao Wang, Kiran Ramnath, Bin Bi, Shiva Kumar Pentyala, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng

TL;DR
This survey comprehensively reviews reinforcement learning-based post-training methods for large language models, unifying various approaches under a single policy gradient framework and providing detailed technical comparisons.
Contribution
It introduces a unified policy gradient framework for RLHF and RLVR methods, offering detailed analysis, standardized notation, and empirical comparisons to serve as a technical reference.
Findings
Unified framework connects pretraining, SFT, RLHF, and RLVR.
Detailed analysis of PPO, GRPO, and DPO methods.
Empirical results and implementation details compared.
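For readers unfamiliar with the methods named above, here is a minimal sketch of two of their core quantities in their commonly published forms: the per-pair DPO loss and the group-normalized advantage used by GRPO. It is an illustration under stated assumptions, not the survey's own notation or code. The function names, the scalar sequence-level log-probability inputs, the default `beta`, and the `eps` stabilizer are all hypothetical choices made for this example.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    Inputs are sequence-level log-probabilities under the policy and under a
    frozen reference model (assumed here to be plain floats).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps).

    `rewards` holds one scalar reward per sampled response to the same prompt.
    The population standard deviation and the `eps` term are assumptions of
    this sketch; implementations differ on both.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Policy equals reference, so the margin is zero and the loss is log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Two correct and two incorrect responses under a binary verifiable reward.
    print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

In this sketch the DPO loss falls below log(2) once the policy prefers the chosen response more strongly than the reference does, and the GRPO advantages are positive for above-average responses in the group and negative for below-average ones.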
Abstract
Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful and misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training methods, including Reinforcement Learning from Human Feedback (RLHF) methods like Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches like PPO and GRPO, have made remarkable progress in alleviating these issues. Yet no existing work offers a technically detailed comparison of the various methods driving this progress. To fill this gap, we present a timely survey that connects foundational components with the latest advancements. We derive a single policy gradient framework that unifies pretraining, SFT, RLHF, and RLVR as special cases while also organizing the more recent techniques therein. The main…
