Reinforcement Learning for LLM Post-Training: A Survey

Zhichao Wang; Kiran Ramnath; Bin Bi; Shiva Kumar Pentyala; Sougata Chaudhuri; Shubham Mehrotra; Zixu (James) Zhu; Xiang-Bo Mao; Sitaram Asur; Na (Claire) Cheng

arXiv:2407.16216 · cs.CL · May 19, 2026

TL;DR

This survey comprehensively reviews reinforcement learning-based post-training methods for large language models, unifying various approaches under a single policy gradient framework and providing detailed technical comparisons.

Contribution

It introduces a unified policy gradient framework for RLHF and RLVR methods, offering detailed analysis, standardized notation, and empirical comparisons to serve as a technical reference.

Findings

01. Unified framework connects pretraining, SFT, RLHF, and RLVR.
02. Detailed analysis of PPO, GRPO, and DPO methods.
03. Empirical results and implementation details compared.

Abstract

Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful and misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training methods, including Reinforcement Learning from Human Feedback (RLHF) methods like Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches like PPO and GRPO, have made remarkable progress in alleviating these issues. Yet no existing work offers a technically detailed comparison of the various methods driving this progress. To fill this gap, we present a timely survey that connects foundational components with the latest advancements. We derive a single policy gradient framework that unifies pretraining, SFT, RLHF, and RLVR as special cases while also organizing the more recent techniques therein. The main…

Code & Models

Datasets

BAAI/SurveyScope
dataset · 30 dl
