Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
Shiwan Zhao, Zhihu Wang, Xuyang Zhao, Jiaming Zhou, Caiyue Xu, Chenfei Liu, Liting Zhang, Yuhang Jia, Yanzhe Zhang, Hualong Yu, Zichen Xu, Qicheng Li, Yong Qin

TL;DR
This survey offers a unified framework for understanding large language model post-training, emphasizing structured interventions like support expansion and policy reshaping across different training regimes.
Contribution
It introduces a comprehensive view organizing post-training methods by behavioral bottlenecks and trajectory provenance, unifying diverse techniques under a systems-level perspective.
Findings
Supports expansion and reshaping are key roles in post-training.
Distillation is better viewed as behavioral consolidation.
Hybrid multi-stage pipelines are increasingly important.
Abstract
Post-training has become central to turning pretrained large language models (LLMs) into aligned, capable, and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objectives rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary regimes: off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
