Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

Shiwan Zhao; Zhihu Wang; Xuyang Zhao; Jiaming Zhou; Caiyue Xu; Chenfei Liu; Liting Zhang; Yuhang Jia; Yanzhe Zhang; Hualong Yu; Zichen Xu; Qicheng Li; Yong Qin

arXiv:2604.07941 · cs.CL · April 17, 2026

TL;DR

This survey offers a unified framework for large language model post-training, treating it as structured intervention on model behavior, through roles such as support expansion and policy reshaping, across off-policy and on-policy training regimes.

Contribution

The survey organizes post-training methods by the behavioral bottlenecks they address and by trajectory provenance, unifying diverse techniques under a systems-level perspective.

Findings

01

Support expansion and policy reshaping are key roles in post-training.

02

Distillation is better viewed as behavioral consolidation.

03

Hybrid multi-stage pipelines are increasingly important.

Abstract

Post-training has become central to turning pretrained large language models (LLMs) into aligned, capable, and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objectives rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary regimes: off-policy learning on externally supplied trajectories and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more…
