Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation
Dong Nie

TL;DR
This paper presents a state-distribution perspective on post-training methods for large language models, highlighting how the source and locality of training states influence retention and performance.
Contribution
It formalizes post-training as state-distribution shaping and provides empirical evidence that the choice of training states affects retention and task performance.
Findings
Mild SFT improves GSM8K with little forgetting
Stress SFT causes substantial retention loss
OPD from degraded SFT surpasses the teacher on multiple benchmarks
Abstract
Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss.…
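
The distinction the abstract draws between fixed dataset states and learner-induced states can be illustrated with a minimal sketch. This is not the paper's code: the token-tuple representation of a state, the `EOS` marker, the helper names `dataset_states` and `on_policy_states`, and the deterministic `toy_policy` are all assumptions introduced here for illustration.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[str, ...]  # prompt tokens followed by the generated prefix
EOS = "<eos>"  # hypothetical end-of-sequence marker


def dataset_states(prompt: Sequence[str], response: Sequence[str]) -> List[State]:
    """States seen under teacher forcing on a fixed demonstration (SFT-style)."""
    return [tuple(prompt) + tuple(response[:t]) for t in range(len(response))]


def on_policy_states(
    prompt: Sequence[str],
    policy: Callable[[State], str],
    max_new_tokens: int = 8,
) -> List[State]:
    """States seen when the learner extends its own prefix (RL/OPD-style)."""
    state: State = tuple(prompt)
    visited: List[State] = []
    for _ in range(max_new_tokens):
        visited.append(state)
        token = policy(state)
        if token == EOS:
            break
        state = state + (token,)
    return visited


def toy_policy(state: State) -> str:
    """Hypothetical deterministic learner: emits "x" once, then stops."""
    return EOS if state[-1] == "x" else "x"


sft = dataset_states(["q"], ["a", "b"])        # prefixes of the demonstration
rollout = on_policy_states(["q"], toy_policy)  # prefixes the learner produced
shared = set(sft) & set(rollout)               # only the bare prompt overlaps
```

In this toy setting the two procedures agree only on the initial state (the prompt alone); every later state differs because the learner's own tokens, not the demonstration's, form the prefix. Any loss applied at these states is therefore evaluated on a different distribution, independent of which objective is used.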
