Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Dong Nie

arXiv:2605.22731·cs.LG·May 22, 2026

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Dong Nie

PDF

TL;DR

This paper presents a state distribution perspective on post-training methods for large language models, highlighting how the source and locality of training states influence retention and performance.

Contribution

It formalizes post-training as state-distribution shaping and provides empirical evidence that state management impacts model retention and effectiveness.

Findings

01

Mild SFT improves GSM8K with little forgetting

02

Stress SFT causes substantial retention loss

03

OPD from degraded SFT surpasses the teacher on multiple benchmarks

Abstract

Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled smallscale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss.…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.