On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
Xueyan Niu, Bo Bai, Wei Han, Weixi Zhang

TL;DR
This paper proves that supervised fine-tuning and reinforcement learning cannot be decoupled in post-training of large language models without performance loss, supported by theoretical analysis and experiments.
Contribution
It gives the first theoretical analysis showing that SFT and RL cannot be decoupled in post-training, and validates the predictions experimentally on Qwen3-0.6B.
Findings
SFT-then-RL: RL increases SFT loss under both the distributional (KL-based) and landscape (PL-based) analyses.
RL-then-SFT: SFT lowers the reward achieved by RL under analogous conditions.
Experiments confirm the predicted degradation when decoupling SFT and RL.
Abstract
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under both distributional (KL-based) and landscape (PL-based) analyses; and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL under analogous conditions. Under the PL condition, we further derive the optimal RL duration that balances reward improvement against SFT…
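The tension between the two objectives described in the abstract can be seen in a toy setting. The sketch below is not the paper's experiment or proof; it assumes a three-outcome categorical policy parameterized by logits, a hypothetical expert distribution, and a hypothetical reward vector. It starts the policy at the SFT optimum (policy equal to expert), takes exact policy-gradient ascent steps on the expected reward, and records the cross-entropy loss and reward before and after.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sft_loss(logits, expert):
    # Cross-entropy between the expert distribution and the policy.
    return float(-(expert * np.log(softmax(logits))).sum())

def expected_reward(logits, reward):
    # Expected reward under the current policy.
    return float((softmax(logits) * reward).sum())

def rl_step(logits, reward, lr):
    # Exact gradient of E_pi[r] w.r.t. the logits: pi_i * (r_i - E_pi[r]).
    pi = softmax(logits)
    return logits + lr * pi * (reward - (pi * reward).sum())

# Hypothetical toy data (assumptions, not taken from the paper).
expert = np.array([0.7, 0.2, 0.1])   # expert response distribution
reward = np.array([0.0, 1.0, 0.0])   # reward prefers the second outcome

logits = np.log(expert)              # SFT optimum: policy equals expert
loss_before = sft_loss(logits, expert)
reward_before = expected_reward(logits, reward)

for _ in range(50):                  # RL phase: gradient ascent on reward
    logits = rl_step(logits, reward, lr=0.5)

loss_after = sft_loss(logits, expert)
reward_after = expected_reward(logits, reward)

print(f"SFT loss: {loss_before:.4f} -> {loss_after:.4f}")
print(f"Reward:   {reward_before:.4f} -> {reward_after:.4f}")
```

Because the cross-entropy is minimized exactly when the policy matches the expert, any reward-driven move away from that point raises the SFT loss in this toy. The same argument in reverse (SFT steps pulling a reward-optimized policy back toward the expert) lowers the reward whenever the expert does not already maximize it.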
