Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Miaosen Zhang, Yishan Liu, Shuxia Lin, Xu Yang, Qi Dai, Chong Luo, Weihao Jiang, Peng Hou, Anxiang Zeng, Xin Geng, Baining Guo

TL;DR
This paper introduces a framework for on-policy supervised fine-tuning (SFT) of large language models. It uses Distribution Discriminant Theory to align training data with the model's own distribution and improve generalization, outperforming offline RL methods such as DPO and SimPO.
Contribution
The paper proposes Distribution Discriminant Theory (DDT) and two techniques built on it, In-Distribution Finetuning (IDFT) and Hinted Decoding, which enable on-policy SFT and narrow the generalization gap with reinforcement learning.
Findings
Outperforms offline RL algorithms like DPO and SimPO in generalization.
Enhances SFT efficiency while achieving superior performance.
Provides practical methods for on-policy training in large language models.
Abstract
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first present Distribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) In-Distribution Finetuning (IDFT), a loss-level method to enhance generalization ability of SFT, and (ii) Hinted Decoding, a data-level technique that can re-align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance surpassing prominent offline RL algorithms, including DPO…
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Imbalanced Data Classification Techniques
