Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training

Miaosen Zhang; Yishan Liu; Shuxia Lin; Xu Yang; Qi Dai; Chong Luo; Weihao Jiang; Peng Hou; Anxiang Zeng; Xin Geng; Baining Guo

arXiv:2602.12222·cs.LG·March 17, 2026

Open Access

TL;DR

This paper introduces a framework for on-policy supervised fine-tuning (SFT) of large language models. It uses Distribution Discriminant Theory (DDT) to quantify how well training data aligns with the model's own distribution, and it reports better generalization than offline RL methods such as DPO and SimPO.

Contribution

The paper proposes Distribution Discriminant Theory (DDT) and two techniques built on it, In-Distribution Finetuning (IDFT) and Hinted Decoding, which enable on-policy SFT and narrow the generalization gap with reinforcement learning.

Findings

01. Outperforms offline RL algorithms like DPO and SimPO in generalization.

02. Enhances SFT efficiency while achieving superior performance.

03. Provides practical methods for on-policy training in large language models.

Abstract

Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on-policy data. We propose a framework to bridge this gap by enabling On-Policy SFT. We first present Distribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) In-Distribution Finetuning (IDFT), a loss-level method that enhances the generalization ability of SFT, and (ii) Hinted Decoding, a data-level technique that re-aligns the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance surpassing prominent offline RL algorithms, including DPO…

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Imbalanced Data Classification Techniques