Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
Xin Yu, Liuchen Liao, Yiwen Zhang, Yingchen Yu, Lingzhou Xue, Qinzhen Guo

TL;DR
This paper introduces PBSD, a preference-based self-distillation method that moves beyond fixed-teacher KL matching through reward regularization, improving training stability and performance in on-policy distillation.
Contribution
PBSD moves beyond fixed-teacher KL matching by incorporating reward regularization, leading to superior target policies and improved training stability.
Findings
PBSD outperforms baselines on reasoning and tool-use benchmarks.
PBSD demonstrates improved training stability over prior self-distillation methods.
PBSD maintains token efficiency across multiple model scales.
Abstract
On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, where the same model serves as both teacher and student under different prompt contexts. Yet, existing self-distillation methods largely reduce learning to KL matching toward the context-augmented teacher model. This approach often suffers from training instability and can degrade reasoning performance over time. Moreover, self-distillation from the same model with prompt augmentation lacks the exploratory diversity provided by a genuine external teacher. To address these limitations, we move beyond fixed-teacher KL matching and propose Preference-Based Self-Distillation (PBSD), which revisits on-policy…
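For readers unfamiliar with the baseline the abstract contrasts against, the following is a minimal sketch of token-level KL matching between a student and a context-augmented teacher. It is not PBSD and not the authors' code. The function names, the use of raw per-token logits as plain lists, and the choice of KL(student || teacher) as the direction are all assumptions made for illustration; the summary above does not state which direction or reduction the paper uses.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    The direction is an assumption for this sketch; swap the arguments
    to get the forward direction KL(teacher || student).
    """
    lp = log_softmax(student_logits)   # student log-probabilities
    lq = log_softmax(teacher_logits)   # teacher log-probabilities
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def sequence_kl_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token KL over a sampled sequence (hypothetical reduction)."""
    if not student_logits_seq:
        raise ValueError("sequence must contain at least one token")
    kls = [token_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)]
    return sum(kls) / len(kls)
```

In self-distillation both sets of logits would come from the same model, with the teacher's computed under an augmented prompt. Every token position contributes a loss term, which is the dense token-level signal the abstract refers to.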
