Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

Xin Yu; Liuchen Liao; Yiwen Zhang; Yingchen Yu; Lingzhou Xue; Qinzhen Guo

arXiv:2605.05040·cs.LG·May 7, 2026

TL;DR

This paper introduces PBSD, a preference-based self-distillation method that replaces fixed-teacher KL matching with a reward-regularized objective, improving training stability and performance in on-policy distillation.

Contribution

PBSD moves beyond fixed-teacher KL matching by incorporating reward regularization, leading to superior target policies and improved training stability.

Findings

01. PBSD outperforms baselines on reasoning and tool-use benchmarks.

02. PBSD demonstrates improved training stability over prior self-distillation methods.

03. PBSD maintains token efficiency across multiple model scales.

Abstract

On-policy distillation is an efficient alternative to reinforcement learning, offering dense token-level training signals. However, its reliance on a stronger external teacher has driven recent work on on-policy self-distillation, where the same model serves as both teacher and student under different prompt contexts. Yet, existing self-distillation methods largely reduce learning to KL matching toward the context-augmented teacher model. This approach often suffers from training instability and can degrade reasoning performance over time. Moreover, self-distillation from the same model with prompt augmentation lacks the exploratory diversity provided by a genuine external teacher. To address these limitations, we move beyond fixed-teacher KL matching and propose Preference-Based Self-Distillation (PBSD), which revisits on-policy…
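For readers unfamiliar with the baseline the abstract criticizes, the following is a minimal sketch of token-level KL matching toward a context-augmented teacher. It does not show PBSD itself, whose objective is not given in the text above. The function names, the toy logits, and the choice of the reverse direction KL(student || teacher) are all assumptions made for illustration; in self-distillation the "teacher" logits would come from the same model run with an augmented prompt.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position (direction is an assumption)."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_kl_loss(student_seq, teacher_seq):
    """Mean per-token KL over a sampled sequence, plus the per-token values."""
    per_token = [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token), per_token


# Toy example: two token positions, vocabulary of size three (hypothetical values).
student_seq = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher_seq = [[2.5, 0.0, -1.0], [0.1, 0.2, 0.3]]  # same model, augmented prompt
loss, per_token = sequence_kl_loss(student_seq, teacher_seq)
```

Every token position contributes its own term, which is the dense token-level signal the abstract refers to. The second position has identical student and teacher logits, so its KL term is zero and only the first position contributes to the loss.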
