Stable On-Policy Distillation through Adaptive Target Reformulation
Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, Taesup Kim

TL;DR
This paper introduces Veto, an objective-level reformulation for on-policy knowledge distillation that builds an intermediate target between the teacher and student distributions in logit space, with the aim of stabilizing training and improving performance.
Contribution
Veto provides an adaptive, geometric target reformulation in logit space, balancing stability and diversity in on-policy distillation.
Findings
Veto outperforms supervised fine-tuning on multiple tasks.
Veto stabilizes training by suppressing harmful gradients.
Veto balances reward and diversity effectively.
Abstract
Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from large language models to smaller student models; however, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities because the distributional gap between the novice student and the expert teacher is often too wide to bridge directly. These challenges manifest as pathological gradients in forward KL objectives or diversity collapse in reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric bridge in the logit space. Unlike prior methods that mix data samples, Veto creates an intermediate target distribution that promotes…
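The abstract is cut off before it defines the intermediate target, so the following is only an illustrative sketch of what a "geometric bridge in the logit space" could look like, not the authors' formulation. It assumes the target is the softmax of a linear interpolation between teacher and student logits, which is equivalent to the normalized geometric mean of the two distributions. The function names and the mixing weight `beta` are hypothetical, and a fixed `beta` stands in for whatever adaptive rule the paper uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def geometric_bridge(teacher_logits, student_logits, beta):
    """Hypothetical intermediate target (an assumption, not the paper's definition).

    Linear interpolation of logits followed by softmax equals the normalized
    geometric mean p_teacher**beta * p_student**(1 - beta).
    beta = 1 recovers the teacher, beta = 0 recovers the student.
    """
    mixed = [beta * t + (1.0 - beta) * s
             for t, s in zip(teacher_logits, student_logits)]
    return softmax(mixed)


# Toy next-token logits over a 4-token vocabulary (made-up numbers).
teacher_logits = [6.0, 1.0, -4.0, -4.0]
student_logits = [-2.0, 0.5, 2.0, 1.5]

p_teacher = softmax(teacher_logits)
p_student = softmax(student_logits)
p_bridge = geometric_bridge(teacher_logits, student_logits, beta=0.5)

# Forward KL uses the target as the first argument; reverse KL uses the student.
forward_kl_teacher = kl(p_teacher, p_student)
forward_kl_bridge = kl(p_bridge, p_student)
reverse_kl_teacher = kl(p_student, p_teacher)
reverse_kl_bridge = kl(p_student, p_bridge)

print("forward KL to teacher vs bridge:", forward_kl_teacher, forward_kl_bridge)
print("reverse KL to teacher vs bridge:", reverse_kl_teacher, reverse_kl_bridge)
```

Under this assumed interpolation, both divergences from the student to the bridge are smaller than those to the teacher, which is one way an intermediate target could soften the gap the abstract describes. How Veto actually chooses or adapts the target is not recoverable from the truncated abstract.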
