Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin

TL;DR
This paper introduces a generalized on-policy distillation framework with reward extrapolation. Theoretical analysis and experiments on math reasoning and code generation show improved student performance, including students that surpass their teachers.
Contribution
It extends standard on-policy distillation by incorporating a flexible reference model and reward scaling, enabling reward extrapolation and improved student performance.
Findings
Reward extrapolation (ExOPD) improves distillation outcomes.
ExOPD enables students to surpass teacher performance.
Reward correction with a reference model enhances distillation accuracy.
Abstract
On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting…
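The equivalence stated in the abstract can be checked numerically on a single next-token distribution. The sketch below is illustrative only and is not the authors' implementation. It assumes the dense reward is the log-ratio log teacher minus log reference, which is one form consistent with the abstract's claim. The function names and the `scale` parameter are hypothetical. With `scale = 1` the objective reduces to the negative reverse KL between student and teacher for any reference. With `scale > 1` the closed-form maximizer moves past the teacher, away from the reference.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def g_opd_objective(student, teacher, ref, scale=1.0):
    """scale * E_student[reward] - KL(student || ref).

    Assumption: reward = log teacher - log ref (not confirmed by the source).
    """
    student, teacher, ref = (
        np.asarray(a, dtype=float) for a in (student, teacher, ref)
    )
    reward = np.log(teacher) - np.log(ref)
    return scale * float(np.sum(student * reward)) - kl(student, ref)

def closed_form_maximizer(teacher, ref, scale=1.0):
    """Maximizer of the objective above: proportional to ref * exp(scale * reward),
    i.e. teacher**scale * ref**(1 - scale), renormalized."""
    teacher, ref = np.asarray(teacher, dtype=float), np.asarray(ref, dtype=float)
    logits = scale * np.log(teacher) + (1.0 - scale) * np.log(ref)
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    return weights / weights.sum()

if __name__ == "__main__":
    student = [0.5, 0.3, 0.2]
    teacher = [0.7, 0.2, 0.1]
    for ref in ([0.4, 0.4, 0.2], [0.2, 0.3, 0.5]):
        # With scale = 1 the reference cancels out of the objective.
        print(g_opd_objective(student, teacher, ref, scale=1.0), -kl(student, teacher))
        # With scale > 1 the optimum depends on the reference.
        print(closed_form_maximizer(teacher, ref, scale=2.0))
```

In this toy setting, the choice of reference matters only once the scaling factor differs from 1, which matches the abstract's description of standard OPD as the equally weighted special case.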
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Domain Adaptation and Few-Shot Learning · Online Learning and Analytics
