KL for a KL: On-Policy Distillation with Control Variate Baseline
Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo

TL;DR
This paper introduces vOPD, a stabilized on-policy distillation method for large language models that reduces gradient variance using a control variate based on a closed-form value function, improving training stability and efficiency.
Contribution
vOPD applies RL-inspired variance reduction to on-policy distillation, using a closed-form value function and top-k approximation to enhance stability and efficiency.
Findings
vOPD outperforms vanilla OPD on reasoning benchmarks.
vOPD matches full-vocabulary baselines with lower computational cost.
A top-k baseline approximation further reduces cost without performance loss.
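One way to picture the top-k baseline approximation is as a reverse KL sum truncated to the student's k most likely tokens at each position. The sketch below is illustrative only: the function name `topk_reverse_kl`, the choice to rank tokens by student probability, and the decision not to renormalize over the truncated set are all assumptions, not details confirmed by the paper.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def topk_reverse_kl(student_logits, teacher_logits, k):
    """Truncated per-token KL(student || teacher).

    Assumption: the sum runs over the student's k most likely tokens
    and is not renormalized. Inputs have shape (positions, vocab).
    """
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    # Indices of the k largest student log-probabilities per position.
    top_idx = np.argpartition(-log_p_s, k - 1, axis=-1)[..., :k]
    s = np.take_along_axis(log_p_s, top_idx, axis=-1)
    t = np.take_along_axis(log_p_t, top_idx, axis=-1)
    return (np.exp(s) * (s - t)).sum(axis=-1)
```

With k equal to the vocabulary size the truncated sum reduces to the full per-token reverse KL, so k acts as a knob trading cost against fidelity of the baseline.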
Abstract
On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline, canonically a value function, from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding…
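The control variate idea can be sketched in a few lines. Assume, as a simplification not confirmed by the abstract, a one-step per-token reward equal to the teacher log-probability minus the student log-probability of the sampled token. Its expectation under the student is then exactly the negative reverse KL, which serves as the baseline. The function name `opd_advantage` and the array shapes are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def opd_advantage(student_logits, teacher_logits, tokens):
    """Baseline-subtracted per-token signal (illustrative sketch).

    student_logits, teacher_logits: shape (positions, vocab)
    tokens: integer array of shape (positions,), sampled from the student
    """
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    idx = tokens[:, None]
    # Assumed reward: log p_teacher(y_t) - log p_student(y_t).
    reward = (np.take_along_axis(log_p_t, idx, axis=-1)
              - np.take_along_axis(log_p_s, idx, axis=-1))[:, 0]
    # Closed-form baseline: negative reverse KL(student || teacher),
    # computed from the same logits with no extra critic.
    reverse_kl = (np.exp(log_p_s) * (log_p_s - log_p_t)).sum(axis=-1)
    baseline = -reverse_kl
    return reward - baseline, reward, baseline
```

Under this assumed reward, the advantage averages to zero over tokens drawn from the student, so subtracting the baseline leaves the expected gradient unchanged while removing the part of the signal that does not depend on the sampled token.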
