KL for a KL: On-Policy Distillation with Control Variate Baseline

Minjae Oh; Sangjun Song; Gyubin Choi; Yunho Choi; Yohan Jo

arXiv:2605.07865·cs.LG·May 11, 2026

TL;DR

This paper introduces vOPD, a stabilized on-policy distillation method for large language models that reduces gradient variance using a control variate based on a closed-form value function, improving training stability and efficiency.

Contribution

vOPD applies RL-inspired variance reduction to on-policy distillation, using a closed-form value function and top-k approximation to enhance stability and efficiency.

Findings

01. vOPD outperforms vanilla OPD on reasoning benchmarks.

02. vOPD matches full-vocabulary baselines with lower computational cost.

03. A top-k baseline approximation further reduces cost without performance loss.

Abstract

On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially in reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline, canonically a value function, from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding…
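
The baseline described in the abstract can be illustrated with a minimal sketch for a single token position. It rests on assumptions the abstract does not spell out: the per-token reward is taken to be the teacher-minus-student log-probability of the sampled token, and the top-k variant simply truncates the KL sum to the student's k most likely tokens. The function names (`log_softmax`, `reverse_kl`, `vopd_advantage`) are hypothetical and are not taken from the paper's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logp, teacher_logp):
    """KL(student || teacher) at one position, summed over the vocabulary."""
    return sum(math.exp(s) * (s - t) for s, t in zip(student_logp, teacher_logp))


def vopd_advantage(student_logits, teacher_logits, sampled_token, top_k=None):
    """Baseline-subtracted per-token signal (illustrative sketch only).

    Assumed reward: log p_teacher(y) - log p_student(y) for the sampled token y.
    Baseline: the negative reverse KL, which equals the expectation of that
    reward under the student distribution.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    reward = t[sampled_token] - s[sampled_token]
    if top_k is None:
        baseline = -reverse_kl(s, t)
    else:
        # Assumed top-k approximation: keep only the student's k most
        # probable tokens in the KL sum.
        idx = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:top_k]
        baseline = -sum(math.exp(s[i]) * (s[i] - t[i]) for i in idx)
    return reward - baseline
```

Under these assumptions, the full-vocabulary baseline makes the advantage zero-mean under the student's own sampling distribution. Subtracting it therefore leaves the expected gradient unchanged while shrinking the magnitude of the single-sample signal, which is the usual control-variate argument. Both log-probability vectors come from forward passes that OPD already performs, so no separate critic is needed.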
