PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang

TL;DR
PACED improves large language model distillation by concentrating training on problems within the student's zone of proximal development, leading to state-of-the-art results and reduced forgetting.
Contribution
PACED introduces a weighting scheme based on the student's pass rate that improves training efficiency without architectural changes or hyperparameters.
Findings
PACED outperforms unweighted distillation by up to 8.2 points.
It reduces forgetting to 1.4% in distillation.
A two-stage KL schedule further improves results by 5.8 points.
Abstract
Standard LLM distillation treats all training problems equally, wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by a Beta-kernel function of p, where p is the student's empirical pass rate, concentrating training on the zone of proximal development. This requires only student rollouts, no architectural changes, and no hyperparameters. We prove the Beta kernel is the leading-order optimal weight family arising from the SNR boundary-collapse structure, and that it is minimax-robust under misspecification with respect to worst-case efficiency loss. Across Qwen3, Qwen2.5, and Llama-3…
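The pass-rate weighting described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`pass_rate`, `beta_kernel_weight`, `problem_weights`), the exponents a = b = 1, and the normalization of weights to sum to one are all assumptions made here for the example. The abstract states only that the weight is a Beta kernel of the student's empirical pass rate.

```python
from typing import List, Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Empirical pass rate: fraction of student rollouts that solved the problem."""
    if not outcomes:
        raise ValueError("need at least one rollout per problem")
    return sum(outcomes) / len(outcomes)


def beta_kernel_weight(p: float, a: float = 1.0, b: float = 1.0) -> float:
    """Unnormalized Beta kernel p**a * (1 - p)**b.

    The exponents a = b = 1 are an illustrative assumption, not values
    taken from the paper. The kernel vanishes at p = 0 and p = 1.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    return (p ** a) * ((1.0 - p) ** b)


def problem_weights(rollouts: Sequence[Sequence[bool]]) -> List[float]:
    """Per-problem weights, normalized to sum to 1 (normalization is an assumption)."""
    raw = [beta_kernel_weight(pass_rate(r)) for r in rollouts]
    total = sum(raw)
    if total == 0.0:
        # Every problem is either fully solved or fully failed: no signal to weight.
        return [0.0] * len(raw)
    return [w / total for w in raw]


# Hypothetical rollouts: 8 student attempts on each of 4 problems.
rollouts = [
    [True] * 8,                 # mastered        (p = 1.0)
    [True] * 4 + [False] * 4,   # frontier        (p = 0.5)
    [False] * 8,                # out of reach    (p = 0.0)
    [True] * 2 + [False] * 6,   # hard but doable (p = 0.25)
]
weights = problem_weights(rollouts)
```

Under these assumptions, mastered and unsolvable problems receive zero weight, and the problem with pass rate 0.5 receives the largest weight, which matches the bell-shaped SNR profile the abstract describes.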
