PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

Yuanda Xu; Hejian Sang; Zhengze Zhou; Ran He; Zhipeng Wang

arXiv:2603.11178·cs.AI·April 13, 2026

TL;DR

PACED improves large language model distillation by focusing training on problems where the student is in the zone of proximal development, leading to state-of-the-art results and reduced forgetting.

Contribution

It introduces a pass-rate-based weighting scheme, $w(p) = p(1 - p)$, that concentrates training on problems the student solves only some of the time. The scheme needs only student rollouts: no architectural changes and no hyperparameters.

Findings

01. PACED outperforms unweighted distillation by up to 8.2 points.

02. It reduces forgetting to 1.4% in distillation.

03. A two-stage KL schedule further improves results by 5.8 points.

Abstract

Standard LLM distillation treats all training problems equally -- wasting compute on problems the student has already mastered or cannot yet solve. We empirically show that this inefficiency has a precise gradient-level signature: the cross-problem gradient signal-to-noise ratio (SNR) follows a bell curve over student pass rate, collapsing at both extremes. We propose PACED, which weights each problem by $w(p) = p(1 - p)$ where $p$ is the student's empirical pass rate -- concentrating training on the zone of proximal development. This requires only student rollouts, no architectural changes, and no hyperparameters. We prove the Beta kernel $w(p) = p^{\alpha}(1 - p)^{\beta}$ is the leading-order optimal weight family arising from the SNR boundary-collapse structure, and is minimax-robust under misspecification (worst-case efficiency loss $O(\delta^{2})$). Across Qwen3, Qwen2.5, and Llama-3…
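
The weighting rule in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`pass_rate`, `beta_kernel_weight`, `weighted_loss`) are hypothetical, and normalizing by the sum of weights is an assumption made here for illustration, since the abstract does not say how the weights enter the loss. Only the formulas $w(p) = p(1 - p)$ and $w(p) = p^{\alpha}(1 - p)^{\beta}$ come from the source.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Empirical pass rate: fraction of student rollouts that solved the problem."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return sum(outcomes) / len(outcomes)


def beta_kernel_weight(p: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Beta kernel w(p) = p^alpha * (1 - p)^beta.

    With alpha = beta = 1 this is the PACED weight p * (1 - p), which is
    zero at p = 0 and p = 1 and largest at p = 0.5.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("pass rate must lie in [0, 1]")
    return (p ** alpha) * ((1.0 - p) ** beta)


def weighted_loss(losses: Sequence[float], pass_rates: Sequence[float]) -> float:
    """Combine per-problem losses using pass-rate weights.

    Assumption (not from the source): weights are normalized to sum to 1.
    """
    weights = [beta_kernel_weight(p) for p in pass_rates]
    total = sum(weights)
    if total == 0.0:
        # Every problem is either fully mastered or never solved.
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total


if __name__ == "__main__":
    # Four rollouts on one problem, two of them correct.
    p = pass_rate([True, False, True, False])
    print(p, beta_kernel_weight(p))
```

Under this sketch, a problem the student always solves or never solves gets weight zero, so only problems with intermediate pass rates contribute to the combined loss.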
