Progressive distillation induces an implicit curriculum
Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel

TL;DR
Progressive distillation creates an implicit curriculum that accelerates student learning by leveraging intermediate teacher checkpoints, leading to empirical and theoretical benefits across various tasks.
Contribution
This paper reveals the implicit curriculum mechanism in progressive distillation, demonstrating its benefits through theoretical analysis and empirical validation on multiple tasks.
Findings
Implicit curriculum accelerates learning in distillation.
Intermediate checkpoints provide unique training advantages.
Progressive distillation benefits extend to complex tasks like language modeling.
Abstract
Knowledge distillation leverages a teacher model to improve the training of a student model. A persistent challenge is that a better teacher does not always yield a better student, to which a common mitigation is to use additional supervision from several "intermediate" teachers. One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher. Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student's learning. This curriculum is available only through the intermediate checkpoints but not the final converged one, and imparts both empirical acceleration and a provable sample complexity benefit to the student. We then extend our investigation to Transformers trained on probabilistic context-free…
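To make the setup in the abstract concrete, here is a minimal sketch of progressive distillation on sparse parity, contrasted with one-shot distillation from the final checkpoint only. It is an illustration under stated assumptions, not the authors' implementation: the two-layer ReLU networks, the squared-error distillation loss on teacher outputs, the plain SGD updates, and every name and hyperparameter (`sparse_parity`, `train_teacher`, `distill`, widths, learning rate, step counts) are hypothetical choices made for brevity.

```python
import numpy as np


def sparse_parity(n_samples, dim, support, rng):
    """Sample x uniformly from {-1, +1}^dim; the label is the product of x over `support`."""
    x = rng.choice([-1.0, 1.0], size=(n_samples, dim))
    y = np.prod(x[:, support], axis=1)
    return x, y


def init_mlp(dim, width, rng):
    """Two-layer ReLU network; this parameterization is an illustrative assumption."""
    return {
        "W": rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, width)),
        "b": np.zeros(width),
        "a": rng.normal(0.0, 1.0 / np.sqrt(width), size=width),
    }


def forward(params, x):
    """Return hidden activations and scalar outputs for a batch x."""
    h = np.maximum(x @ params["W"] + params["b"], 0.0)
    return h, h @ params["a"]


def sgd_step(params, x, target, lr):
    """One gradient step on 0.5 * mean((output - target)^2); returns new parameters."""
    h, out = forward(params, x)
    err = (out - target) / len(x)                  # d(loss)/d(output)
    grad_h = np.outer(err, params["a"]) * (h > 0)  # backprop through the ReLU
    return {
        "W": params["W"] - lr * (x.T @ grad_h),
        "b": params["b"] - lr * grad_h.sum(axis=0),
        "a": params["a"] - lr * (h.T @ err),
    }


def train_teacher(dim, support, width, steps, save_every, batch, lr, rng):
    """Train on ground-truth labels, keeping intermediate checkpoints."""
    params = init_mlp(dim, width, rng)
    checkpoints = []
    for t in range(steps):
        x, y = sparse_parity(batch, dim, support, rng)
        params = sgd_step(params, x, y, lr)
        if (t + 1) % save_every == 0:
            checkpoints.append(params)  # safe: sgd_step never mutates in place
    return checkpoints


def distill(student, teachers, steps_per_teacher, dim, batch, lr, rng):
    """Fit the student to each teacher's outputs in turn (no ground-truth labels)."""
    for teacher in teachers:
        for _ in range(steps_per_teacher):
            x = rng.choice([-1.0, 1.0], size=(batch, dim))
            _, teacher_out = forward(teacher, x)
            student = sgd_step(student, x, teacher_out, lr)
    return student


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, support = 10, [0, 3, 7]
    ckpts = train_teacher(dim, support, width=64, steps=400,
                          save_every=100, batch=64, lr=0.02, rng=rng)
    start = init_mlp(dim, 8, rng)
    # Progressive: visit every checkpoint. One-shot: final checkpoint only, same step budget.
    progressive = distill(start, ckpts, 100, dim, 64, 0.02, rng)
    one_shot = distill(start, ckpts[-1:], 400, dim, 64, 0.02, rng)
    x, y = sparse_parity(2000, dim, support, rng)
    for name, model in [("progressive", progressive), ("one-shot", one_shot)]:
        print(name, "accuracy:", np.mean(np.sign(forward(model, x)[1]) == y))
```

Both students start from the same initialization and receive the same total number of updates, so the only difference is which teacher checkpoints supply the targets. At this toy scale the script is not expected to reproduce the paper's results; it only shows where the intermediate checkpoints enter the training loop.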
Peer Reviews
Decision: ICLR 2025 Oral
Originality - The paper presents a novel perspective on progressive distillation by identifying and formalizing the concept of an "implicit curriculum." While prior work has explored progressive distillation empirically, this study delves deeper into the underlying mechanisms and provides theoretical grounding for its efficacy. The connection between intermediate teacher checkpoints and an implicit curriculum is a fresh insight that contributes to a better understanding of knowledge distillation
Some possible improvements I can see are the following: a) more investigation of the impact of temperature on knowledge distillation, which seems to be missing from the main sections of the paper; b) analysis of how implicit curriculum learning varies across model layers, datasets, training objectives, etc.; c) exploring more tasks and architectures; d) exploring the interaction of optimization algorithms, batch size, etc. with curriculum learning.
1. The paper is well written, and despite the complexity of the narrative, it is generally easy to follow, and I enjoyed reading it. 2. Though only applied to a simple use case, the mathematical analysis does provide useful insight about the sample efficiency of progressive distillation. 3. The metrics selected in the analysis, such as $\mathcal{M}_{robust}$, are quite useful for understanding the feature learning aspect of the method. 4. The authors run experiments in various settings including three dat
Despite these strengths, I think the paper can be improved. 1. I understand the necessity of using a toy use case (sparse parity) to show rigorous mathematical analysis, but the experiments that follow could be more practical in order to provide stronger empirical evidence of the effectiveness of progressive distillation. - Instead of masked token prediction, the authors could run experiments on challenging NLP tasks such as QA, summarization and long-form generation. - They could also experiment with more recent LLMs - GPT-
* The paper is well-written and easy to read. * The paper includes results on tasks across different complexity levels, going from a toy setting of sparse parity to PCFGs and then to a non-synthetic task of natural language modeling. * The authors also run experiments across multiple model architectures, namely MLPs and transformers of different sizes. * The induced curriculum is discussed from a human interpretability point of view (i.e. showing the correlation between degree 1 monomials and the log
* There is a typo in Definition 4.3: I believe it should be "boundary of span(n^{(i)})" instead of "boundary of n^{(i)}". * Discussion of how the relative sizes of the teacher and student models were decided is missing. It would be interesting to see a study of how performance is affected w.r.t. the size of the student models. * Empirical analysis on tasks in the vision domain and with other model architectures such as CNNs and recurrent networks would strengthen the paper significantly.
