Distilling Double Descent
Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou

TL;DR
This paper shows that a highly overparameterized teacher that provides only hard labels for a very large held-out unlabeled dataset can yield a student model that outperforms more traditional approaches, and it explains the effect through the double descent phenomenon.
Contribution
It introduces a novel approach to distillation that exploits double descent, showing that overparameterized teachers and large unlabeled datasets improve student model performance.
Findings
Overparameterized teachers avoid overfitting through double descent.
Students trained on large unlabeled datasets labeled by teachers outperform traditional methods.
Large datasets and overparameterization enhance generalization in distillation.
Abstract
Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that the student is provided with soft labels, e.g., probabilities or confidences, from the teacher model. In this work, we show that, even when the teacher model is highly overparameterized and provides hard labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more "traditional" approaches. Our explanation for this phenomenon is based on recent work on "double descent". It has been observed that, once a model's complexity roughly exceeds the amount required to memorize the training data, increasing the complexity further can,…
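The pipeline described in the abstract (a teacher fit on a small labeled set, hard labels assigned to a large unlabeled set, a student trained on those labels) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' experimental setup: the synthetic data, the 1-nearest-neighbor teacher (a stand-in for a model that memorizes its training data), the logistic-regression student, and every function name are hypothetical choices made for this example. It shows only the hard-label data flow and does not reproduce a double descent curve.

```python
import numpy as np


def make_data(rng, n, noise=0.0):
    """Toy 2-D binary task (assumption): label is 1 when x0 + x1 > 0."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    flip = rng.random(n) < noise  # optionally corrupt a fraction of labels
    return X, np.where(flip, 1 - y, y)


def teacher_predict(X_train, y_train, X):
    """Stand-in teacher: 1-nearest-neighbor, which fits its training set exactly.

    It returns hard labels (0 or 1), never probabilities.
    """
    sq_dist = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[sq_dist.argmin(axis=1)]


def fit_student(X, y, steps=500, lr=0.5):
    """Stand-in student: logistic regression fit by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def student_predict(params, X):
    w, b = params
    return (X @ w + b > 0).astype(int)


def distill_hard_labels(seed=0, n_labeled=200, n_unlabeled=5000, n_test=2000):
    rng = np.random.default_rng(seed)
    X_l, y_l = make_data(rng, n_labeled, noise=0.1)  # small, noisy labeled set
    X_u, _ = make_data(rng, n_unlabeled)             # large set, labels discarded
    X_t, y_t = make_data(rng, n_test)                # clean test set

    hard_labels = teacher_predict(X_l, y_l, X_u)     # teacher labels the unlabeled data
    student = fit_student(X_u, hard_labels)          # student sees only hard labels

    return {
        "hard_labels": hard_labels,
        "teacher_train_acc": float(np.mean(teacher_predict(X_l, y_l, X_l) == y_l)),
        "teacher_test_acc": float(np.mean(teacher_predict(X_l, y_l, X_t) == y_t)),
        "student_test_acc": float(np.mean(student_predict(student, X_t) == y_t)),
    }


if __name__ == "__main__":
    for name, value in distill_hard_labels().items():
        if name != "hard_labels":
            print(name, round(value, 3))
```

Calling `distill_hard_labels()` returns the teacher's accuracy on its own training set alongside the teacher and student accuracies on a clean test set, so the two models can be compared directly. The sizes `n_labeled` and `n_unlabeled` are arbitrary and can be varied to see how the amount of unlabeled data affects the student in this toy setting.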
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
