Distilling Double Descent

Andrew Cotter; Aditya Krishna Menon; Harikrishna Narasimhan; Ankit Singh Rawat; Sashank J. Reddi; Yichen Zhou

arXiv:2102.06849·cs.LG·February 16, 2021

Open Access

TL;DR

This paper shows that a large, overparameterized teacher can supply hard labels for a very large held-out unlabeled dataset, and that a student trained on those labels can outperform more traditional distillation approaches. The explanation offered is based on the double descent phenomenon.

Contribution

It introduces a novel approach to distillation that exploits double descent, showing that overparameterized teachers and large unlabeled datasets improve student model performance.

Findings

1. Overparameterized teachers avoid overfitting through double descent.

2. Students trained on large unlabeled datasets labeled by teachers outperform traditional methods.

3. Large datasets and overparameterization enhance generalization in distillation.

Abstract

Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that the student is provided with soft labels, e.g. probabilities or confidences, from the teacher model. In this work, we show that, even when the teacher model is highly overparameterized and provides hard labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more "traditional" approaches. Our explanation for this phenomenon is based on recent work on "double descent". It has been observed that, once a model's complexity roughly exceeds the amount required to memorize the training data, increasing the complexity further can,…
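The pipeline described in the abstract (train a teacher on labeled data, have it assign hard labels to a large unlabeled set, then train a student on those labels) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's experimental setup: the synthetic two-blob data, the 1-nearest-neighbor teacher (used here only as a simple stand-in for a model that memorizes its training set), the logistic-regression student, and all function names are hypothetical choices made for this example.

```python
import numpy as np


def make_data(n, rng):
    """Synthetic binary task (assumption): two unit-variance Gaussian blobs."""
    y = rng.integers(0, 2, size=n)
    centers = (2.0 * y - 1.0)[:, None]  # class 0 near (-1, -1), class 1 near (+1, +1)
    x = rng.normal(loc=centers, scale=1.0, size=(n, 2))
    return x, y


def teacher_predict(x_train, y_train, x):
    """1-nearest-neighbor teacher: fits its training set exactly.

    Stand-in for an overparameterized, memorizing teacher (assumption).
    Returns hard labels (class indices), not probabilities.
    """
    sq_dist = ((x[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[sq_dist.argmin(axis=1)]


def add_bias(x):
    return np.hstack([x, np.ones((len(x), 1))])


def fit_student(x, y, steps=500, lr=0.1):
    """Small student: logistic regression trained by plain gradient descent."""
    xb = add_bias(x)
    w = np.zeros(xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))
        w -= lr * (xb.T @ (p - y)) / len(xb)
    return w


def student_predict(w, x):
    return (add_bias(x) @ w > 0).astype(int)


def distill_with_hard_labels(seed=0, n_labeled=200, n_unlabeled=5000, n_test=2000):
    rng = np.random.default_rng(seed)
    x_lab, y_lab = make_data(n_labeled, rng)   # small labeled set for the teacher
    x_unl, _ = make_data(n_unlabeled, rng)     # large unlabeled set (true labels discarded)
    x_test, y_test = make_data(n_test, rng)

    # Step 1: the teacher labels the unlabeled pool with hard labels.
    hard_labels = teacher_predict(x_lab, y_lab, x_unl)

    # Step 2: the student is trained only on teacher-labeled data.
    w = fit_student(x_unl, hard_labels)

    return {
        "hard_labels": hard_labels,
        "teacher_train_acc": float((teacher_predict(x_lab, y_lab, x_lab) == y_lab).mean()),
        "teacher_test_acc": float((teacher_predict(x_lab, y_lab, x_test) == y_test).mean()),
        "student_test_acc": float((student_predict(w, x_test) == y_test).mean()),
    }


if __name__ == "__main__":
    for name, value in distill_with_hard_labels().items():
        if name != "hard_labels":
            print(f"{name}: {value:.3f}")
```

The student never sees a ground-truth label or a teacher probability; it sees only the teacher's class decisions on the unlabeled pool. Whether the student matches or beats the teacher in this toy setting depends on the data and the seed, so the sketch should not be read as reproducing the paper's results.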

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification