Towards Understanding Knowledge Distillation

Mary Phuong; Christoph H. Lampert

arXiv:2105.13093·cs.LG·May 28, 2021·133 cites

TL;DR

This paper provides the first theoretical insights into knowledge distillation, showing how data geometry, optimization bias, and monotonicity govern the convergence and success of distillation in linear and deep linear classifiers.

Contribution

It offers the first theoretical analysis of knowledge distillation, deriving a generalization bound and identifying key factors affecting its effectiveness.

Findings

01 Data geometry influences convergence speed.

02 Gradient descent finds favorable minima in distillation.

03 Expected risk decreases with larger training sets.

Abstract

Knowledge distillation, i.e., one classifier being trained on the outputs of another classifier, is an empirically very successful technique for knowledge transfer between classifiers. It has even been observed that classifiers learn much faster and more reliably if trained with the outputs of another classifier as soft labels, instead of from ground truth data. So far, however, there is no satisfactory theoretical explanation of this phenomenon. In this work, we provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers. Specifically, we prove a generalization bound that establishes fast convergence of the expected risk of a distillation-trained linear classifier. From the bound and its proof we extract three key factors that determine the success of distillation:

* data geometry -- geometric properties of…
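The linear setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' code or experimental setup: the binary logistic model, the dimensions, the learning rate, the step count, and all names (`distill_linear`, `w_teacher`, `w_student`, `w_proj`) are assumptions chosen for illustration. A fixed linear teacher produces soft labels, and a linear student is trained on them by plain gradient descent from a zero initialization.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def distill_linear(X, soft_labels, lr=0.5, steps=20000):
    """Train a linear student on teacher soft labels.

    Minimizes the mean cross-entropy between sigmoid(X @ w) and the
    soft labels with full-batch gradient descent, starting from w = 0.
    lr and steps are illustrative choices, not values from the paper.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - soft_labels) / n  # gradient of mean cross-entropy
        w -= lr * grad
    return w


# Hypothetical toy data: fewer samples than dimensions (n < d).
rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.normal(size=(n, d))

# Hypothetical teacher: a fixed linear classifier.
w_teacher = rng.normal(size=d) / np.sqrt(d)
soft_labels = sigmoid(X @ w_teacher)

w_student = distill_linear(X, soft_labels)

# Projection of the teacher's weights onto the span of the training inputs.
w_proj = np.linalg.pinv(X) @ (X @ w_teacher)

print("max soft-label gap:", np.max(np.abs(sigmoid(X @ w_student) - soft_labels)))
print("distance to projected teacher:", np.linalg.norm(w_student - w_proj))
```

In this sketch every gradient is a linear combination of the training inputs, so a student started at zero never leaves their span. With `n < d` it therefore matches the teacher's soft labels on the training set while recovering only the component of `w_teacher` that lies in that span, not `w_teacher` itself.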

Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Machine Learning and ELM