Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning
Zeyuan Allen-Zhu, Yuanzhi Li

TL;DR
This paper develops a theory explaining how ensembles of independently trained neural networks improve test accuracy when data has a "multi-view" structure, and how that improvement can be distilled into a single model through knowledge distillation.
Contribution
It introduces a theoretical framework showing that ensembles and knowledge distillation in deep learning work very differently from traditional learning theory (such as boosting or neural tangent kernels), particularly when data has a multi-view structure.
Findings
An ensemble of independently trained neural networks can provably improve test accuracy when data has a multi-view structure.
Knowledge distillation can transfer the ensemble's superior test accuracy into a single model.
Self-distillation can be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
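As a rough illustration of the distillation finding above, the sketch below computes a soft-label distillation loss between a teacher (for example, an ensemble) and a student. It is a minimal NumPy sketch under stated assumptions, not the paper's training procedure: the function names, the temperature parameter, and the use of cross-entropy on softened outputs are illustrative choices.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis, with a temperature (assumed knob)."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean cross-entropy between softened teacher and student distributions.

    Hypothetical helper: the teacher's soft outputs act as training targets
    for the student, in place of (or alongside) the hard labels.
    """
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# Illustrative logits only; not data from the paper.
teacher = np.array([[2.0, 1.0, 0.1]])
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss(np.array([[0.1, 1.0, 2.0]]), teacher)
```

The loss is smallest when the student reproduces the teacher's softened distribution, so `matched` is lower than `mismatched` here.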
Abstract
We formally study how ensemble of deep learning models can improve test accuracy, and how the superior performance of ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the SAME architecture, trained using the SAME algorithm on the SAME data set, and they only differ by the random seeds used in the initialization. We show that ensemble/knowledge distillation in Deep Learning works very differently from traditional learning theory (such as boosting or NTKs, neural tangent kernels). To properly understand them, we develop a theory showing that when data has a structure we refer to as "multi-view", then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can…
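The ensemble described in the abstract is simply an average of the outputs of a few networks that differ only in their random seeds. The snippet below sketches that averaging step in NumPy; the function names and the example outputs are hypothetical, and the per-model outputs are assumed to be already computed.

```python
import numpy as np

def ensemble_outputs(per_model_outputs):
    """Average outputs across models.

    Expects an array-like of shape (num_models, num_examples, num_classes).
    """
    stacked = np.asarray(per_model_outputs, dtype=float)
    return stacked.mean(axis=0)

def ensemble_predict(per_model_outputs):
    """Predicted class per example, taken from the averaged outputs."""
    return ensemble_outputs(per_model_outputs).argmax(axis=-1)

# Made-up outputs of three models on one example (illustrative only).
outputs = [
    [[2.0, 0.0, 1.0]],  # model with seed A prefers class 0
    [[0.0, 0.5, 1.0]],  # model with seed B prefers class 2
    [[0.0, 2.0, 1.2]],  # model with seed C prefers class 1
]
prediction = ensemble_predict(outputs)
```

In this toy case the individual models disagree, while the averaged outputs favor class 2, the class every model scores reasonably well.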
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Knowledge Distillation
