Neighbourhood Distillation: On the benefits of non end-to-end distillation
Laëtitia Shao, Max Moroz, Elad Eban, Yair Movshovitz-Attias

TL;DR
This paper introduces a non end-to-end approach to knowledge distillation by splitting neural networks into smaller sub-networks, which improves training efficiency, reusability, and simplicity, especially for large models.
Contribution
It proposes Neighbourhood Distillation, a method that breaks away from end-to-end training: the network is split into neighbourhoods that are distilled independently, which enables parallel training, reuse of trained neighbourhoods, and easier training with synthetic data.
Findings
Speeds up knowledge distillation through parallelism.
Facilitates neural architecture search by reusing neighborhoods.
Makes it easier to train smaller networks with synthetic data.
Abstract
End-to-end training with back-propagation is the standard method for training deep neural networks. However, as networks become deeper and bigger, end-to-end training becomes more challenging: highly non-convex models get stuck easily in local optima, gradient signals are prone to vanish or explode during back-propagation, and training requires computational resources and time. In this work, we propose to break away from the end-to-end paradigm in the context of Knowledge Distillation. Instead of distilling a model end-to-end, we propose to split it into smaller sub-networks, also called neighbourhoods, that are then trained independently. We empirically show that distilling networks in a non end-to-end fashion can be beneficial in a diverse range of use cases. First, we show that it speeds up Knowledge Distillation by exploiting parallelism and training on smaller networks. Second, we…
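The idea of distilling each neighbourhood independently can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is a toy built on stated assumptions: the teacher is a hypothetical stack of dense tanh blocks, each block is treated as one neighbourhood, each student neighbourhood is a low-rank dense block, and each student is fitted with plain gradient descent on a mean-squared error between its output and the teacher block's output, using the teacher's own activations as input. All function names, sizes, and hyperparameters are illustrative.

```python
import numpy as np

def make_teacher(rng, width, depth):
    # Hypothetical teacher: `depth` dense blocks, each computing tanh(x @ w).
    return [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
            for _ in range(depth)]

def teacher_activations(weights, x):
    # acts[i] is the input of block i; acts[i + 1] is its output.
    acts = [x]
    for w in weights:
        acts.append(np.tanh(acts[-1] @ w))
    return acts

def distil_neighbourhood(x_in, y_target, rank, seed, steps=300, lr=0.5):
    # Student neighbourhood (assumed form): low-rank block tanh(x @ a @ b).
    # It only sees the teacher's input/output for this block, so it does not
    # depend on any other student neighbourhood.
    rng = np.random.default_rng(seed)
    width = x_in.shape[1]
    a = rng.normal(scale=1.0 / np.sqrt(width), size=(width, rank))
    b = rng.normal(scale=1.0 / np.sqrt(rank), size=(rank, width))
    losses = []
    for _ in range(steps):
        h = x_in @ a
        pred = np.tanh(h @ b)
        err = pred - y_target
        losses.append(float(np.mean(err ** 2)))
        # Gradient of the mean-squared error through tanh.
        grad_pre = 2.0 * err * (1.0 - pred ** 2) / err.size
        grad_b = h.T @ grad_pre
        grad_a = x_in.T @ (grad_pre @ b.T)
        a -= lr * grad_a
        b -= lr * grad_b
    return (a, b), losses

def run_student(students, x):
    # Chain the independently trained neighbourhoods into one student network.
    for a, b in students:
        x = np.tanh(x @ a @ b)
    return x

rng = np.random.default_rng(0)
teacher = make_teacher(rng, width=16, depth=3)
x = rng.normal(size=(256, 16))  # synthetic Gaussian inputs, an illustrative choice
acts = teacher_activations(teacher, x)

students, loss_curves = [], []
for i in range(len(teacher)):
    # Each iteration is independent of the others and could run in parallel.
    params, losses = distil_neighbourhood(acts[i], acts[i + 1], rank=8, seed=i)
    students.append(params)
    loss_curves.append(losses)

end_to_end_mse = float(np.mean((run_student(students, x) - acts[-1]) ** 2))
print([round(c[-1], 4) for c in loss_curves], round(end_to_end_mse, 4))
```

Because every call to `distil_neighbourhood` reads only teacher activations, the loop body has no dependency between iterations. That independence is the property the abstract relies on when it attributes the speed-up to parallelism and to training on smaller networks. The final line also measures the error of the assembled student, since small per-neighbourhood errors can accumulate once the blocks are chained.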
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Knowledge Distillation
