Large scale distributed neural network training through online distillation
Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, Geoffrey E. Hinton

TL;DR
This paper introduces online distillation, a variant of distillation for large-scale neural network training that uses extra parallelism to speed up training and makes model predictions more reproducible, without a complicated multi-stage setup or many new hyperparameters.
Contribution
The paper proposes online distillation as an easy-to-implement technique that enhances training efficiency and reproducibility in large neural network models.
Findings
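Uses extra parallelism to fit very large datasets about twice as fast.
Still speeds up training past the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent.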
Improves reproducibility of model predictions.
Abstract
Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge…
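The abstract is truncated before it describes the training objective, so the following is only a minimal sketch of one common way to set up a distillation-style loss between two peer networks, not the authors' exact formulation. It assumes each model adds, to its usual cross-entropy on the label, a term that pulls its predictions toward a peer's predictions on the same example. The names `online_distillation_loss` and `distill_weight` are hypothetical and chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_total = math.log(sum(math.exp(z - m) for z in logits))
    return [z - m - log_total for z in logits]


def softmax(logits):
    return [math.exp(lp) for lp in log_softmax(logits)]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target_probs, log_probs))


def online_distillation_loss(logits, label, peer_logits, distill_weight=1.0):
    """Loss for one model on one example (illustrative assumption).

    hard term: cross-entropy against the true label
    soft term: cross-entropy against the peer's predicted distribution,
               which is treated as a fixed target (no gradient through it)
    """
    one_hot = [1.0 if k == label else 0.0 for k in range(len(logits))]
    hard_loss = cross_entropy(one_hot, logits)
    peer_probs = softmax(peer_logits)
    soft_loss = cross_entropy(peer_probs, logits)
    return hard_loss + distill_weight * soft_loss
```

In this sketch, setting `distill_weight=0.0` recovers ordinary supervised training, and the soft term is smallest when the model's predicted distribution matches its peer's. In a two-model setup each network would evaluate this loss on its own subset of the data, using the other network's logits as `peer_logits`.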
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Adversarial Robustness in Machine Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
