Large scale distributed neural network training through online distillation

Rohan Anil; Gabriel Pereyra; Alexandre Passos; Robert Ormandi; George E. Dahl; Geoffrey E. Hinton

arXiv:1804.03235 · cs.LG · August 24, 2020 · 152 cites

TL;DR

This paper explores online distillation, a variant of distillation that needs neither a complicated multi-stage setup nor many new hyperparameters. It uses extra parallelism to fit very large datasets about twice as fast and improves the reproducibility of model predictions.

Contribution

The paper proposes online distillation as an easy-to-implement technique that enhances training efficiency and reproducibility in large neural network models.

Findings

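01. Uses extra parallelism to fit very large datasets about twice as fast.

02. Still speeds up training past the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent.

03. Improves reproducibility of model predictions.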

Abstract

Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge…
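The idea of two networks sharing knowledge during training can be illustrated with a small loss function. The following is a minimal sketch, not the authors' implementation: it assumes the shared signal is the peer model's predicted class distribution, that agreement is measured by cross-entropy against that distribution, and that the two terms are combined with a hypothetical `distill_weight`. All function and parameter names are illustrative.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy H(target, predicted); eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


def codistillation_loss(own_logits, peer_logits, label, distill_weight=1.0):
    """Hard-label loss plus a term pulling this model toward its peer.

    Assumption: the peer's predictions are treated as a fixed target, so in a
    real training loop no gradient would flow back into the peer model.
    """
    own_probs = softmax(own_logits)
    peer_probs = softmax(peer_logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(own_logits))]
    hard_loss = cross_entropy(one_hot, own_probs)       # fit the true label
    agree_loss = cross_entropy(peer_probs, own_probs)   # agree with the peer
    return hard_loss + distill_weight * agree_loss


if __name__ == "__main__":
    own = [2.0, 0.0, 0.0]
    print(codistillation_loss(own, [2.0, 0.0, 0.0], label=0))  # peer agrees
    print(codistillation_loss(own, [0.0, 2.0, 0.0], label=0))  # peer disagrees
```

In this sketch the loss is lower when the peer agrees with the model than when it disagrees, and setting `distill_weight` to 0 recovers ordinary supervised training. Each model would compute this loss on its own subset of the data, using predictions from a copy of the other model.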

Taxonomy

Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Adversarial Robustness in Machine Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings