How to scale distributed deep learning?
Peter H. Jin, Qiaochu Yuan, Forrest Iandola, Kurt Keutzer

TL;DR
This paper compares synchronous and asynchronous distributed SGD for deep learning, introduces gossiping SGD as a hybrid approach, and analyzes their convergence and scalability on large-scale image classification tasks.
Contribution
The paper provides a comprehensive comparison of synchronous and asynchronous distributed SGD methods and proposes gossiping SGD, which aims to combine the advantages of both.
Findings
Asynchronous SGD converges faster at smaller node counts.
Synchronous SGD scales better at larger node counts.
Gossiping SGD offers a promising hybrid approach.
Abstract
Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS). To minimize training time, the training of a deep neural network must be scaled beyond a single machine to as many machines as possible by distributing the optimization method used for training. While a number of approaches have been proposed for distributed stochastic gradient descent (SGD), at the current time synchronous approaches to distributed SGD appear to be showing the greatest performance at large scale. Synchronous scaling of SGD suffers from the need to synchronize all processors on each gradient step and is not resilient in the face of failing or lagging processors. In asynchronous approaches using parameter servers, training is slowed…
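To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It simulates workers in a single process on a toy one-dimensional quadratic objective. The function names (`noisy_grad`, `sync_sgd`, `gossip_sgd`), the objective, and all hyperparameters are assumptions chosen for illustration. The synchronous variant averages every worker's gradient on each step (standing in for an all-reduce), while the gossip variant lets each worker take a local step and then average its parameter with one randomly chosen peer.

```python
import random


def noisy_grad(x, target=3.0, noise=0.1, rng=random):
    """Stochastic gradient of 0.5 * (x - target)**2 plus Gaussian noise (toy objective)."""
    return (x - target) + rng.gauss(0.0, noise)


def sync_sgd(num_workers=8, steps=200, lr=0.1, seed=0):
    """Synchronous SGD: all workers contribute to every update."""
    rng = random.Random(seed)
    x = 0.0  # one parameter value, replicated identically on every worker
    for _ in range(steps):
        # Every worker must finish its gradient before the step can proceed.
        grads = [noisy_grad(x, rng=rng) for _ in range(num_workers)]
        # Averaging over all workers stands in for an all-reduce.
        x -= lr * sum(grads) / num_workers
    return [x] * num_workers


def gossip_sgd(num_workers=8, steps=200, lr=0.1, seed=0):
    """Simplified gossip-style SGD: local step, then average with one random peer."""
    rng = random.Random(seed)
    xs = [0.0] * num_workers  # each worker keeps its own parameter copy
    for _ in range(steps):
        # Local gradient step on each worker, with no global barrier in principle.
        xs = [x - lr * noisy_grad(x, rng=rng) for x in xs]
        # Each worker pulls the parameter of one random peer and averages with it.
        snapshot = list(xs)
        for i in range(num_workers):
            j = rng.choice([k for k in range(num_workers) if k != i])
            xs[i] = 0.5 * (snapshot[i] + snapshot[j])
    return xs


if __name__ == "__main__":
    print("synchronous:", sync_sgd()[0])
    print("gossip     :", gossip_sgd())
```

In this sketch the gossip workers never hold exactly the same parameters, but pairwise averaging keeps them close to one another while each still moves toward the minimizer. A real system would run workers in separate processes and exchange parameters over a network, which this single-process simulation does not model.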
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Machine Learning and ELM
Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection
