How to scale distributed deep learning?

Peter H. Jin; Qiaochu Yuan; Forrest Iandola; Kurt Keutzer

arXiv:1611.04581 · cs.LG · November 15, 2016 · 53 cites

TL;DR

This paper compares synchronous and asynchronous distributed SGD for deep learning, introduces gossiping SGD as a hybrid approach, and analyzes their convergence and scalability on large-scale image classification tasks.

Contribution

The paper compares synchronous and asynchronous distributed SGD methods side by side and proposes gossiping SGD, which aims to combine the advantages of both.

Findings

01 Asynchronous SGD converges faster at fewer nodes.

02 Synchronous SGD scales better at larger node counts.

03 Gossiping SGD offers a promising hybrid approach.
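
To make the contrast between the synchronous and gossip-style updates concrete, here is a minimal single-process NumPy sketch. It is not the authors' implementation: the function names, the uniform random peer selection, the equal-weight averaging, and the toy quadratic objective are all assumptions chosen for illustration, and the workers are simulated as rows of one array rather than as separate machines.

```python
import numpy as np

def sync_sgd_step(params, grads, lr):
    """Synchronous step (illustrative): every worker applies the same
    gradient, averaged over all workers, as an all-reduce would provide.
    params, grads: arrays of shape (n_workers, dim)."""
    mean_grad = grads.mean(axis=0)      # global average across workers
    return params - lr * mean_grad      # broadcast to every worker's row

def gossip_sgd_step(params, grads, lr, rng):
    """Gossip-style step (illustrative): each worker takes a local SGD
    step, then averages its parameters with one randomly chosen peer.
    A worker may draw itself, in which case it keeps its local result."""
    n_workers = params.shape[0]
    local = params - lr * grads                          # local update only
    peers = rng.integers(0, n_workers, size=n_workers)   # one peer per worker
    return 0.5 * (local + local[peers])                  # pairwise average

# Toy usage: 8 simulated workers minimizing 0.5 * ||x||^2 (gradient is x).
rng = np.random.default_rng(0)
params = rng.normal(size=(8, 4))
for _ in range(200):
    grads = params                      # gradient of the toy objective
    params = gossip_sgd_step(params, grads, lr=0.1, rng=rng)
```

The synchronous step needs a global reduction on every iteration, while the gossip step only needs each worker to exchange parameters with a single peer, which is the trade-off the findings above refer to.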

Abstract

Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS). To minimize training time, the training of a deep neural network must be scaled beyond a single machine to as many machines as possible by distributing the optimization method used for training. While a number of approaches have been proposed for distributed stochastic gradient descent (SGD), at the current time synchronous approaches to distributed SGD appear to be showing the greatest performance at large scale. Synchronous scaling of SGD suffers from the need to synchronize all processors on each gradient step and is not resilient in the face of failing or lagging processors. In asynchronous approaches using parameter servers, training is slowed…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Machine Learning and ELM

Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization · Max Pooling · Residual Connection