Trade-offs of Local SGD at Scale: An Empirical Study

Jose Javier Gonzalez Ortiz; Jonathan Frankle; Mike Rabbat; Ari Morcos; Nicolas Ballas

arXiv:2110.08133·cs.LG·October 18, 2021

Open Access

TL;DR

This paper empirically investigates local SGD at scale, revealing a trade-off between reduced communication and lower accuracy, and proposes momentum techniques to improve outcomes without extra communication.

Contribution

The paper provides a comprehensive large-scale empirical study of local SGD and related methods, highlighting the accuracy cost of reduced communication and the improvement available from momentum methods.

Findings

01. Local SGD reduces communication but lowers accuracy at scale.

02. Incorporating slow momentum improves accuracy without additional communication.

03. Trade-offs become more pronounced in large-scale distributed training.
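
The slow momentum finding can be illustrated with a small sketch. It assumes the commonly described slow momentum formulation, in which an outer momentum buffer is updated once per synchronization from the parameters the workers already average; whether this matches the exact variant evaluated in the paper is an assumption. The function name `slow_momentum_update` and the parameters `slow_lr` and `beta` are hypothetical names chosen for illustration.

```python
def slow_momentum_update(x_start, x_avg, u, lr, slow_lr=1.0, beta=0.5):
    """One outer (slow momentum) update applied after a synchronization.

    x_start: parameters at the start of the round (list of floats)
    x_avg:   worker-averaged parameters after the local steps
    u:       slow momentum buffer from the previous round
    lr:      learning rate used for the local SGD steps
    slow_lr, beta: hypothetical outer learning rate and momentum factor
    """
    # Treat the averaged displacement over the round as a pseudo-gradient.
    new_u = [beta * ui + (xs - xa) / lr
             for ui, xs, xa in zip(u, x_start, x_avg)]
    # Step from the round's starting point along the momentum buffer.
    new_x = [xs - slow_lr * lr * ui for xs, ui in zip(x_start, new_u)]
    return new_x, new_u


# With beta=0 and slow_lr=1 the update reduces to plain parameter averaging.
x_plain, _ = slow_momentum_update([1.0], [0.5], [0.0], lr=0.1,
                                  slow_lr=1.0, beta=0.0)
# With beta>0 and a nonzero buffer, the step carries past the average.
x_mom, u_mom = slow_momentum_update([1.0], [0.5], [2.0], lr=0.1,
                                    slow_lr=1.0, beta=0.5)
print(x_plain, x_mom, u_mom)
```

The update consumes only `x_avg`, which is already exchanged at each synchronization, so in this sketch the momentum step adds no communication beyond what local SGD performs.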

Abstract

As datasets and models become increasingly large, distributed training has become a necessary component to allow deep neural networks to train in reasonable amounts of time. However, distributed training can have substantial communication overhead that hinders its scalability. One strategy for reducing this overhead is to perform multiple unsynchronized SGD steps independently on each worker between synchronization steps, a technique known as local SGD. We conduct a comprehensive empirical study of local SGD and related methods on a large-scale image classification task. We find that performing local SGD comes at a price: lower communication costs (and thereby faster training) are accompanied by lower accuracy. This finding is in contrast to the smaller-scale experiments in prior work, suggesting that local SGD encounters challenges at scale. We further show that incorporating the…
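
The local SGD scheme described in the abstract can be sketched as a single-process simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the workers are simulated sequentially, the task is a toy 1-D least-squares problem rather than image classification, and the names `local_sgd`, `worker_data`, `local_steps`, and `rounds` are hypothetical.

```python
import random


def local_sgd(worker_data, w0, lr, local_steps, rounds, seed=0):
    """Simulate local SGD on the toy model y ~ w * x.

    worker_data: one list of (x, y) samples per simulated worker
    local_steps: unsynchronized SGD steps each worker takes per round
    rounds:      number of synchronization (averaging) steps
    Returns the final averaged parameter and the number of synchronizations.
    """
    rng = random.Random(seed)
    w_global = w0
    syncs = 0
    for _ in range(rounds):
        local_params = []
        for data in worker_data:
            w = w_global  # every worker starts from the synchronized value
            for _ in range(local_steps):
                x, y = rng.choice(data)
                grad = 2.0 * (w * x - y) * x  # gradient of (w*x - y)^2
                w -= lr * grad
            local_params.append(w)
        # Synchronization step: average the parameters across workers.
        w_global = sum(local_params) / len(local_params)
        syncs += 1
    return w_global, syncs


# Two simulated workers holding noiseless samples of y = 3x.
workers = [[(0.5, 1.5), (1.0, 3.0)], [(1.0, 3.0), (1.5, 4.5)]]
w, syncs = local_sgd(workers, w0=0.0, lr=0.1, local_steps=4, rounds=20)
print(w, syncs)
```

In this sketch each worker takes `local_steps * rounds` gradient steps but communicates only `rounds` times, so raising `local_steps` lowers communication per gradient step. The toy problem is too simple to show the accuracy loss the paper reports at scale.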

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning

Methods: Local SGD · Stochastic Gradient Descent