Trade-offs of Local SGD at Scale: An Empirical Study
Jose Javier Gonzalez Ortiz, Jonathan Frankle, Mike Rabbat, Ari Morcos, Nicolas Ballas

TL;DR
This paper empirically investigates local SGD at scale. It finds that the reduced communication comes at the cost of lower accuracy, and shows that slow momentum improves accuracy without extra communication.
Contribution
It provides a comprehensive large-scale empirical analysis of local SGD, documenting its accuracy cost at scale and the improvement available from momentum methods.
Findings
Local SGD reduces communication but lowers accuracy at scale.
Incorporating slow momentum improves accuracy without additional communication.
Trade-offs become more pronounced in large-scale distributed training.
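The summary above does not give the slow momentum update itself. As a minimal sketch, assuming a SlowMo-style outer update, the net parameter movement over one round of local steps is treated as a pseudo-gradient and fed into a momentum buffer. The function name `slow_momentum_step` and the parameters `alpha` (slow learning rate) and `beta` (slow momentum coefficient) are illustrative assumptions, not taken from this page. The update uses only the already-averaged parameters, which is why it would add no communication.

```python
def slow_momentum_step(x_prev, x_avg, u, lr, alpha=1.0, beta=0.5):
    """One outer update, applied after workers have averaged their parameters.

    Assumed SlowMo-style form (hypothetical names, scalar parameters for brevity):
      x_prev: parameters at the start of the round
      x_avg:  average of worker parameters after the local steps
      u:      slow momentum buffer carried across rounds
      lr:     learning rate used for the local steps
    """
    # Treat the round's net movement as a pseudo-gradient and accumulate it.
    u_new = beta * u + (x_prev - x_avg) / lr
    # Step from the round's starting point along the slow momentum direction.
    x_new = x_prev - alpha * lr * u_new
    return x_new, u_new


# With beta = 0 and alpha = 1 the update reduces to plain parameter averaging.
plain_x, plain_u = slow_momentum_step(x_prev=1.0, x_avg=2.0, u=0.0, lr=0.1, beta=0.0)

# With a non-zero buffer, the step overshoots the average in the momentum direction.
mom_x, mom_u = slow_momentum_step(x_prev=1.0, x_avg=2.0, u=-10.0, lr=0.1, beta=0.5)
```

Setting `beta=0.0` and `alpha=1.0` recovers ordinary local SGD averaging, so under this sketch slow momentum is a strict generalization of the averaging step.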
Abstract
As datasets and models become increasingly large, distributed training has become a necessary component to allow deep neural networks to train in reasonable amounts of time. However, distributed training can have substantial communication overhead that hinders its scalability. One strategy for reducing this overhead is to perform multiple unsynchronized SGD steps independently on each worker between synchronization steps, a technique known as local SGD. We conduct a comprehensive empirical study of local SGD and related methods on a large-scale image classification task. We find that performing local SGD comes at a price: lower communication costs (and thereby faster training) are accompanied by lower accuracy. This finding is in contrast to the smaller-scale experiments in prior work, suggesting that local SGD encounters challenges at scale. We further show that incorporating the…
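The local SGD procedure described in the abstract can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the paper's experimental setup: each simulated worker minimizes a hypothetical scalar objective 0.5 * (x - c)^2 with its own target c, takes `local_steps` unsynchronized gradient steps, and the workers then average their parameters once per round. The function name `local_sgd` and all parameter names are illustrative.

```python
import random


def local_sgd(targets, rounds=50, local_steps=4, lr=0.1, noise=0.0, seed=0):
    """Toy local SGD where worker i minimizes 0.5 * (x - targets[i]) ** 2.

    Returns the final shared parameter and the number of synchronizations.
    """
    rng = random.Random(seed)
    x = 0.0          # shared parameter, identical on all workers after each sync
    num_syncs = 0
    for _ in range(rounds):
        local_params = []
        for c in targets:
            xi = x  # every worker starts the round from the shared parameter
            for _ in range(local_steps):
                # Gradient of the worker's objective, optionally with noise.
                grad = (xi - c) + noise * rng.gauss(0.0, 1.0)
                xi -= lr * grad  # unsynchronized local step
            local_params.append(xi)
        # Synchronization step: one parameter average per round.
        x = sum(local_params) / len(local_params)
        num_syncs += 1
    return x, num_syncs


x_final, num_syncs = local_sgd([1.0, 2.0, 3.0, 6.0])
print(x_final, num_syncs)
```

In this sketch the workers take `rounds * local_steps` gradient steps in total but communicate only `rounds` times, which is the communication saving the abstract refers to. The toy objective is too simple to reproduce the accuracy loss reported at scale.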
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
Methods: Local SGD · Stochastic Gradient Descent
