SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
Jianyu Wang, Vinayak Tantia, Nicolas Ballas, Michael Rabbat

TL;DR
SlowMo is a communication-efficient distributed SGD framework in which workers synchronize periodically and then apply a slow momentum update. It improves convergence and generalization in large-scale training tasks while keeping runtime comparable to the base optimizer.
Contribution
The paper proposes the SlowMo framework, provides the first theoretical convergence guarantees for BMUF, and demonstrates empirical improvements on image classification and machine translation tasks.
Findings
SlowMo improves optimization and generalization performance.
It maintains runtime comparable to base optimizers.
Theoretical convergence to stationary points is established.
Abstract
Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, where workers periodically synchronize and perform a momentum update, after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements…
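The outer loop described in the abstract (several local steps of a base optimizer, then a synchronization followed by a momentum update) can be sketched as below. This is a minimal single-process simulation, not the authors' implementation. It assumes plain SGD as the base optimizer, exact averaging as the synchronization step, and a slow-momentum update of the form u ← βu + (x_start − x_avg)/γ, x ← x_start − αγu. The names `slowmo_round` and `make_quadratic_grad`, the toy quadratic objectives, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
def slowmo_round(x0, u, worker_grads, tau, gamma, alpha, beta):
    """One outer round: tau local SGD steps per worker, average, slow momentum.

    x0: parameters at the start of the round (list of floats)
    u: slow momentum buffer (list of floats)
    worker_grads: one gradient function per simulated worker
    """
    local_results = []
    for grad_fn in worker_grads:
        x = list(x0)
        for _ in range(tau):  # base optimizer: plain SGD (assumption)
            g = grad_fn(x)
            x = [xi - gamma * gi for xi, gi in zip(x, g)]
        local_results.append(x)

    # Synchronization: exact average over workers (stands in for AllReduce).
    m = len(local_results)
    x_avg = [sum(col) / m for col in zip(*local_results)]

    # Slow momentum update on the averaged displacement (assumed form).
    u_new = [beta * ui + (x0i - xai) / gamma
             for ui, x0i, xai in zip(u, x0, x_avg)]
    x_new = [x0i - alpha * gamma * ui for x0i, ui in zip(x0, u_new)]
    return x_new, u_new


def make_quadratic_grad(c):
    """Gradient of the toy objective 0.5 * (x - c)^2 for one worker."""
    return lambda x: [xi - c for xi in x]


# Two simulated workers whose local optima are 1.0 and 3.0.
workers = [make_quadratic_grad(1.0), make_quadratic_grad(3.0)]

# With alpha=1 and beta=0 the round reduces to plain local SGD averaging.
x_plain, _ = slowmo_round([0.0], [0.0], workers,
                          tau=5, gamma=0.1, alpha=1.0, beta=0.0)

# Run several rounds with slow momentum enabled.
x, u = [0.0], [0.0]
for _ in range(100):
    x, u = slowmo_round(x, u, workers, tau=5, gamma=0.1, alpha=1.0, beta=0.5)
```

In this toy setting the minimizer of the averaged objective is 2.0, so the final `x` should sit very close to it. Setting `beta=0.0` and `alpha=1.0` turns the momentum step into a no-op, which makes it easy to compare against the base optimizer alone.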
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Privacy-Preserving Technologies in Data
Methods: Local SGD · SlowMo · Stochastic Gradient Descent
