SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum

Jianyu Wang; Vinayak Tantia; Nicolas Ballas; Michael Rabbat

arXiv:1910.00643 · cs.LG · February 21, 2020 · 68 cites

TL;DR

SlowMo introduces a communication-efficient distributed SGD method with periodic synchronization and momentum updates, improving convergence and generalization in large-scale training tasks while maintaining runtime efficiency.

Contribution

The paper proposes the SlowMo framework, providing the first theoretical convergence guarantees for BMUF and demonstrating practical improvements over standard distributed optimization methods.

Findings

01 SlowMo improves optimization and generalization performance.

02 It maintains runtime comparable to base optimizers.

03 Theoretical convergence to stationary points is established.

Abstract

Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, where workers periodically synchronize and perform a momentum update, after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements…
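The loop described in the abstract (several base-optimizer iterations per worker, then a synchronization followed by a momentum update) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the slow momentum rule used here (buffer = beta * buffer + (start - average) / base_lr, then start - slow_lr * base_lr * buffer) reflects one reading of the paper and should be checked against it. The function name, its parameters, the choice of Local SGD as the base optimizer, and the noisy quadratic objective are all hypothetical choices made for illustration.

```python
import random


def slowmo_quadratic(num_workers=4, tau=5, outer_steps=50, base_lr=0.1,
                     slow_lr=1.0, slow_beta=0.5, noise_std=0.1, seed=0):
    """Toy SlowMo-style loop minimizing f(x) = 0.5 * (x - 3)^2.

    Workers are simulated sequentially in one process; every name and
    default value here is an illustrative assumption.
    """
    rng = random.Random(seed)
    target = 3.0
    x_sync = 0.0    # parameters shared by all workers after each sync
    slow_buf = 0.0  # slow momentum buffer (assumed update rule, see lead-in)

    for _ in range(outer_steps):
        # Base optimizer: each worker runs tau local SGD steps
        # starting from the same synchronized parameters.
        local_params = []
        for _worker in range(num_workers):
            x = x_sync
            for _step in range(tau):
                grad = (x - target) + rng.gauss(0.0, noise_std)
                x -= base_lr * grad
            local_params.append(x)

        # Periodic synchronization: average the workers' parameters.
        x_avg = sum(local_params) / num_workers

        # Slow momentum update applied to the averaged displacement.
        slow_buf = slow_beta * slow_buf + (x_sync - x_avg) / base_lr
        x_sync = x_sync - slow_lr * base_lr * slow_buf

    return x_sync


if __name__ == "__main__":
    print(slowmo_quadratic())
```

With `slow_beta=0.0` and `slow_lr=1.0` the outer update reduces to plain parameter averaging, which is a quick way to compare the momentum variant against periodic averaging in this toy setting.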

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Privacy-Preserving Technologies in Data

Methods: Local SGD · SlowMo · Stochastic Gradient Descent