Global Momentum Compression for Sparse Communication in Distributed Learning
Chang-Wei Shi, Shen-Yi Zhao, Yin-Peng Xie, Hao Gao, Wu-Jun Li

TL;DR
This paper introduces GMC and GMC+, novel global momentum-based methods for sparse communication in distributed learning, demonstrating improved convergence and accuracy over local momentum approaches, especially with aggressive sparsification and non-IID data.
Contribution
The paper proposes GMC, the first global momentum approach for sparse communication in distributed learning, and extends it to GMC+ for better convergence under aggressive sparsification.
Findings
GMC and GMC+ outperform local momentum methods in accuracy.
GMC and GMC+ converge faster, especially with non-IID data.
Theoretical convergence proofs for GMC and GMC+ are provided.
Abstract
With the rapid growth of data, distributed momentum stochastic gradient descent (DMSGD) has been widely used in distributed learning, especially for training large-scale deep models. Due to the latency and limited bandwidth of the network, communication has become the bottleneck of distributed learning. Communication compression with sparsified gradient, abbreviated as sparse communication, has been widely employed to reduce communication cost. All existing works about sparse communication in DMSGD employ local momentum, in which the momentum only accumulates stochastic gradients computed by each worker locally. In this paper, we propose a novel method, called global momentum compression (GMC), for sparse communication. Different from existing works that utilize local momentum, GMC utilizes global momentum. Furthermore, to…
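To make the terms in the abstract concrete, the following is a minimal sketch of sparse communication with local momentum, the baseline setting the abstract contrasts GMC against. It is not the paper's GMC or GMC+ algorithm. The function names (`top_k_sparsify`, `local_momentum_step`), the choice of top-k as the sparsifier, and the residual buffer that carries unsent entries to the next round are all assumptions made for illustration.

```python
from typing import List, Tuple


def top_k_sparsify(vec: List[float], k: int) -> Tuple[List[float], List[float]]:
    """Keep the k largest-magnitude entries of vec.

    Returns (sparse, residual): sparse holds the kept entries and zeros
    elsewhere; residual holds the entries that were dropped.
    """
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:max(k, 0)])
    sparse = [v if i in keep else 0.0 for i, v in enumerate(vec)]
    residual = [0.0 if i in keep else v for i, v in enumerate(vec)]
    return sparse, residual


def local_momentum_step(
    grad: List[float],
    momentum: List[float],
    residual: List[float],
    beta: float,
    k: int,
) -> Tuple[List[float], List[float], List[float]]:
    """One worker-side step (illustrative assumption, not the paper's method).

    The momentum buffer accumulates only this worker's own stochastic
    gradients, which is what "local momentum" means in the abstract.
    """
    # Local momentum: depends only on gradients computed on this worker.
    momentum = [beta * m + g for m, g in zip(momentum, grad)]
    # Add what was left unsent in earlier rounds, then sparsify.
    pending = [r + m for r, m in zip(residual, momentum)]
    sparse, residual = top_k_sparsify(pending, k)
    # Only `sparse` would be communicated; `residual` stays on the worker.
    return sparse, momentum, residual


if __name__ == "__main__":
    msg, mom, res = local_momentum_step(
        grad=[1.0, 0.2], momentum=[0.0, 0.0], residual=[0.0, 0.0], beta=0.9, k=1
    )
    print(msg, res)
```

In this sketch each worker sends only `k` nonzero values per round, which is the source of the communication saving. A global momentum method would replace the per-worker `momentum` buffer with a quantity derived from the aggregated update across all workers; how GMC does this is specified in the paper, not here.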
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data
Methods: Stochastic Gradient Descent
