On the Convergence of Memory-Based Distributed SGD
Shen-Yi Zhao, Hao Gao, Wu-Jun Li

TL;DR
This paper develops a universal convergence analysis for memory-based distributed SGD with momentum, providing theoretical guarantees for both convex and non-convex problems and bridging the gap between theory and practice.
Contribution
It introduces a transformation equation linking M-DSGD to traditional DSGD, enabling convergence analysis with momentum and stagewise learning strategies.
Findings
Provides convergence rates for M-DSGD with momentum in convex and non-convex settings.
Introduces a transformation equation to relate M-DSGD to DSGD for theoretical analysis.
Bridges the gap between theoretical convergence and practical implementations.
Abstract
Distributed stochastic gradient descent (DSGD) has been widely used for optimizing large-scale machine learning models, including both convex and non-convex models. With the rapid growth of model size, communication cost has become the bottleneck of traditional DSGD. Recently, many communication compression methods have been proposed. Memory-based distributed stochastic gradient descent (M-DSGD) is one of the more efficient methods: each worker communicates only a sparse vector in each iteration, so the communication cost is small. Recent works establish the convergence rate of M-DSGD when it adopts vanilla SGD. However, there is still a lack of convergence theory for M-DSGD when it adopts momentum SGD. In this paper, we propose a universal convergence analysis for M-DSGD by introducing a transformation equation. The transformation equation describes the relation between…
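The memory mechanism the abstract describes can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes top-k sparsification as the compressor, vanilla SGD without momentum, and a memory that accumulates the learning-rate-scaled gradient. The names top_k_sparsify, worker_step, m_dsgd, and grad_fn are hypothetical and chosen only for this illustration.

```python
import numpy as np


def top_k_sparsify(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    k = min(k, v.size)
    out = np.zeros_like(v)
    if k > 0:
        idx = np.argpartition(np.abs(v), -k)[-k:]
        out[idx] = v[idx]
    return out


def worker_step(grad, memory, lr, k):
    """One compression step on a single worker (hypothetical helper).

    Assumption: the memory stores whatever was not sent in earlier
    iterations and is added back before sparsifying again.
    """
    acc = memory + lr * grad          # fold the new update into the residual
    sparse = top_k_sparsify(acc, k)   # the only vector that is communicated
    new_memory = acc - sparse         # unsent coordinates stay in memory
    return sparse, new_memory


def m_dsgd(grad_fn, w0, num_workers, steps, lr, k, seed=0):
    """Simulate the workers sequentially and average their sparse messages.

    grad_fn(w, rng) is an assumed interface returning a stochastic gradient.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    memories = [np.zeros_like(w0) for _ in range(num_workers)]
    for _ in range(steps):
        msgs = []
        for i in range(num_workers):
            g = grad_fn(w, rng)
            s, memories[i] = worker_step(g, memories[i], lr, k)
            msgs.append(s)
        w = w - np.mean(msgs, axis=0)  # server applies the averaged message
    return w, memories
```

In this sketch nothing is discarded: the sent vector plus the new memory always equals the old memory plus the scaled gradient, so dropped coordinates are only delayed. Setting k to the full dimension makes the memory stay at zero and reduces the loop to plain DSGD.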
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Sparse and Compressive Sensing Techniques
Methods: Stochastic Gradient Descent
