On the Convergence of Memory-Based Distributed SGD

Shen-Yi Zhao; Hao Gao; Wu-Jun Li

arXiv:1905.12960 · stat.ML · May 31, 2019 · 1 citation

Open Access

TL;DR

This paper develops a universal convergence analysis for memory-based distributed SGD with momentum, providing theoretical guarantees for both convex and non-convex problems and bridging the gap between theory and practice.

Contribution

It introduces a transformation equation linking M-DSGD to traditional DSGD, enabling convergence analysis with momentum and stagewise learning strategies.

Findings

01. Provides convergence rates for M-DSGD with momentum in convex and non-convex settings.

02. Introduces a transformation equation to relate M-DSGD to DSGD for theoretical analysis.

03. Bridges the gap between theoretical convergence and practical implementations.

Abstract

Distributed stochastic gradient descent (DSGD) has been widely used for optimizing large-scale machine learning models, including both convex and non-convex models. With the rapid growth of model size, the huge communication cost has become the bottleneck of traditional DSGD. Recently, many communication compression methods have been proposed. Memory-based distributed stochastic gradient descent (M-DSGD) is one of the efficient methods, since each worker communicates a sparse vector in each iteration so that the communication cost is small. Recent works establish the convergence rate of M-DSGD when it adopts vanilla SGD. However, there is still a lack of convergence theory for M-DSGD when it adopts momentum SGD. In this paper, we propose a universal convergence analysis for M-DSGD by introducing a transformation equation. The transformation equation describes the relation between…
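The abstract describes workers that each send a sparse vector per iteration while the remainder is kept in memory. The following is a minimal sketch of that general idea and not the authors' algorithm, which the truncated abstract does not specify. It assumes a top-k sparsifier, a local momentum buffer per worker, and a noisy quadratic toy objective. The names `top_k_sparsify`, `Worker`, and `run`, and all hyperparameter values, are hypothetical and chosen only for illustration.

```python
import random


def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    sparse = [0.0] * len(vec)
    for i in keep:
        sparse[i] = vec[i]
    return sparse


class Worker:
    """Hypothetical worker holding a momentum buffer and a residual memory."""

    def __init__(self, dim, k, beta):
        self.memory = [0.0] * dim    # update mass not yet communicated
        self.momentum = [0.0] * dim  # local momentum buffer (assumption)
        self.k = k
        self.beta = beta

    def step(self, grad, lr):
        # Momentum update on the local stochastic gradient.
        self.momentum = [self.beta * m + g for m, g in zip(self.momentum, grad)]
        # Add the scaled update to whatever was left over from earlier rounds.
        acc = [e + lr * m for e, m in zip(self.memory, self.momentum)]
        # Communicate only k entries; remember everything that was dropped.
        sparse = top_k_sparsify(acc, self.k)
        self.memory = [a - s for a, s in zip(acc, sparse)]
        return sparse


def run(num_workers=4, dim=10, k=2, steps=2000, lr=0.02, beta=0.5, seed=0):
    """Minimise 0.5 * ||w - target||^2 using noisy gradients on each worker."""
    rng = random.Random(seed)
    target = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    w = [0.0] * dim
    workers = [Worker(dim, k, beta) for _ in range(num_workers)]
    for _ in range(steps):
        total = [0.0] * dim
        for wk in workers:
            grad = [wi - ti + rng.gauss(0.0, 0.05) for wi, ti in zip(w, target)]
            msg = wk.step(grad, lr)
            total = [t + m for t, m in zip(total, msg)]
        # The server averages the sparse messages and applies them.
        w = [wi - t / num_workers for wi, t in zip(w, total)]
    return w, target
```

In this sketch the entries that are not sent stay in `memory` and are added back before the next sparsification, so no part of an update is discarded permanently. Calling `run()` returns the final parameters together with the toy target they are meant to approach.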

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Sparse and Compressive Sensing Techniques

Methods: Stochastic Gradient Descent