A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent
Shi Pu, Alex Olshevsky, Ioannis Ch. Paschalidis

TL;DR
This paper analyzes the transient time that distributed stochastic gradient descent (DSGD) needs to reach the optimal convergence rate when networked agents have access only to noisy gradients, and provides sharp bounds that depend on the spectral gap of the network and the number of agents.
Contribution
The paper gives a sharp characterization of the transient time DSGD needs to reach its asymptotic convergence rate, showing how it depends on the spectral gap of the mixing matrix and the number of agents n.
Findings
Transient time scales as n/(1-ρ_w)^2
Asymptotic convergence rate matches centralized SGD
Numerical experiments confirm theoretical bounds
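To make the n/(1-ρ_w)^2 scaling concrete, the following is a minimal sketch that evaluates it for a ring network. It assumes ρ_w is the spectral norm of W minus the averaging matrix (1/n)11^T, and it uses a hypothetical mixing matrix in which each agent weights itself and its two ring neighbors by 1/3. The function names are illustrative and do not come from the paper.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 agents.

    Assumption for illustration: weight 1/3 on self and on each neighbor.
    """
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] += 1.0 / 3.0
        W[i, (i + 1) % n] += 1.0 / 3.0
    return W


def rho_w(W):
    """Spectral norm of W - (1/n) 11^T, assumed here as the definition of rho_w."""
    n = W.shape[0]
    return np.linalg.norm(W - np.ones((n, n)) / n, ord=2)


def transient_scale(W):
    """Evaluate n / (1 - rho_w)^2, ignoring constants hidden in the bound."""
    n = W.shape[0]
    return n / (1.0 - rho_w(W)) ** 2


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        W = ring_mixing_matrix(n)
        print(n, rho_w(W), transient_scale(W))
```

On a ring the spectral gap shrinks as n grows, so this quantity grows much faster than n itself. That is the kind of network dependence the bound is meant to capture.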
Abstract
This paper is concerned with minimizing the average of cost functions over a network in which agents may communicate and exchange information with each other. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, DSGD asymptotically achieves the optimal network independent convergence rate compared to centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach the asymptotic convergence rate, which we show behaves as O(n/(1-ρ_w)^2), where 1-ρ_w denotes the spectral gap of the mixing matrix. Moreover, we construct a "hard" optimization problem for which we…
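The method described in the abstract can be sketched as a short simulation. This is an illustrative sketch under stated assumptions, not the authors' implementation: each agent i holds a hypothetical quadratic cost f_i(x) = 0.5 ||x - b_i||^2, gradients are corrupted by Gaussian noise, agents mix over a ring with weight 1/3 on self and each neighbor, and the step size is 1/(k + offset). The update takes a local noisy gradient step and then averages with neighbors through W. All names and parameter values are chosen for the example.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Hypothetical doubly stochastic ring weights (1/3 on self and each neighbor), n >= 3."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] += 1.0 / 3.0
        W[i, (i + 1) % n] += 1.0 / 3.0
    return W


def dsgd(W, b, num_iters=2000, noise_std=0.1, offset=20, seed=0):
    """Run a DSGD-style iteration on local costs f_i(x) = 0.5 * ||x - b_i||^2.

    Row i of the returned array is agent i's final iterate.
    """
    rng = np.random.default_rng(seed)
    n, d = b.shape
    x = np.zeros((n, d))  # every agent starts at the origin
    for k in range(num_iters):
        alpha = 1.0 / (k + offset)  # diminishing step size (assumed schedule)
        # Noisy local gradients: exact gradient (x_i - b_i) plus Gaussian noise.
        grads = (x - b) + noise_std * rng.standard_normal((n, d))
        # Local gradient step followed by averaging with neighbors via W.
        x = W @ (x - alpha * grads)
    return x


if __name__ == "__main__":
    n, d = 8, 3
    b = np.random.default_rng(1).standard_normal((n, d))
    x = dsgd(ring_mixing_matrix(n), b)
    # The minimizer of the average cost is the mean of the b_i.
    print(np.max(np.abs(x - b.mean(axis=0))))
```

For these quadratic costs the minimizer of the average is the mean of the b_i, so the printed value measures how far the worst agent is from the optimum after the final iteration.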
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Distributed Control Multi-Agent Systems · Sparse and Compressive Sensing Techniques
