Convergence Analysis of Decentralized ASGD
Mauro DL Tosi, Martin Theobald

TL;DR
This paper presents a new convergence analysis for decentralized asynchronous SGD (DASGD) that requires neither a central parameter server nor partial synchronization among nodes, making the guarantees relevant to large-scale distributed machine learning.
Contribution
It introduces a novel convergence-rate bound for DASGD that applies to arbitrary network topologies and does not require partial synchronization.
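To make the setting concrete, here is a minimal toy simulation of decentralized, asynchronous SGD. It is an illustrative sketch, not the algorithm analyzed in the paper: the function name `simulate_dasgd`, the scalar objective f(x) = 0.5·x², the pairwise neighbor averaging, and the way staleness is emulated (a gradient computed on an older local snapshot) are all assumptions chosen for brevity.

```python
import random

def simulate_dasgd(neighbors, steps=2000, lr=0.05, noise=0.1, max_delay=3, seed=0):
    """Toy decentralized asynchronous SGD on f(x) = 0.5 * x**2 (hypothetical sketch).

    neighbors: dict mapping node id -> list of neighbor ids (any connected graph).
    Returns the final local models, one scalar per node.
    """
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    models = {i: 5.0 for i in nodes}       # every node starts far from the optimum 0
    history = {i: [5.0] for i in nodes}    # past local models, used to emulate staleness

    for _ in range(steps):
        i = rng.choice(nodes)              # one node wakes up; there is no global barrier
        delay = rng.randint(0, max_delay)  # the gradient is computed on a stale snapshot
        stale = history[i][max(0, len(history[i]) - 1 - delay)]
        grad = stale + rng.gauss(0.0, noise)  # stochastic gradient of 0.5 * x**2
        models[i] -= lr * grad             # fixed step size

        j = rng.choice(neighbors[i])       # average with one neighbor only,
        avg = 0.5 * (models[i] + models[j])  # so no parameter server is involved
        models[i] = models[j] = avg
        history[i].append(avg)
        history[j].append(avg)

    return [models[i] for i in nodes]

# Usage: an 8-node ring, but any connected neighbor map works.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
final = simulate_dasgd(ring)
```

The topology enters only through the `neighbors` map, which mirrors the claim that the bound applies to arbitrary network topologies; the simulation itself proves nothing about the rate.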
Findings
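Under bounded gradients, DASGD converges at a rate of O(σε^{-2}) + O(QS_{avg}ε^{-3/2}) + O(S_{avg}ε^{-1}), where S_{avg} is the average staleness between models, Q bounds the norm of the gradients, and ε is the allowed error.
When gradients are unbounded, the rate is O(σε^{-2}) + O(√(S_{avg}S_{max})ε^{-1}), where S_{avg} and S_{max} are loose versions of the average and maximum staleness.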
The analysis holds for non-convex, L-smooth objective functions and assumes a fixed step size.
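The relative size of the terms in these bounds can be explored numerically. The helper below is a rough sketch: `dasgd_rate_terms` is a hypothetical name, every constant hidden by the O(·) notation is set to 1, and reading the bound as an iteration count is an assumption, so the outputs indicate relative magnitudes only.

```python
import math

def dasgd_rate_terms(eps, sigma, s_avg, s_max=None, q=None):
    """Evaluate each term of the stated bounds with all hidden constants set to 1.

    Pass q (gradient-norm bound) for the bounded-gradient case;
    otherwise pass s_max for the unbounded-gradient case.
    """
    terms = {"noise": sigma * eps ** -2}
    if q is not None:
        # Bounded gradients: sigma*eps^-2 + Q*S_avg*eps^-3/2 + S_avg*eps^-1
        terms["gradient_bound_x_staleness"] = q * s_avg * eps ** -1.5
        terms["staleness"] = s_avg * eps ** -1
    else:
        # Unbounded gradients: sigma*eps^-2 + sqrt(S_avg*S_max)*eps^-1
        if s_max is None:
            raise ValueError("s_max is required when q is not given")
        terms["staleness"] = math.sqrt(s_avg * s_max) * eps ** -1
    return terms

bounded = dasgd_rate_terms(eps=0.01, sigma=1.0, s_avg=4.0, q=2.0)
unbounded = dasgd_rate_terms(eps=0.01, sigma=1.0, s_avg=4.0, s_max=9.0)
dominant = max(bounded, key=bounded.get)  # which term is largest at this eps
```

For these example inputs the noise term σε^{-2} is the largest. Since it has the highest power of 1/ε, it dominates for sufficiently small ε whenever σ > 0, and staleness affects only the lower-order terms.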
Abstract
Over the last decades, Stochastic Gradient Descent (SGD) has been intensively studied by the Machine Learning community. Despite its versatility and excellent performance, the optimization of large models via SGD is still a time-consuming task. To reduce training time, it is common to distribute the training process across multiple devices. Recently, it has been shown that asynchronous SGD (ASGD) always converges faster than mini-batch SGD. However, despite these improvements in the theoretical bounds, most ASGD convergence-rate proofs still rely on a centralized parameter server, which is prone to becoming a bottleneck when scaling out the gradient computations across many distributed processes. In this paper, we present a novel convergence-rate analysis for decentralized and asynchronous SGD (DASGD) which does not require partial synchronization among nodes nor…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Memory and Neural Computing · Molecular Communication and Nanonetworks
Methods: Stochastic Gradient Descent
