Asynchronous Decentralized Parallel Stochastic Gradient Descent
Xiangru Lian, Wei Zhang, Ce Zhang, Ji Liu

TL;DR
This paper introduces AD-PSGD, an asynchronous decentralized stochastic gradient descent algorithm that is robust in heterogeneous environments, communication-efficient, and achieves optimal convergence rates, outperforming existing methods especially at large GPU scales.
Contribution
The paper presents AD-PSGD, the first asynchronous decentralized SGD algorithm with optimal convergence and linear speedup, suitable for large-scale heterogeneous distributed systems.
Findings
AD-PSGD converges at the optimal $O(1/\sqrt{K})$ rate.
AD-PSGD outperforms existing decentralized and asynchronous SGD methods.
Training ResNet-50 on ImageNet with 128 GPUs, AD-PSGD achieves similar convergence to AllReduce-SGD with 4-8X faster epochs.
Abstract
Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a parameter server suffer from 1) a communication bottleneck at the parameter servers when there are many workers, and 2) significantly worse convergence when traffic to the parameter server is congested. Can we design an algorithm that is robust in a heterogeneous environment, while being communication efficient and maintaining the best-possible convergence rate? In this paper, we propose an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) satisfying all of the above expectations. Our theoretical analysis shows AD-PSGD converges at the same optimal $O(1/\sqrt{K})$ rate as SGD and has linear speedup w.r.t. the number of workers. Empirically, AD-PSGD outperforms…
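The abstract does not spell out the update rule, so the following is only a toy sketch of the general idea of asynchronous, decentralized SGD with pairwise model averaging, not the authors' implementation. It assumes a ring of workers, scalar models, quadratic local objectives, and a serialized loop that picks a random worker each step to stand in for real asynchrony. The function name `simulate_ad_psgd` and all of its parameters are hypothetical.

```python
import random


def simulate_ad_psgd(targets, steps=20000, lr=0.01, noise=0.1, seed=0):
    """Serialized toy simulation of gossip-style asynchronous decentralized SGD.

    Assumption: worker i holds a scalar model x[i] and the local objective
    0.5 * (x - targets[i]) ** 2, so the averaged objective is minimized at
    mean(targets). There is no parameter server and no global barrier.
    """
    rng = random.Random(seed)
    n = len(targets)
    x = [0.0] * n  # one local model per worker, no central copy

    for _ in range(steps):
        # Whichever worker "finishes next"; a stand-in for true asynchrony.
        i = rng.randrange(n)

        # Noisy gradient of worker i's local objective at its current model.
        grad = (x[i] - targets[i]) + rng.gauss(0.0, noise)

        # Pairwise averaging with one randomly chosen ring neighbor.
        j = (i + rng.choice((-1, 1))) % n
        avg = 0.5 * (x[i] + x[j])
        x[i] = avg
        x[j] = avg

        # Apply the gradient that was computed before the averaging step.
        x[i] -= lr * grad

    return x


if __name__ == "__main__":
    models = simulate_ad_psgd([1.0, 2.0, 3.0, 4.0])
    print("local models:", models)
    print("average model:", sum(models) / len(models))
```

In this sketch each step touches only two workers, which is the property that lets slow workers avoid stalling the others. With a small constant step size, the local models should end up close to one another and close to the mean of `targets`.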
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Sparse and Compressive Sensing Techniques
Methods: Stochastic Gradient Descent
