Asynchronous Decentralized Parallel Stochastic Gradient Descent

Xiangru Lian; Wei Zhang; Ce Zhang; Ji Liu

arXiv:1710.06952·math.OC·September 26, 2018·68 cites

TL;DR

This paper introduces AD-PSGD, an asynchronous decentralized stochastic gradient descent algorithm that is robust in heterogeneous environments, communication-efficient, and converges at the optimal rate. It outperforms existing methods, especially at large GPU counts.

Contribution

The paper presents AD-PSGD, the first asynchronous decentralized SGD algorithm with optimal convergence and linear speedup, suitable for large-scale heterogeneous distributed systems.

Findings

1. AD-PSGD converges at the optimal $O(1/\sqrt{K})$ rate.

2. AD-PSGD outperforms existing decentralized and asynchronous SGD methods.

3. When training ResNet-50 on ImageNet with 128 GPUs, AD-PSGD achieves convergence similar to AllReduce-SGD with 4-8X faster epochs.

Abstract

Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a parameter server suffer from 1) a communication bottleneck at the parameter servers when there are many workers, and 2) significantly worse convergence when traffic to the parameter server is congested. Can we design an algorithm that is robust in a heterogeneous environment, while being communication efficient and maintaining the best-possible convergence rate? In this paper, we propose an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) satisfying all of the above expectations. Our theoretical analysis shows AD-PSGD converges at the same optimal $O(1/\sqrt{K})$ rate as SGD and has linear speedup w.r.t. the number of workers. Empirically, AD-PSGD outperforms…
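
To make the idea in the abstract concrete, the following is a minimal single-process sketch in the spirit of asynchronous decentralized SGD. It is a toy under stated assumptions, not the authors' implementation: the ring topology, the scalar quadratic objective, the `speeds` weights that model one slow worker, and every name (`simulate_ad_psgd`, `snapshots`, `target`) are hypothetical choices made for illustration. At each step one worker finishes a gradient computed on a possibly stale copy of its model, averages its model with one randomly chosen neighbor, and then applies that gradient.

```python
import random


def simulate_ad_psgd(num_workers=8, steps=5000, lr=0.05, target=3.0,
                     noise_std=0.1, seed=0):
    """Toy simulation: every worker minimizes 0.5 * (x - target)**2."""
    rng = random.Random(seed)
    # Each worker keeps its own copy of the (scalar) model.
    models = [rng.uniform(-5.0, 5.0) for _ in range(num_workers)]
    # The model value each worker read when it began its current gradient.
    snapshots = list(models)
    # Assumed heterogeneity: worker 0 finishes work ten times less often.
    speeds = [1.0] * num_workers
    speeds[0] = 0.1

    for _ in range(steps):
        # The next worker to finish is drawn in proportion to its speed;
        # no worker waits for any other.
        i = rng.choices(range(num_workers), weights=speeds)[0]
        # Noisy gradient evaluated at the (possibly stale) snapshot.
        grad = (snapshots[i] - target) + rng.gauss(0.0, noise_std)
        # Pick one ring neighbor and average the two models (gossip step).
        j = (i + rng.choice((-1, 1))) % num_workers
        avg = 0.5 * (models[i] + models[j])
        models[j] = avg
        # Apply the gradient on top of the averaged model.
        models[i] = avg - lr * grad
        # Worker i starts its next gradient from its fresh model.
        snapshots[i] = models[i]
    return models


if __name__ == "__main__":
    final = simulate_ad_psgd()
    print("max distance from target:", max(abs(m - 3.0) for m in final))
```

In this toy there is no global barrier and no central server: the slow worker only delays its own updates, and each communication involves just two workers.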

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Sparse and Compressive Sensing Techniques

Methods: Stochastic Gradient Descent