Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

Xiangru Lian; Ce Zhang; Huan Zhang; Cho-Jui Hsieh; Wei Zhang; Ji Liu

arXiv:1705.09056 · math.OC · September 12, 2017 · 406 cites

TL;DR

This paper investigates whether decentralized parallel stochastic gradient descent algorithms can outperform centralized ones by reducing communication costs, supported by theoretical analysis and extensive empirical validation across multiple platforms and network conditions.

Contribution

The paper provides the first theoretical analysis showing regimes where decentralized algorithms can outperform centralized algorithms in distributed stochastic gradient descent.

Findings

01. D-PSGD can be up to ten times faster than C-PSGD in low bandwidth or high latency networks.

02. Decentralized algorithms have comparable computational complexity but lower communication costs.

03. Empirical validation across CNTK, Torch, and multiple GPU configurations supports the theoretical results.
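
To make the communication-cost finding concrete, the following is a rough counting sketch, not the paper's cost model. It assumes one message per model or gradient per link per iteration, and the function name `busiest_node_messages` and the topology labels are hypothetical names chosen for illustration. Under that assumption, the central node of a parameter-server setup handles traffic that grows with the number of workers, while a node in a decentralized network only talks to its neighbors.

```python
def busiest_node_messages(n_workers: int, topology: str, degree: int = 2) -> int:
    """Messages handled per iteration by the busiest node (illustrative model).

    Assumption: each link carries one message in each direction per iteration.
    """
    if topology == "parameter_server":
        # The central node receives one gradient from every worker
        # and sends the updated model back to every worker.
        return 2 * n_workers
    if topology == "decentralized":
        # Every node sends its model to, and receives one from, each neighbor.
        # degree=2 corresponds to a ring.
        return 2 * degree
    raise ValueError(f"unknown topology: {topology}")


for n in (4, 16, 64):
    print(n,
          busiest_node_messages(n, "parameter_server"),
          busiest_node_messages(n, "decentralized"))
```

In this toy model the parameter-server count grows linearly with the number of workers, while the ring count stays fixed, which is the kind of gap that matters most on low bandwidth or high latency networks.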

Abstract

Most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies in the high communication cost at the central node. Motivated by this, we ask: can decentralized algorithms be faster than their centralized counterparts? Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has comparable total computational complexities to…
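
As a minimal sketch of the kind of algorithm the abstract refers to, the code below simulates a decentralized SGD loop in a single process. It is not the authors' implementation. The assumptions are: a ring topology with uniform 1/3 mixing weights, a noise-free least-squares objective, and an update in which each node averages its model with its neighbors' models and then subtracts a step along its own local stochastic gradient. The names `ring_mixing_matrix` and `dpsgd` and all default parameters are hypothetical.

```python
import numpy as np


def ring_mixing_matrix(n: int) -> np.ndarray:
    """Symmetric, doubly stochastic weights for a ring (assumed topology).

    Each node averages itself with its left and right neighbor, weight 1/3 each.
    """
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W


def dpsgd(A, b, n_nodes=8, steps=500, lr=0.05, batch=4, seed=0):
    """Simulate decentralized SGD on least squares; returns one model per node."""
    rng = np.random.default_rng(seed)
    W = ring_mixing_matrix(n_nodes)
    # Each node owns a disjoint shard of the data.
    shards_A = np.array_split(A, n_nodes)
    shards_b = np.array_split(b, n_nodes)
    X = np.zeros((n_nodes, A.shape[1]))  # row i is node i's local model
    for _ in range(steps):
        G = np.zeros_like(X)
        for i in range(n_nodes):
            # Local minibatch gradient, evaluated at the node's own model.
            idx = rng.integers(0, len(shards_b[i]), size=batch)
            Ai, bi = shards_A[i][idx], shards_b[i][idx]
            G[i] = Ai.T @ (Ai @ X[i] - bi) / batch
        # Neighbor averaging (W @ X) followed by the local gradient step.
        X = W @ X - lr * G
    return X


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(256, 5))
    x_true = rng.normal(size=5)
    X = dpsgd(A, A @ x_true)
    print("error of averaged model:", np.linalg.norm(X.mean(axis=0) - x_true))
```

No node ever holds a global model here: the only cross-node operation is the multiplication by `W`, which in a real deployment would be a model exchange with two ring neighbors rather than a round trip to a central server.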

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Distributed Control Multi-Agent Systems · Age of Information Optimization