Understanding Top-k Sparsification in Distributed Deep Learning

Shaohuai Shi; Xiaowen Chu; Ka Chun Cheung; Simon See

arXiv:1911.08772 · cs.LG · November 21, 2019 · 67 cites

TL;DR

This paper investigates the behavior of Top-k sparsification in distributed deep learning, providing a tighter theoretical analysis, empirical insights into gradient distributions, and an efficient approximate top-k selection algorithm to enhance scalability.

Contribution

It offers a detailed analysis of Top-k sparsification, derives a tighter convergence bound, and proposes a GPU-efficient approximate top-k algorithm for improved scalability.

Findings

01 Gradient distributions during training are characterized.
02 A tighter convergence bound for TopK-SGD is derived.
03 An efficient approximate top-k algorithm reduces computational overhead.

Abstract

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of Top-$k$ operator in gradient sparsification and use relaxed bounds (e.g., exact bound of Random-$k$) for analysis; hence the derived results cannot well describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through…
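
The abstract refers to Top-$k$ sparsification with error compensation without spelling out the mechanics. The sketch below illustrates the general idea under stated assumptions: each worker keeps only the k largest-magnitude entries of its accumulated gradient, stores the rest in a local residual, and adds that residual back at the next step. The names `topk_sparsify`, `TopKCompressor`, and `aggregate` are hypothetical and chosen for illustration; this is plain Python on lists, not the code in the authors' repository and not their approximate selection algorithm.

```python
import heapq


def topk_sparsify(values, k):
    """Return (indices, kept_values) for the k entries of largest magnitude."""
    idx = heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i]))
    idx.sort()  # sorted indices make the sparse message deterministic
    return idx, [values[i] for i in idx]


class TopKCompressor:
    """Illustrative Top-k sparsifier with a local residual (error compensation)."""

    def __init__(self, size, k):
        self.k = k
        self.residual = [0.0] * size  # gradient mass not yet communicated

    def compress(self, grad):
        # Add back whatever earlier steps left unsent.
        acc = [g + r for g, r in zip(grad, self.residual)]
        idx, vals = topk_sparsify(acc, self.k)
        # Entries that are sent are cleared; everything else is carried over.
        self.residual = acc[:]
        for i in idx:
            self.residual[i] = 0.0
        return idx, vals


def aggregate(sparse_msgs, size):
    """Average sparse (indices, values) messages from workers into a dense vector."""
    out = [0.0] * size
    for idx, vals in sparse_msgs:
        for i, v in zip(idx, vals):
            out[i] += v
    return [x / len(sparse_msgs) for x in out]
```

In this sketch a worker would call `compress` on its local gradient every iteration and exchange only the returned index/value pairs, so the message size scales with k rather than with the model dimension. Entries that are small on one step are not discarded; they accumulate in `residual` until they are large enough to be selected.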

Code & Models

Repositories

hclhkbu/GaussianK-SGD (PyTorch, official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and ELM

Methods: Gradient Sparsification