Understanding Top-k Sparsification in Distributed Deep Learning
Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, Simon See

TL;DR
This paper investigates the behavior of Top-k sparsification in distributed deep learning, providing a tighter theoretical analysis, empirical insights into gradient distributions, and an efficient approximate top-k selection algorithm to enhance scalability.
Contribution
It offers a detailed analysis of Top-k sparsification, derives a tighter convergence bound, and proposes a GPU-efficient approximate top-k algorithm for improved scalability.
Findings
Gradient distributions under TopK-SGD are characterized empirically over the course of training.
A tighter convergence bound for TopK-SGD is derived.
An efficient approximate top-k selection algorithm reduces the computational overhead of selection on GPUs.
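This summary does not spell out the approximate selection algorithm, so the following is only an illustrative sketch of one way a gradient distribution could replace an exact top-k search. It assumes the gradient entries are roughly bell-shaped, fits a normal distribution to them, and keeps the entries beyond the threshold whose two-sided tail mass is k/d. The function name `threshold_select` and this whole construction are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist, fmean, pstdev


def threshold_select(vec, k):
    """Return indices of roughly k entries far from the mean (hypothetical sketch).

    Assumes the entries of vec are approximately normally distributed, so the
    number of indices returned is only approximately k.
    """
    d = len(vec)
    if k <= 0 or d == 0:
        return []
    if k >= d:
        return list(range(d))
    mu = fmean(vec)
    sigma = pstdev(vec, mu)
    if sigma == 0.0:
        # All entries are equal; any k indices are equally valid.
        return list(range(k))
    # Choose t so that P(|x - mu| > t) = k / d under the fitted normal.
    t = NormalDist(0.0, sigma).inv_cdf(1.0 - k / (2.0 * d))
    # For zero-centered gradients this matches selection by magnitude.
    return [i for i, v in enumerate(vec) if abs(v - mu) > t]


if __name__ == "__main__":
    grad = [0.0] * 98 + [10.0, -10.0]
    print(threshold_select(grad, 2))
```

A single pass to estimate the mean and standard deviation, followed by an elementwise comparison, avoids the sort or partial sort that exact top-k selection needs. The trade-off is that the number of selected entries is no longer exactly k.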
Abstract
Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-k sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of the Top-k operator in gradient sparsification and use relaxed bounds (e.g., exact bound of Random-k) for analysis; hence the derived results cannot well describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through…
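As a rough illustration of what "Top-k sparsification with error compensation" means, the sketch below keeps the k largest-magnitude entries of each worker's gradient and carries the dropped entries forward as a local residual that is added back at the next step. It is a minimal single-process sketch using plain Python lists; the names `top_k_sparsify`, `TopKWorker`, and `aggregate` are hypothetical, and it is not the authors' implementation.

```python
import heapq


def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest."""
    keep = set(heapq.nlargest(k, range(len(vec)), key=lambda i: abs(vec[i])))
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


class TopKWorker:
    """One worker holding a local residual for error compensation (hypothetical)."""

    def __init__(self, dim, k):
        self.k = k
        self.residual = [0.0] * dim

    def compress(self, grad):
        # Add back what was dropped in earlier steps.
        acc = [g + r for g, r in zip(grad, self.residual)]
        sparse = top_k_sparsify(acc, self.k)
        # Whatever is not sent now is remembered for the next step.
        self.residual = [a - s for a, s in zip(acc, sparse)]
        return sparse


def aggregate(sparse_grads):
    """Average the sparsified gradients from all workers, entry by entry."""
    n = len(sparse_grads)
    return [sum(col) / n for col in zip(*sparse_grads)]


if __name__ == "__main__":
    worker = TopKWorker(dim=4, k=1)
    print(worker.compress([0.5, -2.0, 0.25, 1.0]))
    print(worker.compress([0.0, 0.0, 0.0, 0.5]))
```

In the usage above, the entry at index 3 is dropped in the first step but is sent in the second once its accumulated value becomes the largest, which is the effect error compensation is meant to provide. A real system would transmit only the selected indices and values rather than a dense list.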
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and ELM
Methods: Gradient Sparsification
