S2 Reducer: High-Performance Sparse Communication to Accelerate Distributed Deep Learning

Keshi Ge; Yongquan Fu; Zhiquan Lai; Xiaoge Deng; Dongsheng Li

arXiv:2110.02140 · cs.DC · November 29, 2021 · 1 cites

TL;DR

The paper introduces S2 Reducer, a sketch-based method that significantly reduces sparse gradient communication in distributed deep learning, maintaining accuracy while accelerating training.

Contribution

S2 Reducer is a novel sparse gradient aggregation technique that efficiently compresses non-zero gradients with convergence guarantees, improving communication efficiency in distributed SGD.

Findings

01 Reduces 81% of sparse communication overhead
02 Achieves 1.8× speedup in training
03 Maintains the same accuracy as existing methods

Abstract

The distributed stochastic gradient descent (SGD) approach has been widely used in large-scale deep learning, and the gradient collective method is vital to ensuring the training scalability of a distributed deep learning system. Collective communication such as AllReduce has been widely adopted in the distributed SGD process to reduce communication time. However, AllReduce consumes substantial bandwidth even though gradients are often sparse: many gradient values are zero and should be compressed efficiently to save bandwidth. To reduce the sparse gradient communication overhead, we propose Sparse-Sketch Reducer (S2 Reducer), a novel sketch-based sparse gradient aggregation method with convergence guarantees. S2 Reducer reduces the communication cost by compressing only the non-zero gradients with count-sketch and bitmap, and enables the efficient AllReduce operators…
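The abstract is truncated, so the exact S2 Reducer algorithm is not shown here. As a rough illustration of the general idea it names (a bitmap marking non-zero positions plus a count-sketch of the non-zero values), the following is a minimal single-process sketch in Python with NumPy. The function names `make_hashes`, `compress`, `aggregate`, and `decompress`, the explicit hash tables, and the median-based recovery are all assumptions made for this example, not the authors' implementation. The point it tries to show is that a count-sketch is linear, so sketches from different workers can be summed element-wise, which is the kind of operation AllReduce performs.

```python
import numpy as np

def make_hashes(dim, rows, cols, seed=0):
    # Hypothetical helper: every worker builds the same tables from a shared seed.
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, cols, size=(rows, dim))            # bucket per (row, index)
    signs = rng.choice(np.array([-1.0, 1.0]), size=(rows, dim))  # random sign per (row, index)
    return buckets, signs

def compress(grad, buckets, signs, cols):
    # Bitmap records where the non-zeros are; only non-zero values enter the sketch.
    nz = np.flatnonzero(grad)
    bitmap = np.zeros(grad.size, dtype=bool)
    bitmap[nz] = True
    rows = buckets.shape[0]
    sketch = np.zeros((rows, cols))
    for j in range(rows):
        # np.add.at accumulates correctly when several indices share a bucket.
        np.add.at(sketch[j], buckets[j, nz], signs[j, nz] * grad[nz])
    return np.packbits(bitmap), sketch

def aggregate(compressed):
    # Stand-in for the collective step: OR the bitmaps, sum the sketches.
    bitmaps, sketches = zip(*compressed)
    return np.bitwise_or.reduce(np.stack(bitmaps)), np.sum(sketches, axis=0)

def decompress(packed, sketch, buckets, signs, dim):
    # Estimate each flagged coordinate as the median of its per-row estimates.
    bitmap = np.unpackbits(packed, count=dim).astype(bool)
    nz = np.flatnonzero(bitmap)
    rows = buckets.shape[0]
    est = np.stack([signs[j, nz] * sketch[j, buckets[j, nz]] for j in range(rows)])
    out = np.zeros(dim)
    out[nz] = np.median(est, axis=0)
    return out
```

In this sketch each worker would call `compress` on its local gradient, the results would be combined with `aggregate`, and every worker would call `decompress` on the combined pair. Recovery is approximate whenever non-zero coordinates collide in a bucket; more rows or columns reduce that error at the cost of a larger message.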

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Advanced Neural Network Applications

Methods: Stochastic Gradient Descent