S2 Reducer: High-Performance Sparse Communication to Accelerate Distributed Deep Learning
Keshi Ge, Yongquan Fu, Zhiquan Lai, Xiaoge Deng, Dongsheng Li

TL;DR
The paper introduces S2 Reducer, a sketch-based method that cuts sparse gradient communication overhead in distributed deep learning by 81% and speeds up training by 1.8× while maintaining accuracy.
Contribution
S2 Reducer is a novel sparse gradient aggregation technique that compresses only the non-zero gradients, using a count-sketch and a bitmap, and comes with convergence guarantees. The compressed form remains compatible with AllReduce, improving communication efficiency in distributed SGD.
Findings
Reduces sparse communication overhead by 81%
Achieves a 1.8× training speedup
Maintains the same accuracy as existing methods
Abstract
The distributed stochastic gradient descent (SGD) approach has been widely used in large-scale deep learning, and the gradient collective method is vital to ensuring the training scalability of a distributed deep learning system. Collective communication such as AllReduce has been widely adopted in distributed SGD to reduce communication time. However, AllReduce consumes substantial bandwidth even though gradients are often sparse: many gradient values are zero and should be compressed efficiently to save bandwidth. To reduce the sparse gradient communication overhead, we propose Sparse-Sketch Reducer (S2 Reducer), a novel sketch-based sparse gradient aggregation method with convergence guarantees. S2 Reducer reduces the communication cost by compressing only the non-zero gradients with a count-sketch and a bitmap, and enables the efficient AllReduce operators…
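The idea in the abstract (a bitmap marking non-zero positions plus a count-sketch of the non-zero values, both of which can be combined across workers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_hashes`, `compress`, `aggregate`, `decompress`), the table-based hashes, the median estimator, and all sizes are assumptions chosen for illustration, and `aggregate` is a plain-Python stand-in for an AllReduce call.

```python
import random
import statistics

def make_hashes(dim, rows, cols, seed=0):
    """Shared bucket and sign tables (hypothetical helper).
    Every worker must build them from the same seed."""
    rng = random.Random(seed)
    buckets = [[rng.randrange(cols) for _ in range(dim)] for _ in range(rows)]
    signs = [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(rows)]
    return buckets, signs

def compress(grad, buckets, signs, cols):
    """Return (bitmap, sketch) for one worker's sparse gradient.
    Only non-zero entries are added to the count-sketch."""
    bitmap = [1 if g != 0.0 else 0 for g in grad]
    sketch = [[0.0] * cols for _ in buckets]
    for i, g in enumerate(grad):
        if g == 0.0:
            continue
        for r in range(len(buckets)):
            sketch[r][buckets[r][i]] += signs[r][i] * g
    return bitmap, sketch

def aggregate(parts):
    """Stand-in for AllReduce: OR the bitmaps, sum the sketches elementwise.
    Summation works because a count-sketch is linear in its input."""
    bitmaps, sketches = zip(*parts)
    bitmap = [int(any(bits)) for bits in zip(*bitmaps)]
    sketch = [[sum(vals) for vals in zip(*rows)] for rows in zip(*sketches)]
    return bitmap, sketch

def decompress(bitmap, sketch, buckets, signs):
    """Estimate each flagged coordinate as the median over sketch rows.
    Coordinates with a 0 bit are returned as exact zeros."""
    out = [0.0] * len(bitmap)
    for i, bit in enumerate(bitmap):
        if bit:
            ests = [signs[r][i] * sketch[r][buckets[r][i]]
                    for r in range(len(sketch))]
            out[i] = statistics.median(ests)
    return out

if __name__ == "__main__":
    dim, rows, cols = 16, 3, 8
    buckets, signs = make_hashes(dim, rows, cols)
    g0 = [0.0] * dim
    g1 = [0.0] * dim
    g0[2], g1[2], g1[9] = 0.5, 1.5, -1.0
    parts = [compress(g, buckets, signs, cols) for g in (g0, g1)]
    bitmap, sketch = aggregate(parts)
    print(decompress(bitmap, sketch, buckets, signs))
```

In this toy setup each worker sends `dim` bits plus `rows * cols` floats instead of `dim` floats. Recovered values are estimates: when two non-zero coordinates hash to the same bucket in most rows, their values interfere, so the sketch size trades accuracy for bandwidth.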
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Advanced Neural Network Applications
Methods: Stochastic Gradient Descent
