Downlink Compression Improves TopK Sparsification
William Zou, Hans De Sterck, Jun Liu

TL;DR
This paper demonstrates that applying topK gradient compression in both the uplink and downlink directions during distributed training can reduce communication costs and potentially improve convergence, challenging the prevailing belief that extra compression degrades model convergence.
Contribution
The authors extend the convergence analysis of topK SGD to bidirectional compression and empirically show its benefits over unidirectional compression in distributed neural network training.
Findings
Bidirectional topK SGD reduces communication overhead.
Bidirectional compression can improve convergence bounds.
Models trained with bidirectional topK SGD perform as well as unidirectional methods.
Abstract
Training large neural networks is time-consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to the largest K components before sending it to other nodes. While some authors have investigated topK gradient sparsification in the parameter-server framework by applying topK compression in both the worker-to-server (uplink) and server-to-worker (downlink) directions, the currently accepted belief is that adding extra compression degrades the convergence of the model. We demonstrate, on the contrary, that adding downlink compression can potentially improve the performance of topK sparsification: not…
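To make the uplink and downlink terminology concrete, here is a minimal sketch of one communication round in a parameter-server setup. It is an illustration under stated assumptions, not the authors' algorithm: the function names `top_k` and `bidirectional_round` are hypothetical, "largest K components" is assumed to mean largest in magnitude, the server is assumed to average the worker gradients, and any error-feedback or memory mechanism a full topK SGD method might use is omitted.

```python
import numpy as np


def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest.

    Assumption: "largest" is taken to mean largest in absolute value.
    """
    out = np.zeros_like(vec)
    if k <= 0:
        return out
    k = min(k, vec.size)
    # argpartition places the indices of the k largest magnitudes last.
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out[idx] = vec[idx]
    return out


def bidirectional_round(worker_grads, k):
    """One round of a hypothetical bidirectional topK exchange.

    Returns the server aggregate (what uplink-only compression would
    broadcast) and its topK truncation (what downlink compression sends).
    """
    # Uplink: each worker sparsifies its gradient before sending it.
    uplink = [top_k(g, k) for g in worker_grads]
    # Server: average the sparse gradients (assumed aggregation rule).
    aggregate = np.mean(uplink, axis=0)
    # Downlink: the server sparsifies the aggregate before broadcasting.
    downlink = top_k(aggregate, k)
    return aggregate, downlink


g1 = np.array([0.1, -3.0, 0.5, 2.0])
g2 = np.array([1.0, 0.2, -4.0, 0.3])
aggregate, downlink = bidirectional_round([g1, g2], k=2)
# Nonzero entries broadcast without and with downlink compression.
print(np.count_nonzero(aggregate), np.count_nonzero(downlink))  # prints "4 2"
```

In this toy example the two workers select disjoint coordinates, so the averaged gradient is fully dense even though each uplink message carries only K = 2 entries. Truncating the aggregate again restores a K-sparse downlink message, which is the communication saving that bidirectional compression targets.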
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Gradient Sparsification · Stochastic Gradient Descent
