Downlink Compression Improves TopK Sparsification

William Zou; Hans De Sterck; Jun Liu

arXiv:2209.15203 · cs.LG · October 3, 2022

TL;DR

This paper demonstrates that applying topK gradient compression in both directions (worker-to-server and server-to-worker) during distributed training can reduce communication costs and potentially improve convergence, challenging the accepted belief that extra compression degrades convergence.

Contribution

The authors extend the convergence analysis of topK SGD to bidirectional compression and empirically show its benefits over unidirectional compression in distributed neural network training.

Findings

01. Bidirectional topK SGD reduces communication overhead.

02. Bidirectional compression can improve convergence bounds.

03. Models trained with bidirectional topK SGD perform as well as unidirectional methods.

Abstract

Training large neural networks is time-consuming. To speed up the process, distributed training is often used. One of the largest bottlenecks in distributed training is communicating gradients across different nodes. Different gradient compression techniques have been proposed to alleviate the communication bottleneck, including topK gradient sparsification, which truncates the gradient to the largest K components before sending it to other nodes. While some authors have investigated topK gradient sparsification in the parameter-server framework by applying topK compression in both the worker-to-server (uplink) and server-to-worker (downlink) directions, the currently accepted belief is that adding extra compression degrades the convergence of the model. We demonstrate, on the contrary, that adding downlink compression can potentially improve the performance of topK sparsification: not…
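To make the uplink and downlink steps in the abstract concrete, here is a minimal Python sketch of one communication round in a parameter-server setup. It is an illustration under stated assumptions, not the authors' algorithm: "largest K components" is assumed to mean largest in absolute value, the server is assumed to average the workers' sparse gradients, and the names `top_k`, `bidirectional_round`, `k_up`, and `k_down` are hypothetical. Any further mechanisms the paper may use are omitted.

```python
from typing import List


def top_k(vec: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    if k <= 0:
        return [0.0] * len(vec)
    if k >= len(vec):
        return list(vec)
    # Indices ordered by decreasing absolute value; ties go to the lower index.
    keep = sorted(range(len(vec)), key=lambda i: (-abs(vec[i]), i))[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


def bidirectional_round(
    worker_grads: List[List[float]], k_up: int, k_down: int
) -> List[float]:
    """One round with topK applied on both the uplink and the downlink."""
    # Uplink: each worker sparsifies its own gradient before sending it.
    sent = [top_k(g, k_up) for g in worker_grads]
    # Server (assumed behavior): average the sparse gradients. The average
    # can have more than k_up nonzeros because workers keep different entries.
    n = len(sent)
    avg = [sum(col) / n for col in zip(*sent)]
    # Downlink: the server sparsifies the aggregate before broadcasting it.
    return top_k(avg, k_down)


# Two hypothetical workers with 4-dimensional gradients.
grads = [[0.5, -2.0, 0.1, 1.0], [3.0, 0.2, -0.4, 2.0]]
update = bidirectional_round(grads, k_up=2, k_down=2)
print(update)  # [1.5, 0.0, 0.0, 1.5]
```

In this toy example the averaged gradient has three nonzero entries, so without downlink compression the server would broadcast more values than any single worker sent. Applying topK again on the downlink caps the broadcast at `k_down` entries.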

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Gradient Sparsification · Stochastic Gradient Descent