Variance-based Gradient Compression for Efficient Distributed Deep Learning
Yusuke Tsuzuku, Hiroto Imachi, Takuya Akiba

TL;DR
This paper introduces a variance-based gradient compression method that significantly reduces communication overhead in distributed deep learning, maintaining accuracy and enabling efficient training in low-bandwidth environments.
Contribution
The paper proposes a novel gradient compression technique based on gradient variance, achieving high compression ratios without sacrificing model accuracy.
Findings
High compression ratios achieved
Maintains model accuracy
Enables efficient distributed training in low-bandwidth settings
Abstract
Due to the substantial computational cost, training state-of-the-art deep neural networks for large-scale datasets often requires distributed training using multiple computation workers. However, by nature, workers need to frequently communicate gradients, causing severe bottlenecks, especially on lower bandwidth connections. A few methods have been proposed to compress gradients for efficient communication, but they either suffer from a low compression ratio or significantly harm the resulting model accuracy, particularly when applied to convolutional neural networks. To address these issues, we propose a method to reduce the communication overhead of distributed deep learning. Our key observation is that gradient updates can be delayed until an unambiguous (high amplitude, low variance) gradient has been calculated. We also present an efficient algorithm to compute the variance with…
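The key observation in the abstract (hold back a gradient element until it is high in amplitude and low in variance) can be illustrated with a minimal sketch. This is not the authors' algorithm: the class name `DelayedGradientBuffer`, the helper `select_unambiguous`, the threshold `alpha`, and the test "squared accumulated gradient exceeds `alpha` times the accumulated squared gradients" are all assumptions chosen for illustration, and the abstract's variance computation is truncated and not reproduced here.

```python
def select_unambiguous(grad_sum, grad_sq_sum, alpha=1.0):
    """Return indices whose accumulated gradient is large relative to its spread.

    Assumed criterion (hypothetical): sum(g)^2 > alpha * sum(g^2).
    """
    return [
        i
        for i, (r, v) in enumerate(zip(grad_sum, grad_sq_sum))
        if r * r > alpha * v
    ]


class DelayedGradientBuffer:
    """Accumulates per-element gradients and releases only unambiguous ones."""

    def __init__(self, size, alpha=1.0):
        self.alpha = alpha                 # hypothetical ambiguity threshold
        self.grad_sum = [0.0] * size       # running sum of delayed gradients
        self.grad_sq_sum = [0.0] * size    # running sum of squared gradients

    def step(self, grad):
        """Add one local gradient; return {index: value} for elements to send."""
        for i, g in enumerate(grad):
            self.grad_sum[i] += g
            self.grad_sq_sum[i] += g * g

        sent = {}
        for i in select_unambiguous(self.grad_sum, self.grad_sq_sum, self.alpha):
            sent[i] = self.grad_sum[i]
            # Reset statistics for elements that were communicated.
            self.grad_sum[i] = 0.0
            self.grad_sq_sum[i] = 0.0
        return sent


if __name__ == "__main__":
    buf = DelayedGradientBuffer(size=2, alpha=1.0)
    print(buf.step([1.0, 1.0]))   # a single sample is still ambiguous
    print(buf.step([1.0, -1.0]))  # element 0 agrees in sign, element 1 cancels
```

In this sketch an element whose gradients keep the same sign across steps is released quickly, while one whose gradients cancel stays in the buffer, so only a sparse set of (index, value) pairs would need to be communicated per step.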
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
