Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Yujun Lin; Song Han; Huizi Mao; Yu Wang; William J. Dally

arXiv:1712.01887·cs.CV·June 24, 2020·645 cites



TL;DR

Deep Gradient Compression significantly reduces communication bandwidth in distributed training by eliminating redundant gradients, enabling scalable training on commodity networks and mobile devices without sacrificing accuracy.

Contribution

The paper introduces Deep Gradient Compression, a novel method that reduces gradient communication by 270x to 600x while maintaining model accuracy in distributed training.

Findings

01 Achieves up to 600x gradient compression ratio

02 Enables training on commodity Ethernet and mobile devices

03 Maintains accuracy across multiple datasets and models

Abstract

Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet,…
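The core idea in the abstract can be illustrated with a minimal sketch: each worker sends only the largest-magnitude entries of its locally accumulated gradient and keeps the rest for later steps. The function name `sparsify_step`, its parameters, and the plain-list representation are assumptions made for illustration, not the authors' implementation. The sketch covers momentum correction and momentum factor masking as they are commonly described; local gradient clipping and warm-up training are left out.

```python
def sparsify_step(grad, velocity, accum, momentum=0.9, sparsity=0.999):
    """One worker-side step of top-k gradient sparsification (illustrative).

    grad     -- this step's local gradient, as a list of floats
    velocity -- per-parameter momentum buffer, updated in place
    accum    -- per-parameter accumulator of unsent updates, updated in place
    Returns a dict {index: value} holding the entries that would be sent.
    """
    n = len(grad)
    for i in range(n):
        # Momentum correction: accumulate the velocity, not the raw gradient.
        velocity[i] = momentum * velocity[i] + grad[i]
        accum[i] += velocity[i]

    # Keep roughly the top (1 - sparsity) fraction of entries, at least one.
    k = max(1, round(n * (1 - sparsity)))
    top = sorted(range(n), key=lambda i: abs(accum[i]), reverse=True)[:k]

    sparse = {}
    for i in top:
        sparse[i] = accum[i]
        accum[i] = 0.0      # this entry has been sent, so clear it
        velocity[i] = 0.0   # momentum factor masking: drop stale momentum
    return sparse


# Hypothetical 4-parameter model; sparsity=0.75 sends one entry per step.
velocity = [0.0] * 4
accum = [0.0] * 4
first = sparsify_step([0.1, -5.0, 0.2, 0.3], velocity, accum, sparsity=0.75)
second = sparsify_step([0.1, 0.0, 0.2, 4.0], velocity, accum, sparsity=0.75)
print(first, second)
```

Entries that are not sent stay in `accum`, so small gradients are delayed rather than discarded. In a real system the returned index/value pairs would be exchanged between workers in place of the dense gradient.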


Taxonomy

Topics: Advanced Neural Network Applications · Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Stochastic Gradient Descent