Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally

TL;DR
Deep Gradient Compression significantly reduces communication bandwidth in distributed training by eliminating redundant gradients, enabling scalable training on commodity networks and mobile devices without sacrificing accuracy.
Contribution
The paper introduces Deep Gradient Compression, a novel method that reduces gradient communication by 270x to 600x while maintaining model accuracy in distributed training.
Findings
Achieves up to 600x gradient compression ratio
Enables training on commodity Ethernet and mobile devices
Maintains accuracy across multiple datasets and models
Abstract
Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet,…
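The core idea behind the redundancy claim can be illustrated with a minimal sketch of magnitude-based gradient sparsification with local accumulation: each worker sends only the largest-magnitude entries and keeps the rest in a local residual, so they are delayed rather than discarded. This is an illustration under stated assumptions, not the authors' implementation. The function name `sparsify_with_accumulation` and the `keep_ratio` parameter are hypothetical, gradients are plain Python lists, and the four accuracy-preserving methods named in the abstract (momentum correction, local gradient clipping, momentum factor masking, warm-up training) are deliberately left out.

```python
def sparsify_with_accumulation(grad, residual, keep_ratio=0.001):
    """Hypothetical helper: pick the largest-magnitude accumulated gradients.

    grad       -- this step's local gradient (list of floats)
    residual   -- entries held back from earlier steps (same length)
    keep_ratio -- fraction of entries to send; 0.001 mirrors the 99.9%
                  redundancy figure quoted in the abstract (assumption)
    Returns (sparse_update, new_residual), where sparse_update maps
    index -> value for the entries that would be communicated.
    """
    # Add the new gradient to whatever was held back previously.
    acc = [r + g for r, g in zip(residual, grad)]

    # Always send at least one entry.
    k = max(1, int(len(acc) * keep_ratio))

    # Indices of the k largest-magnitude accumulated entries.
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    sparse_update = {i: acc[i] for i in top}

    # Sent entries are cleared; everything else stays in the residual.
    new_residual = [0.0 if i in sparse_update else v for i, v in enumerate(acc)]
    return sparse_update, new_residual


# Two steps on a toy 4-element gradient, sending one entry per step.
residual = [0.0, 0.0, 0.0, 0.0]
update1, residual = sparsify_with_accumulation(
    [0.5, -0.1, 0.05, 0.2], residual, keep_ratio=0.25)
update2, residual = sparsify_with_accumulation(
    [0.0, -0.1, 0.0, 0.25], residual, keep_ratio=0.25)
```

In the first step only index 0 is sent. In the second step index 3 is sent with its accumulated value, the sum of the 0.2 held back earlier and the new 0.25, which shows that small gradients still reach the other workers once they have grown large enough.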
Code & Models
- synxlin/deep-gradient-compression (PyTorch, official)
- PaddlePaddle/FleetX/blob/develop/examples/resnet/train_fleet_static_dgc.py (Paddle)
- MindSpore-scientific-2/code-5/tree/main/Deep-Gradient-Compression (MindSpore)
- limberc/deep-gradient-compression (PyTorch)
- MindSpore-scientific/code-12/tree/main/Deep-Gradient-Compression (MindSpore)
Taxonomy
Topics: Advanced Neural Network Applications · Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Stochastic Gradient Descent
