Compressed Communication for Distributed Training: Adaptive Methods and System
Yuchen Zhong, Cong Xie, Shuai Zheng, Haibin Lin

TL;DR
This paper introduces a new adaptive gradient method with gradient compression and a scalable system, BytePS-Compress, significantly reducing communication overhead and training time in distributed machine learning without accuracy loss.
Contribution
The paper proposes a novel adaptive gradient method with gradient compression and develops BytePS-Compress, a system enabling efficient two-way gradient compression in distributed training.
Findings
Improved training times for ResNet50, VGG16, and BERT-base by up to 58.1%.
Achieved a 333x compression rate for BERT training.
Maintained model accuracy despite high compression rates.
Abstract
Communication overhead severely hinders the scalability of distributed machine learning systems. Recently, there has been a growing interest in using gradient compression to reduce the communication overhead of the distributed training. However, there is little understanding of applying gradient compression to adaptive gradient methods. Moreover, its performance benefits are often limited by the non-negligible compression overhead. In this paper, we first introduce a novel adaptive gradient method with gradient compression. We show that the proposed method has a convergence rate of O(1/√T) for non-convex problems. In addition, we develop a scalable system called BytePS-Compress for two-way compression, where the gradients are compressed in both directions between workers and parameter servers. BytePS-Compress pipelines the compression and decompression on CPUs and…
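To make "two-way compression" concrete, the following is a minimal single-process sketch of one push/pull round between workers and a parameter server. It is an illustration under stated assumptions, not the paper's algorithm or the BytePS-Compress API: the top-k sparsifier, the error-feedback residual, and every name here (topk_compress, ErrorFeedback, two_way_round) are hypothetical choices made for the example.

```python
from typing import Dict, List

Sparse = Dict[int, float]  # index -> value for the coordinates that are sent


def topk_compress(vec: List[float], k: int) -> Sparse:
    """Keep the k largest-magnitude entries (assumed compressor, for illustration)."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    return {i: vec[i] for i in order[:k]}


def decompress(sparse: Sparse, size: int) -> List[float]:
    """Expand a sparse message back into a dense vector, zero elsewhere."""
    out = [0.0] * size
    for i, v in sparse.items():
        out[i] = v
    return out


class ErrorFeedback:
    """Carries whatever compression dropped into the next round (assumed technique)."""

    def __init__(self, size: int, k: int) -> None:
        self.residual = [0.0] * size
        self.k = k

    def compress(self, vec: List[float]) -> Sparse:
        corrected = [v + r for v, r in zip(vec, self.residual)]
        sparse = topk_compress(corrected, self.k)
        sent = decompress(sparse, len(vec))
        self.residual = [c - s for c, s in zip(corrected, sent)]
        return sparse


def two_way_round(grads: List[List[float]],
                  workers: List[ErrorFeedback],
                  server: ErrorFeedback) -> List[float]:
    """One round: compressed push to the server, then compressed pull back."""
    size = len(grads[0])
    # Push: each worker compresses its own gradient before sending it.
    pushed = [w.compress(g) for w, g in zip(workers, grads)]
    # Server: decompress by accumulating the sparse messages, then average.
    total = [0.0] * size
    for message in pushed:
        for i, v in message.items():
            total[i] += v
    mean = [t / len(grads) for t in total]
    # Pull: the server compresses the aggregate again before broadcasting.
    return decompress(server.compress(mean), size)


# Toy example: two workers, four parameters, one coordinate kept per message.
workers = [ErrorFeedback(4, 1), ErrorFeedback(4, 1)]
server = ErrorFeedback(4, 1)
grads = [[0.5, -2.0, 0.25, 0.0], [1.5, -1.0, 0.0, 0.5]]
update = two_way_round(grads, workers, server)
print(update)
```

In this toy round each message carries one coordinate instead of four in both directions, and the dropped values stay in the residual buffers so they can be sent in later rounds. The CPU pipelining that the abstract attributes to BytePS-Compress is not modeled here.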
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Linear Warmup With Linear Decay · Multi-Head Attention · Weight Decay · Attention Dropout · WordPiece · Dropout
