Compressed Communication for Distributed Training: Adaptive Methods and System

Yuchen Zhong; Cong Xie; Shuai Zheng; Haibin Lin

arXiv:2105.07829 · cs.DC · May 19, 2021 · 1 citation

TL;DR

This paper introduces a new adaptive gradient method with gradient compression and a scalable system, BytePS-Compress, significantly reducing communication overhead and training time in distributed machine learning without accuracy loss.

Contribution

The paper proposes a novel adaptive gradient method with gradient compression and develops BytePS-Compress, a system enabling efficient two-way gradient compression in distributed training.

Findings

01. Reduced training time for ResNet50, VGG16, and BERT-base by up to 58.1%.

02. Achieved a 333x compression rate for BERT training.

03. Maintained model accuracy despite high compression rates.

Abstract

Communication overhead severely hinders the scalability of distributed machine learning systems. Recently, there has been growing interest in using gradient compression to reduce the communication overhead of distributed training. However, there is little understanding of how to apply gradient compression to adaptive gradient methods. Moreover, its performance benefits are often limited by the non-negligible compression overhead. In this paper, we first introduce a novel adaptive gradient method with gradient compression. We show that the proposed method has a convergence rate of $O(1/\sqrt{T})$ for non-convex problems. In addition, we develop a scalable system called BytePS-Compress for two-way compression, where the gradients are compressed in both directions between workers and parameter servers. BytePS-Compress pipelines the compression and decompression on CPUs and…
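
To make the "two-way compression" idea in the abstract concrete, here is a minimal sketch of one push/pull round between workers and a parameter server. It is an illustration under assumptions, not the BytePS-Compress API: the abstract does not name a compressor, so top-k sparsification is assumed here, and the helper names (`topk_compress`, `decompress`, `two_way_round`) are hypothetical. The sketch also omits the adaptive optimizer and the CPU pipelining the abstract mentions.

```python
def topk_compress(vec, k):
    """Keep the k largest-magnitude entries as sorted (index, value) pairs.

    Top-k is an assumed compressor chosen for illustration only.
    """
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return sorted((i, vec[i]) for i in idx)


def decompress(pairs, dim):
    """Expand (index, value) pairs back into a dense vector of length dim."""
    out = [0.0] * dim
    for i, v in pairs:
        out[i] = v
    return out


def two_way_round(worker_grads, k):
    """One communication round with compression in both directions."""
    dim = len(worker_grads[0])

    # Push: every worker compresses its gradient before sending it.
    pushed = [topk_compress(g, k) for g in worker_grads]

    # Server: decompress each message and average the results.
    total = [0.0] * dim
    for pairs in pushed:
        for i, v in enumerate(decompress(pairs, dim)):
            total[i] += v
    avg = [t / len(worker_grads) for t in total]

    # Pull: the server compresses the aggregate again before replying.
    pulled = topk_compress(avg, k)

    # Workers decompress the reply and would apply it in their optimizer step.
    return decompress(pulled, dim), pushed, pulled


# Two workers, four-dimensional gradients, two entries kept per message.
grads = [[0.1, -2.0, 0.3, 4.0], [1.0, -0.2, 0.5, -3.5]]
update, pushed, pulled = two_way_round(grads, k=2)
```

In this toy setting each message carries 2 of 4 entries in either direction. Compressing the pull direction as well matters because the averaged gradient is generally denser than any single worker's sparse message.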

Code & Models

Repositories

vycezhong/byteps-compress (PyTorch, official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Linear Warmup With Linear Decay · Multi-Head Attention · Weight Decay · Attention Dropout · WordPiece · Dropout