MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training

Zhuang Wang; Xinyu Wu; T.S. Eugene Ng

arXiv:2103.15195·cs.DC·March 30, 2021

TL;DR

MergeComp is a compression scheduler that automatically schedules gradient compression operations in distributed training, improving the performance of compression algorithms by up to 3.83x without losing accuracy.

Contribution

The paper introduces MergeComp, a scheduler that automatically schedules gradient compression operations without knowledge of model architectures or system parameters, and applies it to nine popular compression algorithms.

Findings

01 Improves the performance of compression algorithms by up to 3.83x without losing accuracy.

02 Achieves 99% scaling efficiency on high-speed networks.

03 Applied to nine popular compression algorithms.

Abstract

Large-scale distributed training is increasingly becoming communication bound. Many gradient compression algorithms have been proposed to reduce the communication overhead and improve scalability. However, it has been observed that in some cases gradient compression may even harm the performance of distributed training. In this paper, we propose MergeComp, a compression scheduler to optimize the scalability of communication-efficient distributed training. It automatically schedules the compression operations to optimize the performance of compression algorithms without the knowledge of model architectures or system parameters. We have applied MergeComp to nine popular compression algorithms. Our evaluations show that MergeComp can improve the performance of compression algorithms by up to 3.83x without losing accuracy. It can even achieve a scaling factor of distributed training up to…
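The observation that gradient compression can hurt performance, and that scheduling the compression operations can help, can be illustrated with a toy cost model. The sketch below is not taken from the paper or its repository. It assumes each compression call pays a fixed overhead plus a cost linear in the number of elements, and that consecutive gradient tensors can be merged into groups before compression. The function names, the cost constants, and the fixed-size grouping are all hypothetical.

```python
from typing import List, Sequence


def compress_cost(num_elements: int,
                  fixed_overhead: float = 1.0,
                  per_element: float = 0.001) -> float:
    """Assumed cost of one compression call: fixed overhead plus linear work.

    Both constants are illustrative, not measured values.
    """
    return fixed_overhead + per_element * num_elements


def total_cost(tensor_sizes: Sequence[int], group_size: int) -> float:
    """Total cost when consecutive tensors are merged into groups of
    `group_size` and each merged group is compressed with a single call."""
    cost = 0.0
    for start in range(0, len(tensor_sizes), group_size):
        merged_elements = sum(tensor_sizes[start:start + group_size])
        cost += compress_cost(merged_elements)
    return cost


def best_group_size(tensor_sizes: Sequence[int],
                    candidates: Sequence[int]) -> int:
    """Pick the candidate group size with the lowest modeled cost."""
    return min(candidates, key=lambda g: total_cost(tensor_sizes, g))


if __name__ == "__main__":
    sizes: List[int] = [1000] * 8  # eight small gradient tensors
    for g in (1, 2, 4, 8):
        print(f"group_size={g}: modeled cost={total_cost(sizes, g):.1f}")
    print("best:", best_group_size(sizes, (1, 2, 4, 8)))
```

Under this model the fixed overhead is paid once per group, so fewer and larger groups always look cheaper. The model deliberately ignores the overlap between communication and backpropagation, which a real scheduler would have to weigh against the per-call overhead.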

Code & Models

Repositories

Crystal-wxy/mergeComp (PyTorch, official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Real-Time Systems Scheduling