MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training
Zhuang Wang, Xinyu Wu, T.S. Eugene Ng

TL;DR
MergeComp is a compression scheduler that automatically schedules gradient compression operations in distributed training, improving the performance of compression algorithms by up to 3.83x without losing accuracy.
Contribution
The paper introduces MergeComp, a scheduler that automatically schedules gradient compression operations to improve the scalability of distributed training, without requiring knowledge of model architectures or system parameters.
Findings
MergeComp improves the performance of compression algorithms by up to 3.83x without losing accuracy.
Achieves 99% scaling efficiency on high-speed networks.
Applied to nine popular compression algorithms.
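For readers unfamiliar with the metric, scaling efficiency is conventionally the measured multi-worker throughput divided by the ideal linear throughput (single-worker throughput times the number of workers). The sketch below assumes that conventional definition, which may differ in detail from the one used in the paper. The function name and the sample throughput numbers are hypothetical and are not taken from the paper's results.

```python
def scaling_efficiency(single_worker_throughput: float,
                       multi_worker_throughput: float,
                       num_workers: int) -> float:
    """Return the fraction of ideal linear speedup that was achieved.

    Assumes the conventional definition:
        efficiency = T_n / (n * T_1)
    where T_1 is single-worker throughput and T_n is the aggregate
    throughput measured with n workers.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if single_worker_throughput <= 0:
        raise ValueError("single_worker_throughput must be positive")
    ideal_throughput = single_worker_throughput * num_workers
    return multi_worker_throughput / ideal_throughput


# Hypothetical numbers for illustration only (not from the paper):
# 100 samples/s on one worker, 792 samples/s in aggregate on 8 workers.
eff = scaling_efficiency(100.0, 792.0, 8)
print(f"scaling efficiency: {eff:.0%}")
```

An efficiency near 100% means adding workers yields almost linear speedup; communication overhead is what typically pulls the value below that.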
Abstract
Large-scale distributed training is increasingly becoming communication bound. Many gradient compression algorithms have been proposed to reduce the communication overhead and improve scalability. However, it has been observed that in some cases gradient compression may even harm the performance of distributed training. In this paper, we propose MergeComp, a compression scheduler to optimize the scalability of communication-efficient distributed training. It automatically schedules the compression operations to optimize the performance of compression algorithms without the knowledge of model architectures or system parameters. We have applied MergeComp to nine popular compression algorithms. Our evaluations show that MergeComp can improve the performance of compression algorithms by up to 3.83x without losing accuracy. It can even achieve a scaling factor of distributed training up to…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Real-Time Systems Scheduling
