ZCCL: Significantly Improving Collective Communication With Error-Bounded Lossy Compression
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Khalid Alharthi, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

TL;DR
ZCCL introduces error-bounded lossy compression to MPI collective communication, significantly reducing message sizes and communication costs while maintaining data accuracy, thus improving large-scale scientific application performance.
Contribution
The paper presents a novel framework and compressor for error-bounded lossy compression in MPI collectives, enhancing efficiency and generalizability over prior fixed-rate methods.
Findings
Achieves a 1.9x to 8.9x performance improvement over baseline MPI collectives.
Reduces communication costs significantly while preserving data accuracy.
Effectively integrates into multiple MPI collective operations.
Abstract
With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communication turns out to be a critical bottleneck in large-scale distributed and parallel processing. The large message size in MPI collectives is particularly concerning because it can significantly degrade overall parallel performance. To address this issue, prior research simply applies off-the-shelf fixed-rate lossy compressors in the MPI collectives, leading to suboptimal performance, limited generalizability, and unbounded errors. In this paper, we propose a novel solution, called ZCCL, which leverages error-bounded lossy compression to significantly reduce the message size, resulting in a substantial reduction in communication costs. The key contributions are three-fold. (1) We develop two general, optimized lossy-compression-based…
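To make the contrast with fixed-rate compression concrete, here is a minimal sketch of what "error-bounded" means. It is not ZCCL's compressor or its collective framework: the function names (`compress`, `decompress`, `lossy_allreduce_sum`) are hypothetical, the quantizer is plain uniform quantization, and the "allreduce" is a single-process toy that only models where the loss enters. It assumes an absolute error bound chosen by the user.

```python
# Illustrative only: uniform quantization with an absolute error bound.
# All names here are hypothetical and are not part of ZCCL or MPI.

def compress(values, error_bound):
    """Map each float to an integer bin index; bins are 2*error_bound wide."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]

def decompress(codes, error_bound):
    """Reconstruct each value as the center of its bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]

def max_abs_error(original, restored):
    """Largest pointwise deviation between two equally long sequences."""
    return max(abs(a - b) for a, b in zip(original, restored))

def lossy_allreduce_sum(rank_buffers, error_bound):
    """Toy model of a sum-allreduce where every rank's buffer is
    compressed before 'sending' and decompressed on 'receipt'."""
    restored = [decompress(compress(buf, error_bound), error_bound)
                for buf in rank_buffers]
    return [sum(column) for column in zip(*restored)]
```

Each reconstructed value lies within `error_bound` of its original, so in this toy model the summed result deviates from the exact sum by at most the number of ranks times `error_bound`. A fixed-rate compressor, by contrast, fixes the output size and gives no such guarantee on the error.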
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
