gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

TL;DR
gZCCL is a novel framework that enhances GPU cluster communication efficiency by integrating accuracy-aware compression, significantly outperforming existing solutions while maintaining high data quality.
Contribution
This paper introduces gZCCL, the first general framework for GPU-aware, compression-enabled collectives with error control, improving communication performance while keeping the error introduced by lossy compression under control.
Findings
Up to 4.5X faster collective computation
Up to 28.7X faster data movement
High reconstructed data quality in applications
Abstract
GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. In order to address these issues, in this paper, we propose gZCCL, a first-ever general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate the performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL as well as Cray MPI…
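To make the error-propagation concern concrete, here is a minimal CPU-only sketch of how a per-message error bound accumulates when every rank's contribution to a sum passes through lossy compression once. It is an illustration under stated assumptions, not gZCCL's compressor or collective algorithm: the uniform quantizer and the names `quantize`, `dequantize`, and `compressed_allreduce_sum` are hypothetical stand-ins, and the "ranks" are plain Python lists rather than GPU buffers.

```python
import random


def quantize(values, error_bound):
    """Map each value to an integer bin of width 2 * error_bound (hypothetical compressor)."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Reconstruct values from bin indices; each lies within error_bound of the original."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def compressed_allreduce_sum(rank_buffers, error_bound):
    """Sum buffers from all simulated ranks, compressing each contribution once."""
    total = [0.0] * len(rank_buffers[0])
    for buf in rank_buffers:
        # Simulate "send compressed, decompress on receipt" for this rank's data.
        received = dequantize(quantize(buf, error_bound), error_bound)
        total = [t + r for t, r in zip(total, received)]
    return total


random.seed(0)
num_ranks, length, error_bound = 8, 1000, 1e-3
rank_buffers = [
    [random.uniform(-1.0, 1.0) for _ in range(length)] for _ in range(num_ranks)
]

exact = [sum(column) for column in zip(*rank_buffers)]
approx = compressed_allreduce_sum(rank_buffers, error_bound)

max_error = max(abs(a - e) for a, e in zip(approx, exact))
worst_case = num_ranks * error_bound  # per-message errors can add up across ranks

print(f"max observed error: {max_error:.6f}")
print(f"worst-case bound:   {worst_case:.6f}")
```

In this toy setting the error of the reduced result is bounded by the number of ranks times the per-message bound, which is why a compression-enabled collective has to account for how many lossy steps each value passes through rather than only for the bound of a single compression call.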
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Caching and Content Delivery · Peer-to-Peer Network Technologies
