gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

Jiajun Huang; Sheng Di; Xiaodong Yu; Yujia Zhai; Jinyang Liu; Yafan Huang; Ken Raffenetti; Hui Zhou; Kai Zhao; Xiaoyi Lu; Zizhong Chen; Franck Cappello; Yanfei Guo; Rajeev Thakur

arXiv:2308.05199·cs.DC·May 8, 2024

TL;DR

gZCCL is a framework that speeds up collective communication on GPU clusters by integrating accuracy-aware lossy compression, delivering up to 4.5X faster collective computation and up to 28.7X faster data movement than existing solutions while maintaining high data quality.

Contribution

This paper introduces gZCCL, the first general framework for GPU-aware, compression-enabled collectives with error control, improving performance and data accuracy.

Findings

01 Up to 4.5X faster collective computation

02 Up to 28.7X faster data movement

03 High reconstructed data quality in applications

Abstract

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious issues such as underutilized GPU devices and uncontrolled data distortion. To address these issues, we propose gZCCL, the first general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate the framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL as well as Cray MPI…
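
To make the idea of controlled error propagation concrete, the following is a minimal CPU-only sketch and not gZCCL's actual design or API. It assumes a simple uniform quantizer with an absolute error bound and a sum-Allreduce in which each rank's buffer is compressed exactly once. All names (`quantize`, `dequantize`, `compressed_allreduce_sum`) are hypothetical and chosen for illustration only. Under these assumptions, the error of the reduced result is at most the number of ranks times the per-rank error bound.

```python
import random


def quantize(values, error_bound):
    """Map each float to an integer bin of width 2 * error_bound (toy lossy compressor)."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Reconstruct floats from bin indices; each value is within error_bound of the original."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def compressed_allreduce_sum(rank_buffers, error_bound):
    """Simulate a sum-Allreduce where every rank's buffer is lossily compressed once.

    Assumption: one compression per rank, so per-element error is bounded by
    len(rank_buffers) * error_bound (ignoring floating-point rounding).
    """
    total = [0.0] * len(rank_buffers[0])
    for buf in rank_buffers:
        recon = dequantize(quantize(buf, error_bound), error_bound)
        for i, v in enumerate(recon):
            total[i] += v
    return total


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


random.seed(0)
num_ranks, length, error_bound = 8, 1000, 1e-3
rank_buffers = [
    [random.uniform(-1.0, 1.0) for _ in range(length)] for _ in range(num_ranks)
]

exact = [sum(column) for column in zip(*rank_buffers)]
approx = compressed_allreduce_sum(rank_buffers, error_bound)
worst = max_abs_error(exact, approx)

# The observed error should stay within the worst-case bound for this toy scheme.
print("within bound:", worst <= num_ranks * error_bound + 1e-12)
```

The sketch shows why a naive integration needs care: if intermediate results were recompressed at every step of a multi-step collective, the bound would grow with the number of compression stages rather than staying fixed at one stage per rank.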

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Caching and Content Delivery · Peer-to-Peer Network Technologies