NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
Jiamin Wang, Zhijing Ye, Xiaodong Yu

TL;DR
NCCLZ enhances GPU collective communication by decoupling quantization and entropy coding, enabling flexible, efficient compression that significantly accelerates multi-node GPU workloads.
Contribution
NCCLZ separates quantization and entropy coding into different layers of the communication stack, improving compression ratio, flexibility, and overlap of compression with communication in GPU collectives.
Findings
Achieves up to 9.65x speedup over NCCL.
Provides up to 3.34x improvement over prior compression libraries.
Effectively overlaps compression with communication to reduce overhead.
Abstract
Collective communication is a major bottleneck for multi-node GPU workloads in scientific computing and distributed deep learning, especially when inter-node bandwidth is limited. Although NCCL provides optimized GPU-centric collectives, large messages can still dominate end-to-end performance. Existing compression-enabled collective libraries either rely on MPI-based stacks that cannot fully exploit NCCL, omit entropy coding, or tightly couple full compressors with communication primitives, limiting compression ratio, flexibility, and communication-computation overlap. This paper presents NCCLZ, a compression-enabled GPU collectives library that decouples quantization and entropy coding and integrates them at different layers of the stack. NCCLZ places quantization at the interface, embeds entropy coding into NCCL primitives, uses a lightweight device-side selector to choose coding strategies,…
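The decoupling described in the abstract can be illustrated with a small CPU-side sketch. This is not NCCLZ code and does not use any NCCL API: the function names (`quantize`, `entropy_encode`, `roundtrip`) are hypothetical, uniform error-bounded quantization is assumed as the lossy stage, and zlib's DEFLATE stands in for whatever entropy coder the paper actually uses. It only shows the structure: quantization runs once over the whole buffer, while entropy coding runs separately on each chunk, as a transport-level step might.

```python
import math
import struct
import zlib


def quantize(values, error_bound):
    """Lossy stage (assumed): map each float to an integer bin of width 2*error_bound."""
    step = 2.0 * error_bound
    return [int(round(v / step)) for v in values]


def dequantize(codes, error_bound):
    """Inverse of quantize: reconstruct each value as the centre of its bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def entropy_encode(codes):
    """Lossless stage: zlib is a stand-in here, not the coder used by NCCLZ."""
    raw = struct.pack(f"<{len(codes)}i", *codes)
    return zlib.compress(raw)


def entropy_decode(blob, count):
    """Undo entropy_encode and recover the integer codes."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{count}i", raw))


def roundtrip(values, error_bound, chunk_len):
    """Quantize once up front, then entropy-code each chunk independently."""
    codes = quantize(values, error_bound)  # "interface" layer
    chunks = [codes[i:i + chunk_len] for i in range(0, len(codes), chunk_len)]

    # "Transport" layer: each chunk is coded on its own, so in a real system
    # coding of one chunk could overlap with sending another.
    wire = [(len(c), entropy_encode(c)) for c in chunks]
    sent_bytes = sum(len(blob) for _, blob in wire)

    received = []
    for count, blob in wire:
        received.extend(entropy_decode(blob, count))
    return dequantize(received, error_bound), sent_bytes


if __name__ == "__main__":
    data = [math.sin(i * 0.01) for i in range(4096)]
    recon, sent = roundtrip(data, error_bound=1e-3, chunk_len=1024)
    worst = max(abs(a - b) for a, b in zip(data, recon))
    print(f"max error {worst:.2e}, bytes sent {sent} vs raw {4 * len(data)}")
```

Because the two stages share only a list of integer codes, either one can be swapped out without touching the other, which is the flexibility the abstract attributes to the decoupled design.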
