CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling

Dong Xu (1); Han Meng (1); Xinyu Chen (2); Dengcheng Zhu (3); Wei Tang (3); Fei Liu (3); Liguang Xie (3); Wu Xiang (3); Rui Shi (3); Yue Li (3); Henry Hu (3); Hui Zhang (3); Jianping Jiang (4); Dong Li (1) ((1) UC Merced; (2) Zhejiang University; (3) Bytedance; (4) Xconn-tech)

arXiv:2602.22457·cs.DC·May 8, 2026

TL;DR

This paper introduces CCCL, a GPU collective communication library that uses CXL shared memory pools for scalable cross-node GPU operations without traditional RDMA-based networking, improving performance and reducing hardware costs.

Contribution

The paper presents CCCL, a novel collective communication library that uses CXL shared memory pools to enable scalable, efficient cross-node GPU communication without RDMA-based networking.

Findings

01

CCCL achieves 1.34× to 1.94× performance improvements over RDMA-based implementations.

02

In LLM training, CCCL provides 1.11× speedup and reduces hardware costs by 2.75×.

03

Evaluation on multiple nodes demonstrates CCCL's high efficiency and scalability.

Abstract

Training or running inference for large language models (LLMs) across multiple nodes puts significant pressure on GPU memory and interconnect bandwidth. The Compute Express Link (CXL) shared memory pool offers a scalable solution by enabling memory sharing across nodes, reducing over-provisioning and improving resource utilization. We propose CCCL, a collective communication library that leverages the CXL shared memory pool to support cross-node GPU operations without relying on traditional RDMA-based networking. Our design addresses the challenges of synchronization, data interleaving, and communication parallelization that arise when using the CXL shared memory pool for collective communications. Evaluating on multiple nodes with a TITAN-II CXL switch and six Micron CZ120 memory cards, we show that CCCL achieves highly efficient collective operations across hosts, demonstrating CXL's potential for…
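
The abstract names data interleaving as one of the design challenges but does not describe the scheme used. As a rough illustration only, the sketch below shows plain round-robin interleaving of a communication buffer across several memory devices, so that consecutive chunks land on different devices. The function `interleave`, the `Placement` record, and the chunk size are all hypothetical names and parameters chosen for this example; this is not CCCL's actual layout or API.

```python
from typing import List, NamedTuple


class Placement(NamedTuple):
    """Where one chunk of the source buffer lands (hypothetical record)."""
    device: int         # index of the memory device holding the chunk
    device_offset: int  # byte offset inside that device's region
    buffer_offset: int  # byte offset inside the original buffer
    length: int         # chunk length in bytes


def interleave(total_bytes: int, chunk_bytes: int, num_devices: int) -> List[Placement]:
    """Split a buffer into fixed-size chunks and assign them round-robin.

    Assumption: chunk i goes to device (i % num_devices), and each device
    stores its chunks back to back. The real system may use another policy.
    """
    if chunk_bytes <= 0 or num_devices <= 0:
        raise ValueError("chunk_bytes and num_devices must be positive")
    placements = []
    for i, start in enumerate(range(0, total_bytes, chunk_bytes)):
        length = min(chunk_bytes, total_bytes - start)  # last chunk may be short
        placements.append(
            Placement(
                device=i % num_devices,
                device_offset=(i // num_devices) * chunk_bytes,
                buffer_offset=start,
                length=length,
            )
        )
    return placements


if __name__ == "__main__":
    # Six devices, matching the number of memory cards in the evaluation setup.
    for p in interleave(total_bytes=28, chunk_bytes=4, num_devices=6):
        print(p)
```

With six devices, the seventh chunk wraps around to device 0 at the next device offset, which is the property that lets several devices serve one transfer in parallel.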
