Accelerating Intra-Node GPU-to-GPU Communication Through Multi-Path Transfers with CUDA Graphs
Amirhossein Sojoodi, Yiltan Hassan Temucin, Amirreza Baratisedeh, Hamed Sharifian, and Ahmad Afsahi

TL;DR
This paper introduces a CUDA Graph-based multi-path GPU communication method within UCX that improves intra-node GPU-to-GPU bandwidth by up to 2.95x over single-path UCX (UCT::CUDA-IPC) on a four-GPU node.
Contribution
To the best of the authors' knowledge, this is the first work to integrate CUDA Graphs into UCX, applying them to intra-node multi-path point-to-point GPU communication.
Findings
Up to 2.95x bandwidth improvement over single-path UCX (UCT::CUDA-IPC) on a four-GPU node.
Reduced communication overhead with the CUDA Graph-based multi-path approach.
Concurrent use of multiple paths, including NVLink and PCIe through the host, for GPU-to-GPU communication.
Abstract
Effective intra-node GPU communication is essential for optimizing performance in MPI-based HPC applications, especially when leveraging multiple communication paths. In this study, we propose a novel approach that integrates CUDA Graphs into the UCX framework to enhance intra-node multi-path point-to-point GPU communication. By concurrently leveraging multiple paths, including NVLink and PCIe through the host, and optimizing communication workflows using CUDA Graph, we achieve significant reductions in communication overhead and improve execution efficiency. To the best of our knowledge, our proposed approach is the first to seamlessly integrate CUDA Graphs into UCX. Through extensive experiments on a four-GPU node, our proposed CUDA Graph-based multi-path communication approach achieves up to a 2.95x bandwidth improvement, compared to the single-path UCX (UCT::CUDA-IPC), in GPU-to-GPU…
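To make the multi-path idea concrete, the following is a minimal Python sketch of one way a message could be divided across concurrent paths. It is not the authors' implementation: the bandwidth-proportional split, the function names (`split_message`, `transfer_time_s`, `ideal_speedup`), the path names, and the bandwidth values are all hypothetical and chosen only for illustration. The model also ignores launch overhead, link contention, and the CUDA Graph mechanics themselves.

```python
from typing import Dict


def split_message(total_bytes: int, path_bw_gbps: Dict[str, float]) -> Dict[str, int]:
    """Split a message across paths in proportion to each path's bandwidth.

    Assumption: a proportional split is used here for illustration only;
    the source does not state how chunks are sized.
    """
    total_bw = sum(path_bw_gbps.values())
    names = list(path_bw_gbps)
    chunks: Dict[str, int] = {}
    assigned = 0
    for name in names[:-1]:
        size = int(total_bytes * path_bw_gbps[name] / total_bw)
        chunks[name] = size
        assigned += size
    # The last path takes the remainder so the chunk sizes sum exactly.
    chunks[names[-1]] = total_bytes - assigned
    return chunks


def transfer_time_s(chunks: Dict[str, int], path_bw_gbps: Dict[str, float]) -> float:
    """Paths run concurrently, so the slowest chunk determines the total time."""
    return max(chunks[name] / (path_bw_gbps[name] * 1e9) for name in chunks)


def ideal_speedup(path_bw_gbps: Dict[str, float], baseline: str) -> float:
    """Upper bound on speedup over using only the baseline path."""
    return sum(path_bw_gbps.values()) / path_bw_gbps[baseline]


if __name__ == "__main__":
    # Hypothetical bandwidths in GB/s, not measurements from the paper.
    bw = {"nvlink_direct": 48.0, "pcie_via_host": 12.0}
    message = 64 * 1024 * 1024  # 64 MiB

    chunks = split_message(message, bw)
    single = transfer_time_s({"nvlink_direct": message}, bw)
    multi = transfer_time_s(chunks, bw)

    print("chunk sizes (bytes):", chunks)
    print(f"single-path time: {single * 1e3:.3f} ms")
    print(f"multi-path time:  {multi * 1e3:.3f} ms")
    print(f"ideal speedup:    {ideal_speedup(bw, 'nvlink_direct'):.2f}x")
```

In this simplified model the achievable speedup is bounded by the ratio of aggregate bandwidth to the baseline path's bandwidth, so a real system needs enough additional path capacity, and low enough orchestration overhead, for the extra paths to pay off.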
