The Landscape of GPU-Centric Communication
Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov

TL;DR
This paper surveys GPU-centric communication methods, vendor tools, and libraries, highlighting their roles in improving multi-GPU scalability and performance in HPC and ML applications.
Contribution
It provides a comprehensive landscape of GPU-centric communication, clarifying terminology, categorizing approaches, and discussing future research directions.
Findings
Vendor mechanisms reduce CPU involvement in multi-GPU communication.
Major libraries offer diverse benefits and face specific challenges.
Performance insights guide optimal exploitation of multi-GPU systems.
Abstract
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and high memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advances in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches between multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided…
