GPU-Initiated Networking for NCCL

Khaled Hamidouche (1); John Bachan (1); Pak Markthub (1); Peter-Jan Gootzen (1); Elena Agostini (1); Sylvain Jeaugey (1); Aamir Shafi (1); Georgios Theodorakis (1); Manjunath Gorentla Venkata (1) ((1) NVIDIA Corporation)

arXiv:2511.15076 · cs.DC · November 26, 2025

TL;DR

This paper introduces GPU-Initiated Networking (GIN) in NCCL 2.28, enabling direct GPU-to-network communication to reduce latency and improve efficiency for AI workloads like Mixture-of-Experts.

Contribution

It presents the GIN architecture and APIs, allowing device-initiated communication that integrates seamlessly with NCCL and supports various hardware backends.

Findings

01. GIN reduces communication latency in MoE workloads.

02. Integration with DeepEP demonstrates practical benefits.

03. Benchmarking confirms low-latency performance with GIN.

Abstract

Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model in which the CPU orchestrates all communication operations, a characteristic of the CUDA runtime. Although this model is robust for collective operations, applications that require tight integration of computation and communication can benefit from device-initiated communication, which eliminates CPU coordination overhead. NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, and semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cloud Computing and Resource Management