NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL

Amos Goldman; Nimrod Boker; Maayan Sheraizin; Nimrod Admoni; Artem Polyakov; Subhadeep Bhattacharya; Fan Yu; Kai Sun; Georgios Theodorakis; Hsin-Chun Yin; Peter-Jan Gootzen; Aamir Shafi; Assaf Ravid; Salvatore Di Girolamo; James Dinan; Xiaofan Li; Manjunath Gorentla Venkata; Gil Bloch (NVIDIA Corporation)

arXiv:2603.13606 · cs.DC · April 3, 2026

TL;DR

NCCL EP introduces a unified, GPU-initiated communication library for Mixture-of-Experts architectures, supporting low-latency and high-throughput modes with optimized intra- and inter-node communication on NVIDIA hardware.

Contribution

The paper presents NCCL EP, an MoE communication library built from the ground up on NCCL's Device API to enable efficient expert parallelism for large language models.

Findings

1. Demonstrates competitive low-latency kernel performance on H100 clusters.
2. Provides end-to-end evaluation with vLLM integration.
3. Supports both inference and training modes with topology-aware communication.

Abstract

Mixture-of-Experts (MoE) architectures have become essential for scaling large language models, driving the development of specialized device-initiated communication libraries such as DeepEP, Hybrid-EP, and others. These libraries demonstrate the performance benefits of GPU-initiated RDMA for MoE dispatch and combine operations. This paper presents NCCL EP (Expert Parallelism), a ground-up MoE communication library built entirely on NCCL's Device API. NCCL EP provides unified ncclEpDispatch and ncclEpCombine primitives with both C and Python interfaces, supporting Low-Latency (LL) mode for inference decoding and High-Throughput (HT) mode for training and inference prefill. LL targets small batch sizes (1-128 tokens) using direct all-to-all RDMA+NVLink mesh connectivity with double-buffered communication for overlapping dispatch and combine phases. HT targets large batches (4096+…
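
To make the dispatch and combine operations named in the abstract concrete, here is a minimal single-process sketch of their semantics in plain Python. It is not the NCCL EP API: the function names `dispatch`, `run_experts`, and `combine`, the routing format, and the toy experts are all assumptions chosen for illustration, and no RDMA, NVLink, or GPU code is involved. Dispatch groups each token by the experts it is routed to, and combine sums the weighted expert outputs back into the original token order.

```python
from collections import defaultdict

def dispatch(tokens, routing):
    """Group tokens by destination expert (hypothetical helper, not NCCL EP).

    tokens:  list of vectors (lists of floats)
    routing: per token, a list of (expert_id, weight) pairs
    Returns {expert_id: [(token_index, vector, weight), ...]}.
    """
    inbox = defaultdict(list)
    for idx, (vec, routes) in enumerate(zip(tokens, routing)):
        for expert_id, weight in routes:
            inbox[expert_id].append((idx, vec, weight))
    return inbox

def run_experts(inbox, experts):
    """Apply each expert function to the tokens it received."""
    return {
        expert_id: [(idx, experts[expert_id](vec), weight)
                    for idx, vec, weight in items]
        for expert_id, items in inbox.items()
    }

def combine(expert_outputs, num_tokens, dim):
    """Sum weighted expert outputs back into original token order."""
    out = [[0.0] * dim for _ in range(num_tokens)]
    for results in expert_outputs.values():
        for idx, vec, weight in results:
            for d in range(dim):
                out[idx][d] += weight * vec[d]
    return out

# Toy experts (assumed for illustration): expert 0 is the identity,
# expert 1 doubles every element.
experts = [
    lambda v: [x * 1.0 for x in v],
    lambda v: [x * 2.0 for x in v],
]

tokens = [[1.0, 2.0], [3.0, 4.0]]
# Token 0 goes to experts 0 and 1 with equal weight; token 1 only to expert 1.
routing = [[(0, 0.5), (1, 0.5)], [(1, 1.0)]]

inbox = dispatch(tokens, routing)
combined = combine(run_experts(inbox, experts), num_tokens=len(tokens), dim=2)
print(combined)
```

In a real expert-parallel deployment the experts live on different GPUs, so the grouping step becomes an all-to-all exchange and the summation step becomes the reverse exchange. That exchange is the communication the library is described as handling.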
