NCCLbpf: Verified, Composable Policy Execution for GPU Collective Communication

Yusheng Zheng

arXiv:2603.11438·cs.DC·May 5, 2026

TL;DR

NCCLbpf is an eBPF-based extension framework for NCCL that verifies plugins at load time, lets policies compose through shared maps, and supports hot-reloads, while adding only nanosecond-scale overhead per decision.

Contribution

NCCLbpf embeds a userspace eBPF runtime into NCCL's existing plugin interfaces, enabling load-time verification, policy composition, and atomic hot-reloads without modifying NCCL itself.
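The combination of load-time checking and hot-reload can be illustrated with a minimal Python sketch. It is not NCCLbpf's implementation: `PolicySlot`, `ALLOWED_ALGORITHMS`, and the probe-based `_check` are hypothetical stand-ins, and a real eBPF verifier analyzes bytecode statically rather than probing a function with sample inputs. The sketch only shows the shape of the idea: a candidate policy is checked before it is installed, and a rejected candidate leaves the running policy untouched.

```python
import threading
from typing import Callable

# Hypothetical policy signature: message size in bytes -> algorithm name.
Policy = Callable[[int], str]

# Hypothetical set of outputs the host is willing to accept from a policy.
ALLOWED_ALGORITHMS = {"ring", "tree"}


class PolicySlot:
    """Holds the active policy and swaps it only after a load-time check."""

    def __init__(self, initial: Policy) -> None:
        self._check(initial)
        self._lock = threading.Lock()
        self._active = initial

    @staticmethod
    def _check(policy: Policy) -> None:
        # Stand-in for static verification: probe a few sizes and reject
        # any policy that returns a value outside the allowed set.
        for size in (0, 1, 1 << 10, 1 << 20, 1 << 30):
            result = policy(size)
            if result not in ALLOWED_ALGORITHMS:
                raise ValueError(f"policy returned {result!r} for size {size}")

    def reload(self, candidate: Policy) -> None:
        # Check first; the active policy is replaced only if the check passes.
        self._check(candidate)
        with self._lock:
            self._active = candidate

    def decide(self, message_bytes: int) -> str:
        # Read the current policy once so a concurrent reload cannot
        # change it in the middle of a decision.
        policy = self._active
        return policy(message_bytes)
```

A caller would construct one `PolicySlot` per decision point, call `decide` on the hot path, and call `reload` from a control thread whenever a new policy is available.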

Findings

01

NCCLbpf adds only 80-130 ns of overhead per policy decision, less than 0.03% of latency.

02

Load-time static verification rejects unsafe plugins before they run.

03

It improves AllReduce throughput by up to 27% with message-size-aware policies.
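As an illustration of what a message-size-aware policy might look like, the following Python sketch maps a message size to an algorithm and channel count. The thresholds, the channel counts, and the `Decision` and `choose` names are assumptions made for this example and are not taken from the paper; the only general idea it relies on is that tree-style algorithms tend to favor latency for small messages while ring-style algorithms tend to favor bandwidth for large ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    algorithm: str  # e.g. "tree" or "ring"
    channels: int   # number of channels to use


# Hypothetical thresholds; real values would come from measurement.
SMALL_MAX_BYTES = 64 * 1024
LARGE_MIN_BYTES = 16 * 1024 * 1024


def choose(message_bytes: int) -> Decision:
    """Pick an algorithm and channel count from the message size alone."""
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    if message_bytes <= SMALL_MAX_BYTES:
        # Small messages: favor low latency.
        return Decision("tree", 2)
    if message_bytes < LARGE_MIN_BYTES:
        # Medium messages: switch to ring with a moderate channel count.
        return Decision("ring", 8)
    # Large messages: favor bandwidth with more channels.
    return Decision("ring", 16)
```

Because the function is pure and branch-only, the per-decision cost is a handful of comparisons, which is the kind of logic that fits a nanosecond-scale decision budget.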

Abstract

NCCL is the de facto standard for collective GPU communication in large-scale distributed training, relying heavily on plugins to customize runtime behavior. However, these plugins execute as unverified native code within NCCL's address space, risking job crashes, silent state corruption, and downtime from restarts during policy updates. Inspired by kernel extensibility models, we introduce NCCLbpf, a verified, high-performance extension framework embedding a userspace eBPF runtime directly into NCCL's existing plugin interfaces, without modifying NCCL itself. NCCLbpf offers load-time static verification to prevent unsafe plugin execution, structured cross-plugin maps enabling composable policies and closed-loop adaptation, and atomic policy hot-reloads eliminating downtime previously required for policy updates. Evaluations on 8x NVIDIA B300 GPUs connected via NVLink demonstrate that…
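The closed-loop adaptation through cross-plugin maps described in the abstract can be sketched as two cooperating functions that share one map. Everything below is hypothetical: the map layout, the `profiler_record` and `tuner_channels` names, the smoothing factor, and the latency target are chosen for illustration only, and a plain Python dictionary stands in for an eBPF map.

```python
from typing import Dict

# Hypothetical tuning constants for this sketch.
LATENCY_TARGET_US = 500.0
ALPHA = 0.25  # weight of the newest sample in the moving average


def profiler_record(shared: Dict[str, float], comm: str, latency_us: float) -> None:
    """Profiler side: fold a new latency sample into the shared map."""
    previous = shared.get(comm)
    if previous is None:
        shared[comm] = latency_us
    else:
        shared[comm] = (1 - ALPHA) * previous + ALPHA * latency_us


def tuner_channels(shared: Dict[str, float], comm: str,
                   base: int = 8, boosted: int = 16) -> int:
    """Tuner side: read the smoothed latency and adapt the channel count."""
    smoothed = shared.get(comm)
    if smoothed is None:
        # No feedback yet, so fall back to the default.
        return base
    return boosted if smoothed > LATENCY_TARGET_US else base
```

The point of the structure is that neither function calls the other: the profiler only writes, the tuner only reads, and the shared map is the sole interface between them.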
