UCCL-EP: Portable Expert-Parallel Communication
Ziming Mao, Yihan Zhang, Chihan Cui, Zhen Huang, Kaichao You, Zhongjie Chen, Zhiying Xu, Zhenyu Gu, Scott Shenker, Costin Raiciu, Yang Zhou, Ion Stoica

TL;DR
UCCL-EP is a portable, high-performance expert-parallel communication system that maintains DeepEP-level efficiency across diverse GPU and NIC hardware by replacing GPU-initiated RDMA with a GPU-CPU control channel and emulating ordering semantics.
Contribution
The paper introduces UCCL-EP, a portable expert-parallel (EP) communication system that achieves high performance across heterogeneous GPU and NIC platforms by combining a GPU-CPU control channel with emulation of the ordering semantics that EP communication requires.
Findings
Outperforms existing EP solutions by up to 2.1x on AWS Elastic Fabric Adapter (EFA) platforms.
Achieves comparable performance to DeepEP on NVIDIA-only platforms.
Improves token throughput by up to 40-45%, depending on the benchmark.
Abstract
Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. The poor portability is rooted in architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces. We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized…
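The control-channel idea in the abstract can be illustrated with a small sketch. This is not the UCCL-EP implementation: a Python `queue.Queue` stands in for the GPU-CPU channel, the 16-byte command layout (`dst_rank`, `expert_id`, `token_offset`, `num_bytes`) is a hypothetical example of a "compact token-routing command", the token size and rank count are made-up values, and the RDMA write is stubbed out by recording the decoded command.

```python
import queue
import struct
import threading

# Hypothetical compact command: four little-endian uint32 fields (16 bytes).
# The field names and layout are assumptions for illustration only.
CMD = struct.Struct("<IIII")  # dst_rank, expert_id, token_offset, num_bytes


def encode_cmd(dst_rank, expert_id, token_offset, num_bytes):
    """Pack one token-routing command into its fixed-size wire form."""
    return CMD.pack(dst_rank, expert_id, token_offset, num_bytes)


def proxy_loop(channel, issued, lock):
    """CPU proxy thread: drain commands and issue a stubbed transfer for each."""
    while True:
        raw = channel.get()
        if raw is None:  # sentinel: no more commands for this proxy
            break
        cmd = CMD.unpack(raw)
        # A real proxy would post a GPUDirect RDMA write here.
        # This sketch only records the decoded command.
        with lock:
            issued.append(cmd)


def run_demo(num_proxies=4, num_tokens=64, token_bytes=7168, num_ranks=8):
    """Producer (standing in for the GPU) enqueues commands; proxies drain them."""
    channel = queue.Queue(maxsize=1024)  # stand-in for the GPU-CPU channel
    issued, lock = [], threading.Lock()
    proxies = [
        threading.Thread(target=proxy_loop, args=(channel, issued, lock))
        for _ in range(num_proxies)
    ]
    for t in proxies:
        t.start()
    for tok in range(num_tokens):
        channel.put(encode_cmd(tok % num_ranks, tok % 16,
                               tok * token_bytes, token_bytes))
    for _ in proxies:  # one sentinel per proxy, queued after all commands
        channel.put(None)
    for t in proxies:
        t.join()
    return issued


if __name__ == "__main__":
    done = run_demo()
    print(f"{len(done)} commands issued by proxies")
```

Because several proxies drain the same channel, commands may complete in a different order than they were enqueued. The sketch makes no attempt to restore any ordering; it only shows the split between a producer that emits small fixed-size commands and CPU threads that act on them.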
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Graph Theory and Algorithms · Network Packet Processing and Optimization
