Eliminating Hidden Serialization in Multi-Node Megakernel Communication

Byungsoo Oh; Rachee Singh

arXiv:2605.00686·cs.DC·May 4, 2026

TL;DR

Perseus removes serialization bottlenecks in multi-node megakernel communication for Mixture-of-Experts (MoE) inference by reducing fence overhead and NIC stalls, yielding up to 10.3× end-to-end speedup on proxy-based transports.

Contribution

The paper presents Perseus, an approach that removes hidden serialization in proxy-based RDMA transports to speed up multi-node MoE inference.

Findings

01

Perseus achieves up to 10.3× end-to-end speedup on proxy-based transports.

02

Perseus matches GPU-direct performance or exceeds it by up to 1.2×.

03

Serialization, not transport choice, limits multi-node megakernel performance.

Abstract

Recent megakernel designs for Mixture-of-Experts (MoE) inference fuse expert computation with fine-grained, GPU-initiated communication into a single persistent GPU kernel, and outperform collective-based MoE on a single node by overlapping data transfer with compute at tile granularity. This benefit does not carry over cleanly to multi-node inference, where experts span many nodes connected by an RDMA fabric. Communication-bound MoE models regress by up to $10 \times$ on 8 nodes, and the regression worsens with node count. We trace this regression to hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline, and its cost grows with the number of concurrent transfers. As a result, models whose per-expert compute is too small to absorb this inflated network latency expose…
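The fence effect described in the abstract can be illustrated with a toy timing model. The sketch below is not the paper's model or measurement: the function names, the `latency` and `gap` parameters, and the assumption that every tile issues one data write followed by one completion signal are all hypothetical, chosen only to show why waiting for each write to complete before posting its signal scales worse than keeping the NIC pipeline full.

```python
def pipelined_time(n_tiles, latency, gap):
    """Completion time when data writes and signals are posted back to back.

    Toy assumption: the NIC accepts one message every `gap` time units and
    each message completes `latency` units after it is posted.
    """
    messages = 2 * n_tiles  # one data write + one signal per tile
    last_issue = (messages - 1) * gap
    return last_issue + latency


def fenced_time(n_tiles, latency, gap):
    """Completion time when a fence separates each data write from its signal.

    Toy assumption: the fence blocks until the data write has completed,
    so the pipeline is drained once per tile before the signal is posted.
    """
    t = 0.0
    signal_issue = 0.0
    for _ in range(n_tiles):
        t += latency        # fence: wait for this tile's data write to land
        signal_issue = t    # only now may the completion signal be posted
        t += gap            # next tile's data write follows one gap later
    return signal_issue + latency


if __name__ == "__main__":
    # Hypothetical units: latency = 10, gap = 1.
    for n in (1, 4, 16, 64):
        print(n, pipelined_time(n, 10, 1), fenced_time(n, 10, 1))
```

Under these assumptions the pipelined schedule pays the latency once, while the fenced schedule pays it once per tile, so the difference between the two widens as the number of tile transfers grows.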
