FlashMoE: Fast Distributed MoE in a Single Kernel
Osayamen Jonathan Aimuyo, Byungsoo Oh, Rachee Singh

TL;DR
FlashMoE is a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent kernel, improving GPU utilization, latency, and throughput for large-scale distributed training.
Contribution
The paper presents FlashMoE, a GPU kernel design that fuses computation and communication, removing the CPU-managed scheduling and kernel launch overheads that limit existing MoE implementations.
Findings
Up to 9x higher GPU utilization
6x lower latency
5.7x higher throughput
Abstract
The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE eliminates bulk-synchronous collectives for one-sided, device-initiated, inter-GPU (R)DMA transfers,…
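To make the dispatch, compute, and combine phases named in the abstract concrete, here is a minimal single-process sketch in plain Python. It is an illustration under stated assumptions, not FlashMoE's implementation: the function names (`top_k_route`, `moe_forward`), the scalar "tokens", and the toy experts are all hypothetical, and the loop runs sequentially on the CPU. It shows only the sparsity argument: with top-k routing, each token invokes k experts rather than all of them.

```python
# Illustrative sketch of the three MoE phases. All names are hypothetical;
# nothing here models GPU kernels or inter-GPU transfers.

def top_k_route(scores, k):
    """Return (expert index, normalized weight) for the k highest scores."""
    top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    total = sum(scores[e] for e in top)
    return [(e, scores[e] / total) for e in top]


def moe_forward(tokens, gate_scores, experts, k=1):
    """Run a toy MoE layer; return (outputs, number of expert invocations)."""
    # Dispatch: group (token index, weight) pairs by the expert they route to.
    buckets = {e: [] for e in range(len(experts))}
    for t, scores in enumerate(gate_scores):
        for e, w in top_k_route(scores, k):
            buckets[e].append((t, w))

    out = [0.0] * len(tokens)
    calls = 0
    for e, assigned in buckets.items():
        for t, w in assigned:
            y = experts[e](tokens[t])  # Compute: apply the selected expert.
            out[t] += w * y            # Combine: weighted sum per token.
            calls += 1
    return out, calls


# Toy inputs: 3 scalar tokens, 4 experts, one gate-score row per token.
tokens = [1.0, 2.0, 3.0]
experts = [lambda x: 2 * x, lambda x: x + 10, lambda x: -x, lambda x: x * x]
gate_scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

out, calls = moe_forward(tokens, gate_scores, experts, k=1)
dense_calls = len(tokens) * len(experts)
print(out, calls, dense_calls)
```

With k=1 the sketch makes 3 expert calls where a dense layer would make 12. In a distributed setting the dispatch and combine steps become inter-GPU transfers; the abstract's claim is that FlashMoE pipelines these with expert compute inside one persistent kernel, which this sequential sketch does not attempt to reproduce.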
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · IoT and Edge/Fog Computing
Methods: Mixture of Experts
