FlashMoE: Fast Distributed MoE in a Single Kernel

Osayamen Jonathan Aimuyo; Byungsoo Oh; Rachee Singh

arXiv:2506.04667 · cs.DC · November 11, 2025

TL;DR

FlashMoE is a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent kernel, raising GPU utilization, lowering latency, and increasing throughput for large-scale distributed neural network training.

Contribution

The paper presents FlashMoE, a GPU kernel design that fuses computation and communication, removing the CPU-managed scheduling and kernel launch overheads that limit existing MoE implementations.

Findings

01 Up to 9x higher GPU utilization

02 6x lower latency

03 5.7x higher throughput

Abstract

The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE replaces bulk-synchronous collectives with one-sided, device-initiated, inter-GPU (R)DMA transfers,…
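To illustrate why pipelining the dispatch, compute, and combine phases can reduce idle gaps, here is a toy timing model in Python. It is a sketch under stated assumptions, not the authors' kernel or their measurement method: the function names, the tile count, and the per-stage times are all hypothetical, each phase is assumed to run on its own resource, and launch overheads and network contention are ignored.

```python
def bulk_synchronous_time(num_tiles, stage_times):
    """Each phase must finish for ALL tiles before the next phase starts."""
    return num_tiles * sum(stage_times)


def pipelined_time(num_tiles, stage_times):
    """Tiles flow through the phases; a tile starts a phase as soon as
    (a) it has left the previous phase and (b) that phase is free."""
    # finish[s] = time at which phase s completed its most recent tile
    finish = [0.0] * len(stage_times)
    for _ in range(num_tiles):
        ready = 0.0  # time at which the current tile is ready for the next phase
        for s, duration in enumerate(stage_times):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


if __name__ == "__main__":
    # Hypothetical per-tile costs (arbitrary units): dispatch, compute, combine.
    stages = (1.0, 2.0, 1.0)
    tiles = 4
    print(bulk_synchronous_time(tiles, stages))  # 16.0
    print(pipelined_time(tiles, stages))         # 10.0
```

With uniform per-tile costs, the pipelined total reduces to one full pass plus (tiles - 1) times the slowest phase, so the gain grows as communication and compute times become comparable. Real speedups depend on hardware and workload and are not predicted by this model.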

Code & Models

Videos

FlashMoE: Fast Distributed MoE in a Single Kernel · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · IoT and Edge/Fog Computing

Methods: Mixture of Experts