HarMoEny: Efficient Multi-GPU Inference of MoE Models

Zachary Doucet; Rishi Sharma; Martijn de Vos; Rafael Pires; Anne-Marie Kermarrec; Oana Balmau

arXiv:2506.12417·cs.DC·June 18, 2025

TL;DR

HarMoEny is a novel method that improves multi-GPU inference of MoE models by balancing load among experts and GPUs, significantly reducing latency and increasing throughput.

Contribution

It introduces two simple techniques—dynamic token redistribution and asynchronous expert prefetching—to effectively address load imbalance in multi-GPU MoE inference.

Findings

01. Increases throughput by 37%-70% under load imbalance.

02. Reduces time-to-first-token by 34%-41%.

03. Decreases GPU idling time by up to 84%.

Abstract

Mixture-of-Experts (MoE) models offer computational efficiency during inference by activating only a subset of specialized experts for a given input. This enables efficient model scaling on multi-GPU systems that use expert parallelism without compromising performance. However, load imbalance among experts and GPUs introduces waiting times, which can significantly increase inference latency. To address this challenge, we propose HarMoEny, a novel solution that tackles MoE load imbalance through two simple techniques: (i) dynamic token redistribution to underutilized GPUs and (ii) asynchronous prefetching of experts from system memory to GPU memory. These techniques achieve a near-perfect load balance among experts and GPUs and mitigate delays caused by overloaded GPUs. We implement HarMoEny and compare its latency and throughput with four MoE baselines using real-world and synthetic…
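
To make the idea of token redistribution concrete, here is a minimal sketch in Python. It is not HarMoEny's scheduling policy, which the abstract does not spell out. It assumes each GPU is described only by a token count, and the function name `rebalance_tokens` and the greedy pairing of overloaded with underloaded GPUs are illustrative choices made for this example.

```python
from typing import List, Tuple


def rebalance_tokens(loads: List[int]) -> Tuple[List[int], List[Tuple[int, int, int]]]:
    """Hypothetical helper: even out per-GPU token counts.

    `loads[g]` is the number of tokens currently routed to GPU g.
    Returns the balanced loads and a list of (src_gpu, dst_gpu, n_tokens)
    transfers that produce them.
    """
    n = len(loads)
    base, extra = divmod(sum(loads), n)

    # Every GPU targets `base` tokens; the `extra` leftover tokens stay on
    # the most loaded GPUs so that fewer tokens have to move.
    by_load = sorted(range(n), key=lambda g: loads[g], reverse=True)
    target = [base] * n
    for g in by_load[:extra]:
        target[g] += 1

    # Mutable [gpu, amount] pairs for GPUs above and below their target.
    surplus = [[g, loads[g] - target[g]] for g in range(n) if loads[g] > target[g]]
    deficit = [[g, target[g] - loads[g]] for g in range(n) if loads[g] < target[g]]

    transfers: List[Tuple[int, int, int]] = []
    i = j = 0
    while i < len(surplus) and j < len(deficit):
        moved = min(surplus[i][1], deficit[j][1])
        transfers.append((surplus[i][0], deficit[j][0], moved))
        surplus[i][1] -= moved
        deficit[j][1] -= moved
        if surplus[i][1] == 0:
            i += 1
        if deficit[j][1] == 0:
            j += 1

    # Apply the transfers to obtain the balanced loads.
    new_loads = list(loads)
    for src, dst, count in transfers:
        new_loads[src] -= count
        new_loads[dst] += count
    return new_loads, transfers


if __name__ == "__main__":
    balanced, moves = rebalance_tokens([100, 20, 30, 10])
    print(balanced)
    print(moves)
```

In a real expert-parallel system, a GPU that receives redirected tokens also needs the weights of the expert those tokens were routed to. That is presumably where the second technique, asynchronous expert prefetching, comes in; the sketch above ignores it entirely.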

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Machine Learning in Healthcare