HarMoEny: Efficient Multi-GPU Inference of MoE Models
Zachary Doucet, Rishi Sharma, Martijn de Vos, Rafael Pires, Anne-Marie Kermarrec, Oana Balmau

TL;DR
HarMoEny is a novel method that improves multi-GPU inference of MoE models by balancing load among experts and GPUs, significantly reducing latency and increasing throughput.
Contribution
HarMoEny introduces two simple techniques, dynamic token redistribution and asynchronous expert prefetching, to address load imbalance in multi-GPU MoE inference.
Findings
Increases throughput by 37%-70% under load imbalance.
Reduces time-to-first-token by 34%-41%.
Decreases GPU idling time by up to 84%.
Abstract
Mixture-of-Experts (MoE) models offer computational efficiency during inference by activating only a subset of specialized experts for a given input. This enables efficient model scaling on multi-GPU systems that use expert parallelism without compromising performance. However, load imbalance among experts and GPUs introduces waiting times, which can significantly increase inference latency. To address this challenge, we propose HarMoEny, a novel solution that tackles MoE load imbalance through two simple techniques: (i) dynamic token redistribution to underutilized GPUs and (ii) asynchronous prefetching of experts from the system to GPU memory. These techniques achieve a near-perfect load balance among experts and GPUs and mitigate delays caused by overloaded GPUs. We implement HarMoEny and compare its latency and throughput with four MoE baselines using real-world and synthetic…
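The idea of redistributing tokens from overloaded to underutilized GPUs can be illustrated with a small sketch. This is not the paper's algorithm, whose details are not given in the abstract; it is a generic greedy rebalancer under simplifying assumptions: each GPU is described only by a token count, every token costs the same to process, and any GPU can run any token (in a real system that would require the relevant expert weights to be resident on the receiving GPU). The function name `rebalance_tokens` and its return format are hypothetical.

```python
from typing import List, Tuple


def rebalance_tokens(load: List[int]) -> Tuple[List[int], List[Tuple[int, int, int]]]:
    """Greedily even out per-GPU token counts (illustrative only).

    load[i] is the number of tokens currently routed to GPU i.
    Returns the balanced loads and a list of (src_gpu, dst_gpu, n_tokens) moves.
    """
    n = len(load)
    base, extra = divmod(sum(load), n)
    # Target loads differ by at most one token across GPUs.
    target = [base + (1 if i < extra else 0) for i in range(n)]

    # GPUs above their target donate tokens; GPUs below it receive them.
    donors = [[i, load[i] - target[i]] for i in range(n) if load[i] > target[i]]
    receivers = [[i, target[i] - load[i]] for i in range(n) if load[i] < target[i]]

    transfers: List[Tuple[int, int, int]] = []
    d = r = 0
    while d < len(donors) and r < len(receivers):
        moved = min(donors[d][1], receivers[r][1])
        transfers.append((donors[d][0], receivers[r][0], moved))
        donors[d][1] -= moved
        receivers[r][1] -= moved
        if donors[d][1] == 0:
            d += 1
        if receivers[r][1] == 0:
            r += 1

    # Apply the moves to obtain the resulting per-GPU loads.
    new_load = list(load)
    for src, dst, count in transfers:
        new_load[src] -= count
        new_load[dst] += count
    return new_load, transfers


if __name__ == "__main__":
    balanced, moves = rebalance_tokens([10, 2, 0, 4])
    print(balanced, moves)
```

With the skewed input `[10, 2, 0, 4]`, the sketch moves tokens away from GPU 0 until every GPU holds four tokens. A production scheduler would also have to weigh communication cost and expert placement, which this sketch ignores.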
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Machine Learning in Healthcare
