GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
Sourish Wawdhane, Avinash Kumar, Poulami Das

TL;DR
GEM is a GPU-variability-aware expert mapping framework for MoE models. It reduces synchronization bottlenecks, and therefore inference latency, by accounting for both GPU performance variability and expert usage patterns when placing experts on GPUs.
Contribution
GEM introduces an expert placement strategy for MoE inference that considers GPU performance variability together with expert usage, so that heavily used experts are not concentrated on the slowest GPUs.
Findings
GEM reduces end-to-end latency by up to 16.5%.
GEM achieves an average latency improvement of 7.9%.
GEM balances token loads across GPUs while accounting for GPU variability.
Abstract
Mixture-of-Experts (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and, at inference time, route each token to the GPUs that host its activated experts. They process tokens in lock-step: all tokens in a batch must finish a layer before any can proceed to the next. This synchronization barrier is a critical bottleneck because the performance of MoE models is limited by the straggler GPU that finishes last. Stragglers emerge when too many heavily used experts are placed on the same GPU or on the slowest GPU. While prior works place experts to balance token loads across GPUs, they all overlook GPU variability and often place highly used experts on the slowest GPUs. We propose GEM, GPU-variability-aware Expert Mapping, a framework for GPU…
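The straggler effect described in the abstract can be illustrated with a small sketch. This is not GEM's algorithm, which the truncated abstract does not specify. It is a hypothetical greedy heuristic under assumed inputs: per-expert token counts (`expert_tokens`), per-GPU throughput in tokens per millisecond (`gpu_speed`), and a fixed number of expert slots per GPU. All names and numbers are invented for illustration.

```python
from typing import Dict, List

Placement = Dict[str, List[str]]


def layer_latency(placement: Placement,
                  expert_tokens: Dict[str, int],
                  gpu_speed: Dict[str, float]) -> float:
    """Lock-step latency of one MoE layer: the time of the slowest GPU."""
    times = []
    for gpu, experts in placement.items():
        tokens = sum(expert_tokens[e] for e in experts)
        times.append(tokens / gpu_speed[gpu])
    return max(times)


def greedy_variability_aware(expert_tokens: Dict[str, int],
                             gpu_speed: Dict[str, float],
                             experts_per_gpu: int) -> Placement:
    """Hypothetical heuristic (an assumption, not GEM's method):
    place experts from most to least used, each on the GPU that would
    finish earliest after receiving it, given that GPU's speed."""
    placement: Placement = {g: [] for g in gpu_speed}
    load = {g: 0 for g in gpu_speed}
    for e in sorted(expert_tokens, key=expert_tokens.get, reverse=True):
        # Only GPUs with a free expert slot are candidates.
        candidates = [g for g in gpu_speed if len(placement[g]) < experts_per_gpu]
        best = min(candidates,
                   key=lambda g: (load[g] + expert_tokens[e]) / gpu_speed[g])
        placement[best].append(e)
        load[best] += expert_tokens[e]
    return placement


# Made-up example: four experts, two GPUs, g1 is 20% slower than g0.
expert_tokens = {"e0": 400, "e1": 300, "e2": 200, "e3": 100}
gpu_speed = {"g0": 10.0, "g1": 8.0}  # tokens per millisecond (assumed)

# Token-balanced placement that ignores GPU speed: 500 tokens per GPU,
# but the most used expert lands on the slower GPU.
balanced: Placement = {"g0": ["e1", "e2"], "g1": ["e0", "e3"]}
aware = greedy_variability_aware(expert_tokens, gpu_speed, experts_per_gpu=2)

print(layer_latency(balanced, expert_tokens, gpu_speed))  # 62.5
print(layer_latency(aware, expert_tokens, gpu_speed))     # 60.0
```

In this toy setting the token-balanced placement looks even (500 tokens per GPU), yet the slower GPU becomes the straggler at 62.5 ms. Giving the faster GPU slightly more tokens lowers the layer latency to 60.0 ms, which is the kind of imbalance a variability-aware mapping is meant to exploit.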
