GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

Sourish Wawdhane; Avinash Kumar; Poulami Das

arXiv:2605.19945·cs.DC·May 20, 2026


TL;DR

GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE models. It reduces the synchronization bottleneck caused by straggler GPUs by considering both GPU variability and expert usage patterns, cutting end-to-end latency by up to 16.5% (7.9% on average).

Contribution

GEM introduces an expert placement strategy that accounts for both GPU performance variability and expert usage to optimize MoE inference.

Findings

01

GEM reduces end-to-end latency by up to 16.5%.

02

GEM achieves an average latency improvement of 7.9%.

03

GEM balances token loads across GPUs while accounting for GPU variability.

Abstract

Mixture-of-Experts (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and, at inference time, route each token to the GPUs that host its activated experts. They process tokens in lock-step: all tokens in a batch must finish a layer before the batch proceeds to the next one. This synchronization barrier is a critical bottleneck because the performance of MoE models is limited by the straggler GPU that finishes last. Stragglers emerge when too many heavily used experts are placed on the same GPU or on the slowest GPU. While prior works place experts so as to balance token loads across GPUs, they all overlook GPU variability and often place highly used experts on the slowest GPUs. We propose GEM, GPU-variability-aware Expert Mapping, a framework for GPU…
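The straggler effect described in the abstract can be illustrated with a small sketch. This is not GEM's algorithm, which the truncated abstract does not spell out. It is a simple greedy heuristic chosen for illustration, and the names `layer_latency`, `greedy_placement`, `expert_load`, `gpu_speed`, and `experts_per_gpu`, as well as all numbers, are hypothetical. It assumes per-expert token counts and a relative speed for each GPU are known, models layer time as the finish time of the slowest GPU, and compares a placement that ignores GPU speeds with one that uses them.

```python
from typing import List, Sequence


def layer_latency(placement: Sequence[int],
                  expert_load: Sequence[float],
                  gpu_speed: Sequence[float]) -> float:
    """Layer time under a lock-step barrier: the slowest GPU's finish time."""
    per_gpu = [0.0] * len(gpu_speed)
    for expert, gpu in enumerate(placement):
        per_gpu[gpu] += expert_load[expert]
    return max(load / speed for load, speed in zip(per_gpu, gpu_speed))


def greedy_placement(expert_load: Sequence[float],
                     assumed_speed: Sequence[float],
                     experts_per_gpu: int) -> List[int]:
    """Illustrative greedy heuristic (an assumption, not the paper's method).

    Experts are visited from heaviest to lightest; each goes to the GPU with
    free capacity whose estimated finish time after the assignment is lowest.
    """
    n_gpus = len(assumed_speed)
    load = [0.0] * n_gpus
    count = [0] * n_gpus
    placement = [-1] * len(expert_load)
    order = sorted(range(len(expert_load)), key=lambda e: -expert_load[e])
    for e in order:
        free = [g for g in range(n_gpus) if count[g] < experts_per_gpu]
        best = min(free, key=lambda g: (load[g] + expert_load[e]) / assumed_speed[g])
        placement[e] = best
        load[best] += expert_load[e]
        count[best] += 1
    return placement


# Hypothetical inputs: tokens routed to each of 8 experts, and the relative
# speed of 4 nominally identical GPUs (GPU 0 is the slowest).
expert_load = [400, 300, 200, 100, 100, 100, 50, 50]
gpu_speed = [0.8, 1.0, 1.0, 0.9]

# Variability-oblivious baseline: balance tokens as if all GPUs were equal.
oblivious = greedy_placement(expert_load, [1.0] * len(gpu_speed), experts_per_gpu=2)
# Variability-aware variant: same heuristic, but using the real speeds.
aware = greedy_placement(expert_load, gpu_speed, experts_per_gpu=2)

oblivious_latency = layer_latency(oblivious, expert_load, gpu_speed)
aware_latency = layer_latency(aware, expert_load, gpu_speed)
print("oblivious:", oblivious, oblivious_latency)
print("aware:    ", aware, aware_latency)
```

On this toy input the oblivious placement puts the heaviest expert on the slowest GPU and the layer takes about 562.5 time units, while the speed-aware placement takes 450. The gap comes entirely from which GPU becomes the straggler, which is the effect the abstract attributes to ignoring GPU variability.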
