MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing
Seokjin Go, Divya Mahajan

TL;DR
MoETuner is an optimization framework that improves the efficiency of large-scale Mixture-of-Experts models by balancing expert placement and token routing, reducing latency and increasing throughput.
Contribution
It introduces an ILP-based method for optimal expert-to-GPU assignment that jointly considers load balancing and communication costs in MoE models.
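The trade-off between load balancing and communication cost can be illustrated with a small sketch. It is not the paper's ILP formulation: it swaps the ILP solver for an exhaustive search over a tiny instance, and the names `placement_cost`, `best_placement`, `load`, `traffic`, and the weight `alpha` are all hypothetical. It assumes a per-expert token count and a symmetric expert-to-expert token traffic matrix, and scores each placement as the heaviest GPU load plus a weighted sum of cross-GPU traffic.

```python
from itertools import product


def placement_cost(assign, load, traffic, num_gpus, alpha=1.0):
    """Illustrative objective: max per-GPU token load + alpha * cross-GPU traffic."""
    # Token load on each GPU is the sum of loads of the experts placed there.
    gpu_load = [0] * num_gpus
    for expert, gpu in enumerate(assign):
        gpu_load[gpu] += load[expert]
    # Traffic counts only when the two experts sit on different GPUs.
    n = len(assign)
    cross = sum(
        traffic[i][j]
        for i in range(n)
        for j in range(n)
        if assign[i] != assign[j]
    )
    return max(gpu_load) + alpha * cross


def best_placement(load, traffic, num_gpus, alpha=1.0):
    """Brute-force stand-in for an ILP solver; only feasible for tiny inputs."""
    num_experts = len(load)
    per_gpu = num_experts // num_gpus  # assumed: equal expert count per GPU
    best, best_cost = None, float("inf")
    for assign in product(range(num_gpus), repeat=num_experts):
        if any(assign.count(g) != per_gpu for g in range(num_gpus)):
            continue  # skip placements that violate the capacity constraint
        cost = placement_cost(assign, load, traffic, num_gpus, alpha)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost


# Hypothetical profile: 4 experts, 2 GPUs.
load = [100, 90, 20, 10]          # tokens processed per expert
traffic = [                        # tokens exchanged between expert pairs
    [0, 5, 40, 0],
    [5, 0, 0, 30],
    [40, 0, 0, 5],
    [0, 30, 5, 0],
]

load_only = best_placement(load, traffic, num_gpus=2, alpha=0.0)
joint = best_placement(load, traffic, num_gpus=2, alpha=1.0)
print(load_only)  # ((0, 1, 1, 0), 110.0)
print(joint)      # ((0, 1, 0, 1), 140.0)
```

In this toy instance, optimizing load alone pairs experts 0 and 3, which gives perfectly even GPU loads but separates the expert pairs that exchange the most tokens. The joint objective accepts a slightly less even load (120 vs. 100 tokens) to keep those pairs together. A real system would replace the enumeration with an ILP solver, since the search space grows exponentially with the number of experts.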
Findings
Achieves up to 17.5% speedup in multi-node inference.
Effectively balances token load and reduces communication skew.
Demonstrates significant performance improvements over existing methods.
Abstract
The Mixture-of-Experts (MoE) model architecture has emerged as a promising solution for scaling transformer models efficiently, offering sparse activation that reduces computational costs while increasing model capacity. However, as MoE models scale, they must be distributed across GPU devices and thus face critical performance bottlenecks due to their large memory footprint. Expert parallelism distributes experts across GPUs but faces key challenges, including unbalanced token routing and expert activation, which result in communication tail latency and processing inefficiencies. While existing solutions address some of these issues, they fail to resolve the dual challenges of load imbalance and communication skew. The imbalance in token processing load across experts causes uneven processing times on different GPUs, while communication skew between GPUs leads to unbalanced inter-GPU…
Taxonomy
Topics: Expert finding and Q&A systems · Speech and dialogue systems · Context-Aware Activity Recognition Systems
Methods: Mixture of Experts · Sparse Evolutionary Training
