MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing

Seokjin Go; Divya Mahajan

arXiv:2502.06643·cs.LG·February 11, 2025

TL;DR

MoETuner is an optimization framework that improves the efficiency of large-scale Mixture-of-Experts models by balancing expert placement and token routing, reducing latency and increasing throughput.

Contribution

It introduces an ILP-based method for optimal expert-to-GPU assignment that jointly considers load balancing and communication costs in MoE models.
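To make the idea of jointly weighing load balance and communication cost concrete, here is a minimal sketch on a toy instance. It is not the paper's ILP: it replaces the solver with exhaustive search, and the objective (maximum per-GPU token load plus `alpha` times cross-GPU traffic), the function name `place_experts`, and all input numbers are assumptions made for illustration.

```python
from itertools import product


def place_experts(load, traffic, num_gpus, capacity, alpha=1.0):
    """Exhaustively search expert-to-GPU assignments on a toy instance.

    load[e]        -- tokens routed to expert e (hypothetical profile)
    traffic[a][b]  -- tokens exchanged between experts a and b (hypothetical)
    capacity       -- maximum number of experts a GPU may host
    Returns (best_cost, best_assignment), where best_assignment[e] is a GPU index.
    """
    num_experts = len(load)
    best = None
    for assign in product(range(num_gpus), repeat=num_experts):
        # Capacity constraint: each GPU hosts at most `capacity` experts.
        if any(assign.count(g) > capacity for g in range(num_gpus)):
            continue
        # Load term: total tokens processed by the busiest GPU.
        gpu_load = [0] * num_gpus
        for e, g in enumerate(assign):
            gpu_load[g] += load[e]
        # Communication term: traffic between experts placed on different GPUs.
        cross = sum(
            traffic[a][b]
            for a in range(num_experts)
            for b in range(a + 1, num_experts)
            if assign[a] != assign[b]
        )
        cost = max(gpu_load) + alpha * cross  # assumed combined objective
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best


# Hypothetical profile: 4 experts, 2 GPUs, 2 experts per GPU.
load = [40, 10, 30, 20]
traffic = [
    [0, 5, 1, 1],
    [5, 0, 1, 1],
    [1, 1, 0, 5],
    [1, 1, 5, 0],
]
best_cost, best_assign = place_experts(load, traffic, num_gpus=2, capacity=2)
print(best_cost, best_assign)
```

On this input the search co-locates experts 0 and 1 on one GPU and experts 2 and 3 on the other, which both equalizes the per-GPU token load and keeps the heaviest traffic pairs local. Exhaustive search only scales to a handful of experts; an ILP solver would express the same constraints and objective with binary assignment variables.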

Findings

01. Achieves up to 17.5% speedup in multi-node inference.

02. Effectively balances token load and reduces communication skew.

03. Demonstrates significant performance improvements over existing methods.

Abstract

The Mixture-of-Experts (MoE) model architecture has emerged as a promising solution for scaling transformer models efficiently, offering sparse activation that reduces computational costs while increasing model capacity. However, as MoE models scale, they need to be distributed across GPU devices and thus face critical performance bottlenecks due to their large memory footprint. Expert parallelism distributes experts across GPUs but faces key challenges, including unbalanced token routing and expert activation, which result in communication tail latency and processing inefficiencies. While existing solutions address some of these issues, they fail to resolve the dual challenges of load imbalance and communication skew. The imbalance in token processing load across experts causes uneven processing times on different GPUs, while communication skew between GPUs leads to unbalanced inter-GPU…
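The two imbalances named in the abstract can be quantified on a routing trace. The sketch below is illustrative only: the metric definitions (busiest GPU or link divided by the mean), the function name `imbalance_metrics`, and the sample trace are assumptions, not definitions taken from the paper.

```python
def imbalance_metrics(routes, expert_gpu, num_gpus):
    """Measure load imbalance and communication skew for a routing trace.

    routes      -- list of (source_gpu, expert) pairs, one per token (hypothetical)
    expert_gpu  -- expert_gpu[e] is the GPU hosting expert e
    Returns (load_imbalance, comm_skew); 1.0 means perfectly even.
    """
    compute = [0] * num_gpus                          # tokens processed per GPU
    sent = [[0] * num_gpus for _ in range(num_gpus)]  # sent[src][dst] token counts
    for src, expert in routes:
        dst = expert_gpu[expert]
        compute[dst] += 1
        if src != dst:  # only tokens that leave their source GPU cost communication
            sent[src][dst] += 1

    # Load imbalance: busiest GPU relative to the mean per-GPU load.
    load_imbalance = max(compute) / (sum(compute) / num_gpus)

    # Communication skew: busiest directed link relative to the mean link volume.
    links = [sent[s][d] for s in range(num_gpus) for d in range(num_gpus) if s != d]
    total = sum(links)
    comm_skew = max(links) / (total / len(links)) if total else 1.0
    return load_imbalance, comm_skew


# Hypothetical trace: 8 tokens, 4 experts, experts 0-1 on GPU 0 and 2-3 on GPU 1.
routes = [(0, 0), (0, 0), (1, 0), (1, 0), (1, 1), (0, 2), (1, 3), (1, 0)]
load_imbalance, comm_skew = imbalance_metrics(routes, expert_gpu=[0, 0, 1, 1], num_gpus=2)
print(load_imbalance, comm_skew)
```

Here expert 0 is heavily favored, so GPU 0 processes three times as many tokens as GPU 1, and most cross-GPU traffic flows in one direction. Both ratios exceed 1.0, which is the situation a balanced placement tries to avoid.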

Taxonomy

Topics: Expert finding and Q&A systems · Speech and dialogue systems · Context-Aware Activity Recognition Systems

Methods: Mixture of Experts · Sparse Evolutionary Training