Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony

Shaoyu Wang; Guangrong He; Geon-Woo Kim; Yanqi Zhou; Seo Jin Park

arXiv:2505.08944 · cs.DC · May 30, 2025

TL;DR

This paper presents Asynchronous Expert Parallelism (AEP) and the AMoE system, enabling more efficient, scalable, and cost-effective serving of Mixture-of-Experts models by reducing synchronization and load imbalance issues.

Contribution

Introduces the Asynchronous Expert Parallelism (AEP) paradigm and the AMoE serving system, which improve GPU utilization and scalability when serving Mixture-of-Experts models.

Findings

01. Up to 2.7× throughput improvement over baselines

02. Nearly linear scalability across multiple nodes

03. Manageable latency increase with higher throughput

Abstract

Mixture-of-Experts (MoE) architectures offer the promise of larger model capacity without the prohibitive costs of fully dense designs. However, in real-world inference serving, load skew across experts often leads to suboptimal device utilization and excessive synchronization overheads. This paper introduces Asynchronous Expert Parallelism (AEP), a new paradigm that decouples layer execution from barrier-style synchronization. By dynamically queuing tokens at each layer (referred to as $\mu$-queuing) and adaptively re-batching them on demand, GPUs avoid waiting for straggling experts and instead continuously process whichever layer is ready. This asynchronous approach mitigates two major inefficiencies in traditional expert-parallel systems: (1) idle GPU time while waiting for the hottest expert, and (2) small-batch executions on colder experts that waste memory bandwidth. We…
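
The queuing idea in the abstract can be illustrated with a toy, single-device sketch. This is not the AMoE implementation: the names `Token`, `LayerQueues`, and `run` are hypothetical, and the "largest queue first" scheduling policy is an assumption made for illustration, since the abstract only says that a GPU processes whichever layer is ready. The sketch keeps one queue per layer, repeatedly picks a non-empty layer, re-batches up to `max_batch` of its queued tokens, and forwards them to the next layer's queue without any cross-layer barrier.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    token_id: int
    layer: int = 0  # index of the next layer this token must pass through


class LayerQueues:
    """Hypothetical per-layer token queues on one device (illustrative only)."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.queues = [deque() for _ in range(num_layers)]

    def enqueue(self, token):
        self.queues[token.layer].append(token)

    def pick_layer(self):
        """Return the layer with the most queued tokens, or None if all are empty.

        Assumed policy: largest queue first, ties broken by lowest layer index.
        """
        best = max(range(self.num_layers), key=lambda i: len(self.queues[i]))
        return best if self.queues[best] else None

    def pop_batch(self, layer, max_batch):
        """Re-batch on demand: take up to max_batch tokens waiting at `layer`."""
        q = self.queues[layer]
        batch = []
        while q and len(batch) < max_batch:
            batch.append(q.popleft())
        return batch


def run(queues, max_batch):
    """Process queued tokens until every queue is empty, with no layer barrier."""
    finished = []  # token ids in completion order
    trace = []     # (layer, batch_size) for each executed batch
    while (layer := queues.pick_layer()) is not None:
        batch = queues.pop_batch(layer, max_batch)
        trace.append((layer, len(batch)))
        for tok in batch:
            tok.layer += 1  # stand-in for running the layer on the token
            if tok.layer < queues.num_layers:
                queues.enqueue(tok)
            else:
                finished.append(tok.token_id)
    return finished, trace


if __name__ == "__main__":
    q = LayerQueues(num_layers=2)
    for i in range(3):
        q.enqueue(Token(i, layer=0))
    q.enqueue(Token(3, layer=1))  # a token that is already one layer ahead
    finished, trace = run(q, max_batch=4)
    print(trace)     # [(0, 3), (1, 4)]
    print(finished)  # [3, 0, 1, 2]
```

In this toy run, the token already waiting at layer 1 is not executed alone in a batch of one; it stays queued until the three layer-0 tokens arrive and is then processed with them in a single batch of four. That is the small-batch inefficiency the abstract describes, avoided here only under the assumed scheduling policy.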

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · IoT and Edge/Fog Computing · Advanced Neural Network Applications