Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony
Shaoyu Wang, Guangrong He, Geon-Woo Kim, Yanqi Zhou, Seo Jin Park

TL;DR
This paper presents Asynchronous Expert Parallelism (AEP) and the AMoE system, which make serving Mixture-of-Experts models more efficient, scalable, and cost-effective by reducing synchronization overhead and the impact of load imbalance across experts.
Contribution
Introduces Asynchronous Expert Parallelism (AEP) and AMoE, an asynchronous serving system that improves GPU utilization and scalability for Mixture-of-Experts models.
Findings
Up to 2.7x throughput improvement over baselines
Nearly linear scalability across multiple nodes
Manageable latency increase with higher throughput
Abstract
Mixture-of-Experts (MoE) architectures offer the promise of larger model capacity without the prohibitive costs of fully dense designs. However, in real-world inference serving, load skew across experts often leads to suboptimal device utilization and excessive synchronization overheads. This paper introduces Asynchronous Expert Parallelism (AEP), a new paradigm that decouples layer execution from barrier-style synchronization. By dynamically queuing tokens at each layer (referred to as μ-queuing) and adaptively re-batching them on demand, GPUs avoid waiting for straggling experts and instead continuously process whichever layer is ready. This asynchronous approach mitigates two major inefficiencies in traditional expert-parallel systems: (1) idle GPU time while waiting for the hottest expert, and (2) small-batch executions on colder experts that waste memory bandwidth. We…
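The per-layer queuing and on-demand re-batching described in the abstract can be illustrated with a small sketch. This is not the AMoE implementation: the `LayerQueues` class, its method names, the `max_batch` cap, and the "serve the layer with the most waiting tokens" policy are all assumptions chosen for illustration, and the abstract does not specify how the real scheduler picks a layer.

```python
from collections import deque


class LayerQueues:
    """Per-layer token queues for a single device.

    Hypothetical helper for illustration only; not taken from the paper.
    """

    def __init__(self, num_layers, max_batch):
        self.queues = [deque() for _ in range(num_layers)]
        self.max_batch = max_batch

    def enqueue(self, layer, token_id):
        # A token arriving for `layer` waits in that layer's queue.
        self.queues[layer].append(token_id)

    def pick_layer(self):
        # Assumed policy: serve the layer with the most waiting tokens
        # (ties go to the lowest layer index). Returns None when idle.
        best = max(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        return best if self.queues[best] else None

    def next_batch(self):
        # Re-batch on demand: take up to `max_batch` tokens from the
        # chosen layer without waiting on any other layer or device.
        layer = self.pick_layer()
        if layer is None:
            return None
        q = self.queues[layer]
        batch = [q.popleft() for _ in range(min(self.max_batch, len(q)))]
        return layer, batch


# Usage: six tokens wait at layer 1 and one token waits at layer 0.
queues = LayerQueues(num_layers=3, max_batch=4)
for tok in range(6):
    queues.enqueue(1, tok)
queues.enqueue(0, 100)

served = []
while (item := queues.next_batch()) is not None:
    served.append(item)
```

In this toy run the device drains the busier layer in two batches before serving the lone token at layer 0, so it never blocks on a barrier. A real system would also forward each processed token to the next layer's queue, possibly on another device, which the sketch omits.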
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · IoT and Edge/Fog Computing · Advanced Neural Network Applications
