Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
Jialong Li, Shreyansh Tripathi, Lakshay Rastogi, Yiming Lei, Rui Pan, Yiting Xia

TL;DR
Aurora optimizes mixture-of-experts inference by strategically deploying models and scheduling communication, significantly reducing latency and improving GPU utilization across diverse hardware setups.
Contribution
The paper introduces Aurora, described as the first method to jointly optimize model deployment and communication scheduling for MoE inference, with speedups shown both in theory and in practice.
Findings
Achieves up to 3.54x inference speedup in heterogeneous environments.
Improves GPU utilization by up to 1.5x.
Provides optimal or near-optimal solutions across various GPU configurations.
Abstract
As machine learning models scale in size and complexity, their computational requirements become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by selectively activating relevant experts. Despite this, MoE models are hindered by high communication overhead from all-to-all operations, low GPU utilization due to the synchronous communication constraint, and complications from heterogeneous GPU environments. This paper presents Aurora, which optimizes both model deployment and all-to-all communication scheduling to address these challenges in MoE inference. Aurora achieves minimal communication times by strategically ordering token transmissions in all-to-all communications. It improves GPU utilization by colocating experts from different models on the same device, avoiding the limitations of synchronous all-to-all communication. We analyze Aurora's…
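To illustrate why the order of token transmissions in an all-to-all exchange can matter, here is a minimal sketch. It is not Aurora's scheduling algorithm. It uses a simplified model chosen for illustration: each GPU sends to one peer and receives from one peer at a time, bandwidth is one token per time unit, and the baseline runs in synchronized rounds in which every GPU waits for the slowest transfer. The traffic matrix and the function names `lower_bound` and `rotation_schedule_time` are hypothetical.

```python
from typing import List

# Hypothetical traffic matrix: example_traffic[i][j] is the number of tokens
# GPU i must send to GPU j. Diagonal entries (local tokens) are ignored.
example_traffic: List[List[int]] = [
    [0, 5, 1],
    [5, 0, 1],
    [1, 1, 0],
]


def lower_bound(traffic: List[List[int]]) -> int:
    """Smallest possible completion time under the assumed model.

    No schedule can finish before the busiest sender has sent all of its
    tokens or the busiest receiver has received all of its tokens.
    """
    n = len(traffic)
    send = [sum(traffic[i][j] for j in range(n) if j != i) for i in range(n)]
    recv = [sum(traffic[i][j] for i in range(n) if i != j) for j in range(n)]
    return max(send + recv)


def rotation_schedule_time(traffic: List[List[int]]) -> int:
    """Completion time of a fixed rotation schedule with a barrier per round.

    In round `shift`, GPU i sends to GPU (i + shift) % n. Each round lasts
    as long as its largest transfer because all GPUs wait at the barrier.
    """
    n = len(traffic)
    total = 0
    for shift in range(1, n):
        total += max(traffic[i][(i + shift) % n] for i in range(n))
    return total


print("lower bound:", lower_bound(example_traffic))
print("rotation schedule:", rotation_schedule_time(example_traffic))
```

In this example the two large transfers (GPU 0 to GPU 1, and GPU 1 to GPU 0) fall into different rounds of the fixed rotation, so each round is stretched by one of them even though the transfers do not contend for the same sender or receiver. The gap between the rotation time and the lower bound is the kind of slack that an ordering-aware schedule would try to close.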
Taxonomy
Topics: Distributed Sensor Networks and Detection Algorithms · Human-Automation Interaction and Safety · Context-Aware Activity Recognition Systems
Methods: Mixture of Experts
