Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

Jialong Li; Shreyansh Tripathi; Lakshay Rastogi; Yiming Lei; Rui Pan; Yiting Xia

arXiv:2410.17043·cs.LG·October 23, 2024

Open Access

TL;DR

Aurora optimizes mixture-of-experts inference by strategically deploying models and scheduling communication, significantly reducing latency and improving GPU utilization across diverse hardware setups.

Contribution

The paper introduces Aurora, described as the first method to jointly optimize model deployment and communication scheduling for MoE inference, with speedups shown both in theory and in practice.

Findings

01 Achieves up to 3.54x inference speedup in heterogeneous environments.

02 Improves GPU utilization by up to 1.5x.

03 Provides optimal or near-optimal solutions across various GPU configurations.

Abstract

As machine learning models scale in size and complexity, their computational requirements become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by selectively activating relevant experts. Despite this, MoE models are hindered by high communication overhead from all-to-all operations, low GPU utilization due to the synchronous communication constraint, and complications from heterogeneous GPU environments. This paper presents Aurora, which optimizes both model deployment and all-to-all communication scheduling to address these challenges in MoE inference. Aurora achieves minimal communication times by strategically ordering token transmissions in all-to-all communications. It improves GPU utilization by colocating experts from different models on the same device, avoiding the limitations of synchronous all-to-all communication. We analyze Aurora's…
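To illustrate why the order of token transmissions in an all-to-all exchange affects its completion time, here is a minimal sketch. It is not Aurora's algorithm. It assumes a simplified model that the abstract does not specify: every GPU has the same bandwidth, and each GPU sends to at most one peer and receives from at most one peer at a time. The function names, the traffic matrix, and the units are all hypothetical. The sketch compares a simple lower bound (the busiest sender or receiver) against a fixed "rotation" order in which GPU i sends to GPU (i + k) mod n in round k.

```python
from typing import List

Matrix = List[List[float]]  # traffic[i][j] = tokens GPU i must send to GPU j


def bottleneck_lower_bound(traffic: Matrix, bandwidth: float) -> float:
    """No schedule can finish before the busiest sender or receiver is done."""
    n = len(traffic)
    # Diagonal entries stay on the same GPU and need no network transfer.
    send = [sum(traffic[i][j] for j in range(n) if j != i) for i in range(n)]
    recv = [sum(traffic[i][j] for i in range(n) if i != j) for j in range(n)]
    return max(send + recv) / bandwidth


def rotation_schedule_time(traffic: Matrix, bandwidth: float) -> float:
    """Completion time of a fixed, synchronous rotation order.

    In round k, GPU i sends to GPU (i + k) % n, so receivers never collide.
    Each round lasts as long as its largest transfer.
    """
    n = len(traffic)
    total = 0.0
    for shift in range(1, n):
        round_max = max(traffic[i][(i + shift) % n] for i in range(n))
        total += round_max / bandwidth
    return total


# Hypothetical token counts for three GPUs (assumed values, not from the paper).
skewed = [
    [0, 4, 1],
    [4, 0, 1],
    [1, 1, 0],
]

print(bottleneck_lower_bound(skewed, bandwidth=1.0))   # 5.0
print(rotation_schedule_time(skewed, bandwidth=1.0))   # 8.0
```

With this skewed traffic, the fixed rotation takes 8 time units while the bound is 5. Closing that gap by reordering transfers is the kind of problem communication scheduling targets.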

Taxonomy

Topics: Distributed Sensor Networks and Detection Algorithms · Human-Automation Interaction and Safety · Context-Aware Activity Recognition Systems

Methods: Mixture of Experts