EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai

TL;DR
EPS-MoE introduces a dynamic expert pipeline scheduler that optimizes computation and communication in MoE models, significantly improving inference throughput for large language models.
Contribution
The paper presents a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes by adaptively optimizing computation and communication overlap.
Findings
Up to 52.4% increase in prefill throughput.
Accelerated DeepSeekV2 model from 100K to 120K tokens/sec.
Demonstrated effectiveness on large language models.
Abstract
The Mixture-of-Experts (MoE) model has emerged as a prominent architecture in the field of Large Language Models (LLMs), providing a better balance between model performance and computational efficiency. However, the General Matrix Multiply (GEMM) operations and large parameters introduce challenges related to computational efficiency and communication overhead, which become throughput bottlenecks during inference. Applying a single parallelism strategy such as EP, DP, or TP, or a straightforward combination of them, to MoE usually achieves sub-optimal inference throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our approach optimizes the computation of MoE FeedForward Network (FFN) modules by dynamically selecting the best kernel implementation of GroupGemm and DenseGemm for different loads and adaptively…
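The abstract mentions choosing between GroupGemm and DenseGemm kernels depending on load, but this excerpt does not describe the selection rule. The sketch below shows one possible way to do such a selection: time each candidate once per load bucket and cache the winner. It is an illustration under stated assumptions, not the paper's method. `KernelSelector`, `SimClock`, and `toy_kernels` are hypothetical names, and the cost formulas in the stand-in kernels are arbitrary numbers chosen only to make the example deterministic.

```python
import time
from typing import Callable, Dict, Sequence

# A kernel takes the number of tokens routed to each expert.
Kernel = Callable[[Sequence[int]], None]


class KernelSelector:
    """Time each candidate kernel once per load bucket and cache the fastest."""

    def __init__(self, kernels: Dict[str, Kernel],
                 timer: Callable[[], float] = time.perf_counter):
        self.kernels = kernels
        self.timer = timer
        self.cache: Dict[int, str] = {}  # load bucket -> kernel name

    @staticmethod
    def bucket(tokens_per_expert: Sequence[int]) -> int:
        # Assumption: loads with a similar peak behave alike, so bucket
        # by the bit length (power-of-two range) of the busiest expert.
        return max(tokens_per_expert, default=0).bit_length()

    def select(self, tokens_per_expert: Sequence[int]) -> str:
        key = self.bucket(tokens_per_expert)
        if key not in self.cache:
            timings = {}
            for name, kernel in self.kernels.items():
                start = self.timer()
                kernel(tokens_per_expert)
                timings[name] = self.timer() - start
            self.cache[key] = min(timings, key=timings.get)
        return self.cache[key]


class SimClock:
    """Simulated clock so the example does not depend on real hardware."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t


def toy_kernels(clock: SimClock) -> Dict[str, Kernel]:
    # Stand-in kernels with made-up costs (NOT measurements from the paper).
    def group_gemm(loads: Sequence[int]) -> None:
        clock.t += 8.0 + 0.1 * sum(loads)

    def dense_gemm(loads: Sequence[int]) -> None:
        active = sum(1 for n in loads if n > 0)
        clock.t += 1.0 * active + 0.2 * sum(loads)

    return {"group_gemm": group_gemm, "dense_gemm": dense_gemm}


clock = SimClock()
selector = KernelSelector(toy_kernels(clock), timer=clock.now)
light_choice = selector.select([1, 1, 0, 0])  # few tokens per expert
heavy_choice = selector.select([64] * 8)      # many tokens per expert
```

In a real system the stand-in kernels would be replaced by actual GEMM launches and the simulated clock by device-side timing. The bucketing rule is also a free design choice; anything that groups similar loads together would serve.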
Taxonomy
Topics: Scientific Computing and Data Management · Service-Oriented Architecture and Web Services · Distributed and Parallel Computing Systems
Methods: Dense Connections · Feedforward Network · Mixture of Experts
