EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

Yulei Qian; Fengcun Li; Xiangyang Ji; Xiaoyu Zhao; Jianchao Tan; Kefeng Zhang; Xunliang Cai

arXiv:2410.12247·cs.CL·January 6, 2025

TL;DR

EPS-MoE introduces a dynamic expert pipeline scheduler that adaptively overlaps computation and communication in MoE models, improving prefill throughput for large language model inference by up to 52.4%.

Contribution

The paper presents a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes by adaptively optimizing computation and communication overlap.

Findings

01. Up to 52.4% increase in prefill throughput.

02. Accelerated the DeepSeekV2 model from 100K to 120K tokens/sec.

03. Demonstrated effectiveness on large language models.

Abstract

The Mixture-of-Experts (MoE) model has emerged as a prominent architecture in the field of Large Language Models (LLMs), providing a better balance between model performance and computational efficiency. However, its General Matrix Multiply (GEMM) operations and large parameter counts introduce challenges in computational efficiency and communication overhead, which become throughput bottlenecks during inference. Applying a single parallelism strategy such as expert parallelism (EP), data parallelism (DP), or tensor parallelism (TP), or a straightforward combination of them, to MoE usually achieves sub-optimal inference throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our approach optimizes the computation of MoE FeedForward Network (FFN) modules by dynamically selecting the best kernel implementation of GroupGemm and DenseGemm for different loads and adaptively…
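
The load-dependent choice between GroupGemm and DenseGemm mentioned in the abstract can be pictured with a small sketch. This is not the authors' scheduler: the function names, the cost models, and every constant below are hypothetical and chosen only for illustration. It assumes each kernel has an estimated latency as a function of the number of tokens routed to an expert, and simply picks the cheaper one per expert.

```python
from typing import Callable, Dict, List

# Hypothetical latency models (arbitrary time units) as a function of the
# number of tokens routed to one expert. The constants are made up for
# illustration and are not measurements from the paper.
TOY_COST_MODELS: Dict[str, Callable[[int], float]] = {
    "group_gemm": lambda tokens: 10.0 + 0.05 * tokens,
    "dense_gemm": lambda tokens: 30.0 + 0.02 * tokens,
}


def select_kernel(
    tokens: int,
    cost_models: Dict[str, Callable[[int], float]] = TOY_COST_MODELS,
) -> str:
    """Return the name of the kernel with the lowest estimated latency."""
    return min(cost_models, key=lambda name: cost_models[name](tokens))


def plan_layer(
    tokens_per_expert: List[int],
    cost_models: Dict[str, Callable[[int], float]] = TOY_COST_MODELS,
) -> Dict[int, str]:
    """Pick a kernel for each expert; experts with no tokens are skipped."""
    return {
        expert: select_kernel(tokens, cost_models)
        for expert, tokens in enumerate(tokens_per_expert)
        if tokens > 0
    }


if __name__ == "__main__":
    # Expert 1 receives no tokens, so it does not appear in the plan.
    print(plan_layer([8, 0, 4096, 512]))
```

In a real system the cost models would presumably come from profiling the actual kernels on the target hardware rather than from fixed formulas; the sketch only shows the shape of a per-load selection step.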

Taxonomy

Topics: Scientific Computing and Data Management · Service-Oriented Architecture and Web Services · Distributed and Parallel Computing Systems

Methods: Dense Connections · Feedforward Network · Mixture of Experts