Faster MoE LLM Inference for Extremely Large Models

Haoqi Yang; Luohe Shi; Qiwei Li; Zuchao Li; Ping Wang; Bo Du; Mengjia Shen; Hai Zhao

arXiv:2505.03531·cs.CL·May 7, 2025

TL;DR

This paper explores optimizing inference efficiency for fine-grained Sparse Mixture of Experts (MoE) large language models, demonstrating methods to improve throughput with minimal performance loss in ultra-large-scale models.

Contribution

It introduces optimization techniques tailored for fine-grained MoE models, highlighting how reducing activated experts can enhance efficiency with limited performance impact.

Findings

01. Reducing activated experts improves throughput by at least 10%.

02. Reducing the total number of experts yields only limited efficiency gains while causing severe performance degradation.

03. Optimization potential remains significant for MoE inference in large-scale models.

Abstract

Sparse Mixture of Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE architectures. With the emergence of DeepSeek models, fine-grained MoE models are gaining popularity, yet research on them remains limited. We therefore examine how their efficiency behaves under different service loads. Additionally, fine-grained models allow deployers to reduce the number of routed experts, both the activated count and the total count, raising the question of how this reduction affects the trade-off between MoE efficiency and performance. Our findings indicate that while deploying MoE models presents greater challenges, it also offers significant optimization opportunities. Reducing the number of activated experts can lead to substantial…
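
The abstract's idea of reducing the activated expert count can be illustrated with a minimal sketch of generic top-k routing. This is not the paper's implementation: the function names `route_top_k` and `routed_expert_calls`, the eight example logits, and the choice to renormalize softmax weights over the selected experts are all assumptions made for illustration.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring routed experts for one token and
    renormalize their softmax weights so they sum to 1.
    (Hypothetical helper; a generic top-k gate, not the paper's code.)"""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (max-shifted for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def routed_expert_calls(num_tokens, k):
    """Routed-expert FFN evaluations for a batch: one per (token, expert) pair."""
    return num_tokens * k


# Eight hypothetical router logits for a single token.
logits = [0.1, 2.0, -1.0, 1.5, 0.3, 0.9, -0.4, 1.1]
full = route_top_k(logits, k=4)     # original activated count (assumed)
reduced = route_top_k(logits, k=2)  # reduced activated count (assumed)
```

With these example logits, lowering k from 4 to 2 keeps only the two highest-scoring experts and halves the number of routed-expert evaluations per token. Whether model quality holds up at the smaller k is the empirical question the paper studies.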

Taxonomy

Topics: Advanced Data Processing Techniques

Methods: Mixture of Experts