Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

Xudong Lu; Qi Liu; Yuhui Xu; Aojun Zhou; Siyuan Huang; Bo Zhang; Junchi Yan; Hongsheng Li

arXiv:2402.14800·cs.CL·May 31, 2024·1 cites

Open Access · 1 Repo · 1 Video

TL;DR

This paper introduces post-training expert pruning and skipping techniques for Mixture-of-Experts (MoE) large language models. The techniques target deployment efficiency: they reduce model size and increase inference speed while maintaining performance across tasks.

Contribution

The paper presents what the authors describe as the first post-training methods for task-agnostic and task-specific expert pruning and skipping in MoE LLMs. The methods are plug-and-play and do not rely on specifically designed hardware.

Findings

01 Model size reduced significantly

02 Inference speed increased substantially

03 Performance maintained across various tasks

Abstract

A pivotal advancement in the progress of large language models (LLMs) is the emergence of Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer parameters, but they remain hard to deploy because of their immense parameter sizes. Unlike previous weight pruning methods that rely on specifically designed hardware, this paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques. Specifically, we propose, for the first time to the best of our knowledge, post-training approaches for task-agnostic and task-specific expert pruning and skipping of MoE LLMs, tailored to improve deployment efficiency while maintaining model performance across a wide range of tasks. Extensive experiments show that our proposed methods can simultaneously reduce model sizes and…
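
To make the two ideas named in the abstract concrete, here is a minimal sketch of what expert-level sparsification could look like at the routing step of a single MoE layer. It is an illustration under assumptions, not the paper's procedure: the abstract does not say how experts are chosen for pruning or when they are skipped. The function `route`, the parameters `kept_experts` and `skip_ratio`, and the rule "skip an expert whose weight falls below a fraction of the top-1 weight" are all hypothetical choices made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, kept_experts, top_k=2, skip_ratio=0.0):
    """Pick the experts that process one token.

    router_logits: one router score per original expert.
    kept_experts:  indices that survive pruning (hypothetical input; how
                   they are selected is outside the scope of this sketch).
    skip_ratio:    assumed skipping rule; a selected expert is skipped when
                   its weight is below skip_ratio * (top-1 weight).
    Returns a list of (expert_index, normalized_weight) pairs.
    """
    # Expert pruning (static): removed experts never receive tokens,
    # so the softmax is taken over the kept experts only.
    kept = sorted(kept_experts)
    probs = softmax([router_logits[i] for i in kept])
    ranked = sorted(zip(kept, probs), key=lambda p: p[1], reverse=True)[:top_k]

    # Expert skipping (dynamic, per token): drop low-weight selections.
    top_weight = ranked[0][1]
    active = [(i, w) for i, w in ranked if w >= skip_ratio * top_weight]

    # Renormalize so the remaining weights sum to 1.
    total = sum(w for _, w in active)
    return [(i, w / total) for i, w in active]


logits = [2.0, 1.0, 0.5, -1.0]
pruned_only = route(logits, kept_experts={0, 2, 3}, top_k=2)
pruned_and_skipped = route(logits, kept_experts={0, 2, 3}, top_k=2, skip_ratio=0.5)
```

In this sketch, pruning shrinks the stored model (expert 1 is gone for every token), while skipping saves compute only for tokens whose second-choice expert carries little weight. That split mirrors the abstract's separate claims about model size and inference speed.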

Code & Models

Repositories

lucky-lance/expert_sparsity
PyTorch · Official

Videos

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models · underline

Taxonomy

Topics: Expert finding and Q&A systems · Recommender Systems and Techniques · Speech and dialogue systems

Methods: Pruning