Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

Enshu Liu; Junyi Zhu; Zinan Lin; Xuefei Ning; Matthew B. Blaschko; Shengen Yan; Guohao Dai; Huazhong Yang; Yu Wang

arXiv:2407.00945 · cs.LG · July 2, 2024 · 2 cites

TL;DR

This paper introduces EEP, a gradient-free evolutionary pruning method for sparse mixture-of-experts language models that reduces parameters and active experts, leading to faster inference and improved task performance without fine-tuning.

Contribution

The paper presents EEP, a novel gradient-free expert pruning strategy that enhances sparsity and performance of SMoE models, enabling more efficient deployment.

Findings

01. Pruning up to 75% of experts reduces parameters with minimal performance loss.

02. Pruning half of the experts significantly improves SQuAD accuracy from 53.4% to 75.4%.

03. Fewer experts can lead to better task-specific performance without fine-tuning.

Abstract

The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per token, thereby achieving faster inference while maintaining performance. However, SMoE models still face limitations in broader deployment due to their large parameter counts and significant GPU memory requirements. In this work, we introduce a gradient-free evolutionary strategy named EEP (Efficient Expert Pruning) to enhance the pruning of experts in SMoE models. EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream…
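To make the idea of a gradient-free, inference-only search concrete, here is a minimal sketch of a generic evolutionary search over which experts to keep. It is not the EEP algorithm from the paper; the function name `evolve_expert_mask`, the swap mutation, the population settings, and the toy fitness scores are all illustrative assumptions. In a real setting, `fitness` would run the pruned model on a small task set and return its accuracy.

```python
import random


def evolve_expert_mask(fitness, num_experts, keep, pop_size=16, generations=30, seed=0):
    """Search for a set of `keep` experts that maximizes `fitness` (hypothetical helper).

    `fitness` is treated as a black box: it only needs forward passes
    (inference) of the pruned model, so no gradients are computed.
    """
    rng = random.Random(seed)

    def random_mask():
        # A candidate is the set of expert indices that survive pruning.
        return frozenset(rng.sample(range(num_experts), keep))

    def mutate(mask):
        # Swap one kept expert for one pruned expert; the budget stays fixed.
        kept = sorted(mask)
        pruned = [e for e in range(num_experts) if e not in mask]
        if not kept or not pruned:
            return mask
        child = set(mask)
        child.remove(rng.choice(kept))
        child.add(rng.choice(pruned))
        return frozenset(child)

    population = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank candidates by task score and keep the top quarter (elitism).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(1, pop_size // 4)]
        children = [mutate(rng.choice(parents)) for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)


# Toy stand-in for "accuracy of the pruned model"; the values are made up.
# Negative entries model experts whose removal helps the task.
TOY_SCORES = [0.9, -0.4, 0.7, 0.1, -0.2, 0.8, 0.3, -0.6]


def toy_fitness(mask):
    return sum(TOY_SCORES[e] for e in mask)


best = evolve_expert_mask(toy_fitness, num_experts=8, keep=4)
print(sorted(best), toy_fitness(best))
```

Because the best candidates are carried over unchanged each generation, the top score never decreases. The toy scores also illustrate how dropping unhelpful experts can raise task performance, which is the effect the abstract reports.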

Code & Models

Repositories

imagination-research/eep (jax, Official)

Taxonomy

Topics: Expert finding and Q&A systems · Topic Modeling · Speech and dialogue systems

Methods: Pruning