MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition

Cheng Yang; Yang Sui; Jinqi Xiao; Lingyi Huang; Yu Gong; Yuanlin Duan; Wenqi Jia; Miao Yin; Yu Cheng; Bo Yuan

arXiv:2411.01016·cs.LG·November 5, 2024

Open Access

TL;DR

This paper presents a two-stage compression approach for Mixture of Experts (MoE) language models, combining inter-expert pruning and intra-expert low-rank decomposition to reduce size and improve efficiency without sacrificing performance.

Contribution

The paper introduces a novel two-stage compression method for MoE models, including layer-wise genetic search and low-rank decomposition, to effectively reduce model size and computational cost.

Findings

01 Significant reduction in model size and inference cost.

02 Maintained performance on zero-shot tasks.

03 Validated on multiple large-scale MoE models.

Abstract

The emergence of Mixture of Experts (MoE) LLMs has significantly advanced the development of language models. Compared to traditional LLMs, MoE LLMs achieve higher performance with considerably fewer activated parameters. Despite this efficiency, their enormous total parameter count still leads to high deployment costs. In this paper, we introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost. First, in the inter-expert pruning stage, we analyze the importance of each layer and propose the Layer-wise Genetic Search and Block-wise KT-Reception Field with a non-uniform pruning ratio to prune individual experts. Second, in the intra-expert decomposition stage, we apply low-rank decomposition to further compress the parameters within the remaining experts. Extensive experiments on…
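The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names `prune_experts` and `low_rank_factors` are hypothetical, the pruning step uses a plain top-k selection over given importance scores as a stand-in for the paper's Layer-wise Genetic Search and Block-wise KT-Reception Field, and the decomposition uses a truncated SVD as one common way to obtain a low-rank factorization (the abstract does not say which factorization is used).

```python
import numpy as np


def prune_experts(experts, scores, keep):
    """Stage 1 (stand-in): keep the `keep` experts with the highest scores.

    Assumption: `scores` are precomputed importance values. The paper
    instead searches over expert combinations; this is only a top-k proxy.
    Returns the kept experts and their original indices, in original order.
    """
    top = np.argsort(scores)[::-1][:keep]
    kept = sorted(top.tolist())
    return [experts[i] for i in kept], kept


def low_rank_factors(weight, rank):
    """Stage 2 (assumed truncated SVD): factor `weight` into A @ B.

    A has shape (out, rank) and B has shape (rank, in), so storage drops
    from out * in to rank * (out + in) values.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # scale each kept column by its singular value
    b = vt[:rank, :]
    return a, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Four toy "experts", each a single 16x32 weight matrix.
    experts = [rng.standard_normal((16, 32)) for _ in range(4)]
    scores = [0.1, 0.9, 0.5, 0.2]  # hypothetical importance scores

    kept_experts, kept_idx = prune_experts(experts, scores, keep=2)
    factors = [low_rank_factors(w, rank=4) for w in kept_experts]

    original = sum(w.size for w in experts)
    compressed = sum(a.size + b.size for a, b in factors)
    print("kept experts:", kept_idx)
    print("parameters:", original, "->", compressed)
```

The factorization only saves parameters when `rank * (out + in) < out * in`, so the choice of rank per expert controls the trade-off between compression and approximation error.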

Taxonomy

Topics: Machine Learning in Healthcare · Expert finding and Q&A systems

Methods: Mixture of Experts · Pruning