MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition
Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, Bo Yuan

TL;DR
This paper presents a two-stage compression approach for Mixture of Experts (MoE) language models, combining inter-expert pruning and intra-expert low-rank decomposition to reduce size and improve efficiency without sacrificing performance.
Contribution
The paper introduces a novel two-stage compression method for MoE models, including layer-wise genetic search and low-rank decomposition, to effectively reduce model size and computational cost.
Findings
Significant reduction in model size and inference cost.
Maintained performance on zero-shot tasks.
Validated on multiple large-scale MoE models.
Abstract
The emergence of Mixture of Experts (MoE) LLMs has significantly advanced the development of language models. Compared to traditional LLMs, MoE LLMs achieve higher performance with considerably fewer activated parameters. Despite this efficiency, their enormous parameter size still leads to high deployment costs. In this paper, we introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost. First, in the inter-expert pruning stage, we analyze the importance of each layer and propose the Layer-wise Genetic Search and Block-wise KT-Reception Field with the non-uniform pruning ratio to prune the individual expert. Second, in the intra-expert decomposition stage, we apply the low-rank decomposition to further compress the parameters within the remaining experts. Extensive experiments on…
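To make the intra-expert decomposition stage concrete, here is a minimal sketch of low-rank compression for a single weight matrix. It assumes plain truncated SVD with a fixed rank; the abstract does not say which decomposition or rank-selection rule the authors use, so the function name `low_rank_factors`, the toy matrix shape, and the chosen rank are all illustrative assumptions rather than the paper's method.

```python
import numpy as np

def low_rank_factors(weight, rank):
    """Return (a, b) such that a @ b is the best rank-`rank` approximation
    of `weight` in the Frobenius norm (truncated SVD).
    Hypothetical helper for illustration, not the paper's procedure."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (out_dim, rank), singular values folded in
    b = vt[:rank, :]            # shape (rank, in_dim)
    return a, b

# Toy stand-in for one expert weight matrix: nearly rank 8, plus small noise.
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 128))
weight += 0.01 * rng.standard_normal((64, 128))

a, b = low_rank_factors(weight, rank=8)

# Relative reconstruction error and parameter counts before and after.
rel_err = np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)
params_before = weight.size      # 64 * 128
params_after = a.size + b.size   # 64 * 8 + 8 * 128
```

Replacing an `out_dim x in_dim` matrix with two factors stores `rank * (out_dim + in_dim)` values instead of `out_dim * in_dim`, so the saving grows as the rank shrinks, at the cost of a larger approximation error when the matrix is not close to low rank.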
Taxonomy
Topics: Machine Learning in Healthcare · Expert finding and Q&A systems
Methods: Mixture of Experts · Pruning
