SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation

Zichong Li; Chen Liang; Zixuan Zhang; Ilgee Hong; Young Jin Kim; Weizhu Chen; Tuo Zhao

arXiv:2506.18349·cs.LG·June 24, 2025

TL;DR

This paper introduces SlimMoE, a multi-stage compression framework that significantly reduces large MoE models' size and memory requirements through expert slimming and distillation, enabling efficient deployment without substantial performance loss.

Contribution

The paper presents a novel structured compression method for large MoE models using expert slimming and staged distillation, reducing parameters while maintaining performance.
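To make "expert slimming" concrete, here is a minimal sketch of one way to read it: every expert is kept, but each expert's hidden layer is narrowed by dropping its least important neurons. The two-matrix ReLU expert, the weight-norm importance score, and the names `slim_expert` and `expert_forward` are illustrative assumptions, not the paper's actual criterion or code. The distillation stages that recover accuracy after slimming are not shown.

```python
import numpy as np

def expert_forward(x, w_in, w_out):
    """Simplified expert FFN (assumed form): ReLU(x @ w_in) @ w_out."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def slim_expert(w_in, w_out, keep):
    """Keep only the `keep` hidden neurons with the highest importance score.

    The score used here (product of each neuron's input and output weight
    norms) is a stand-in assumption, not the criterion used in the paper.
    """
    score = np.linalg.norm(w_in, axis=0) * np.linalg.norm(w_out, axis=1)
    kept = np.sort(np.argsort(score)[-keep:])  # indices of retained neurons
    return w_in[:, kept], w_out[kept, :], kept

# Toy MoE layer: 4 experts, each with a hidden width of 32.
rng = np.random.default_rng(0)
d_model, d_ff, num_experts, keep = 8, 32, 4, 8
experts = [
    (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
    for _ in range(num_experts)
]

# Slim every expert; the number of experts is unchanged.
slimmed = [slim_expert(w_in, w_out, keep)[:2] for w_in, w_out in experts]

params_before = sum(a.size + b.size for a, b in experts)
params_after = sum(a.size + b.size for a, b in slimmed)
print(f"experts: {len(experts)} -> {len(slimmed)}")
print(f"expert parameters: {params_before} -> {params_after}")
```

Because whole neurons are removed, the slimmed weights stay dense and smaller, which is what makes the compression structured: no sparse kernels or masks are needed at inference time.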

Findings

01. Compressed models outperform similar-sized models.

02. Achieved high performance with less training data and resources.

03. Models are suitable for resource-limited environments.

Abstract

The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, the enormous memory requirements of MoE models make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants without incurring the prohibitive costs of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation common in one-shot pruning approaches. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and…
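As a quick check on the scale of the compression, the parameter counts quoted in the abstract can be turned into reduction factors. The sketch below uses only those four numbers; the variable and function names are illustrative.

```python
# Parameter counts in billions, as quoted in the abstract.
PHI_35_MOE = {"total": 41.9, "activated": 6.6}
PHI_MINI_MOE = {"total": 7.6, "activated": 2.4}

def reduction_factor(before: float, after: float) -> float:
    """How many times smaller the compressed model is."""
    return before / after

total_factor = reduction_factor(PHI_35_MOE["total"], PHI_MINI_MOE["total"])
activated_factor = reduction_factor(
    PHI_35_MOE["activated"], PHI_MINI_MOE["activated"]
)

print(f"total parameters:     {total_factor:.2f}x smaller")
print(f"activated parameters: {activated_factor:.2f}x smaller")
```

Total parameters shrink by about 5.5x while activated parameters shrink by 2.75x, so the compression cuts the memory footprint more aggressively than it cuts per-token compute.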
