TL;DR
This paper introduces SlimMoE, a multi-stage compression framework that significantly reduces large MoE models' size and memory requirements through expert slimming and distillation, enabling efficient deployment without substantial performance loss.
Contribution
The paper presents a novel structured compression method for large MoE models using expert slimming and staged distillation, reducing parameters while maintaining performance.
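To make "expert slimming" concrete, here is a minimal sketch of one plausible reading: shrink each expert's hidden width by keeping only its highest-scoring neurons. The function name `slim_expert`, the toy weights, and the importance score (product of incoming and outgoing weight norms) are all assumptions for illustration. This summary does not state the paper's actual pruning criterion, and the distillation stages that follow each slimming step are omitted here.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain Python list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def slim_expert(w_in, w_out, keep):
    """Keep the `keep` highest-scoring hidden neurons of one expert.

    w_in[j]  : incoming weight vector of hidden neuron j
    w_out[j] : outgoing weight vector of hidden neuron j
    The score below is a hypothetical stand-in, not the paper's criterion.
    """
    if len(w_in) != len(w_out):
        raise ValueError("w_in and w_out must describe the same neurons")
    if not 0 < keep <= len(w_in):
        raise ValueError("keep must be between 1 and the number of neurons")

    # Assumed importance: product of incoming and outgoing weight norms.
    scores = [l2_norm(a) * l2_norm(b) for a, b in zip(w_in, w_out)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order

    return [w_in[j] for j in kept], [w_out[j] for j in kept], kept


# Toy expert with 4 hidden neurons and a model dimension of 2.
w_in = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.05], [1.0, -1.0]]
w_out = [[0.2, 0.1], [1.0, 0.5], [0.01, 0.0], [0.5, 0.5]]

slim_in, slim_out, kept = slim_expert(w_in, w_out, keep=2)
print("kept neurons:", kept)
```

In a staged setup, a step like this would be applied with a progressively smaller `keep`, with knowledge distillation from the original model between steps so that each intermediate model recovers before being slimmed further.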
Findings
The compressed models outperform other models of similar size.
They achieve high performance with less training data and fewer resources.
The resulting models are suitable for resource-limited environments.
Abstract
The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, their enormous memory requirements make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants without incurring the prohibitive costs of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation common in one-shot pruning approaches. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and…
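The parameter counts quoted in the abstract can be turned into compression figures with a short calculation. The sketch below uses only the numbers given for Phi 3.5-MoE and Phi-mini-MoE; the dictionary and function names are illustrative.

```python
# Parameter counts in billions, taken from the abstract.
PHI_3_5_MOE = {"total": 41.9, "activated": 6.6}
PHI_MINI_MOE = {"total": 7.6, "activated": 2.4}


def compression_summary(original, compressed):
    """Return the size ratio and percent reduction for each parameter count."""
    summary = {}
    for key in original:
        ratio = original[key] / compressed[key]
        reduction_pct = 100.0 * (1.0 - compressed[key] / original[key])
        summary[key] = {
            "ratio": round(ratio, 2),
            "reduction_pct": round(reduction_pct, 1),
        }
    return summary


summary = compression_summary(PHI_3_5_MOE, PHI_MINI_MOE)
for key, stats in summary.items():
    print(f"{key}: {stats['ratio']}x smaller, "
          f"{stats['reduction_pct']}% fewer parameters")
```

By this arithmetic, total parameters shrink by roughly 5.5x (about 82%) and activated parameters by 2.75x (about 64%), so the total memory footprint shrinks more than the per-token compute.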
Code & Models
- 🤗 microsoft/Phi-mini-MoE-instruct (model) · 100k dl · ♡ 32
- 🤗 microsoft/Phi-tiny-MoE-instruct (model) · 548k dl · ♡ 35
- 🤗 gabriellarson/Phi-mini-MoE-instruct-GGUF (model) · 2.7k dl · ♡ 7
- 🤗 FriendliAI/Phi-mini-MoE-instruct (model) · 238 dl · ♡ 1
- 🤗 FriendliAI/Phi-tiny-MoE-instruct (model) · 32 dl
