SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu

TL;DR
This paper systematically studies how to compress large-scale mixture-of-experts (MoE) models during pretraining. It identifies pruning and distillation strategies that outperform training from scratch and improve downstream performance.
Contribution
The paper presents a comprehensive study of MoE compression methods and proposes a simple expert merging strategy and multi-token prediction distillation to improve efficiency and performance.
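To make the idea of expert merging concrete, here is a minimal sketch that merges experts by averaging the weights of fixed-size groups. This summary does not describe the paper's actual merging rule, so the uniform averaging, the contiguous grouping, the function name `merge_experts`, and the tensor shapes are all assumptions made for illustration only.

```python
import numpy as np


def merge_experts(expert_weights: np.ndarray, num_target: int) -> np.ndarray:
    """Merge experts by averaging contiguous groups of expert weights.

    ASSUMPTION: uniform averaging over contiguous groups is an illustrative
    choice, not the merging rule used in the paper.

    expert_weights has shape (num_experts, ...), one weight tensor per expert.
    """
    num_experts = expert_weights.shape[0]
    if num_target <= 0 or num_experts % num_target != 0:
        raise ValueError("num_experts must be a positive multiple of num_target")
    group_size = num_experts // num_target
    # Reshape to (num_target, group_size, ...) and average within each group.
    grouped = expert_weights.reshape(
        num_target, group_size, *expert_weights.shape[1:]
    )
    return grouped.mean(axis=1)


# Hypothetical sizes: 8 experts, each with a 4 x 16 weight matrix.
rng = np.random.default_rng(0)
experts = rng.normal(size=(8, 4, 16))
merged = merge_experts(experts, num_target=4)
print(merged.shape)  # (4, 4, 16)
```

A real MoE layer would also need its router adjusted so that tokens previously sent to any expert in a group are routed to the merged expert; that step is omitted here.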
Findings
Pruning pretrained MoE models outperforms training from scratch under the same budget.
Different expert compression methods converge to similar performance after large-scale pretraining.
Progressive pruning schedules outperform one-shot compression, leading to better optimization trajectories.
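The difference between one-shot and progressive compression can be sketched as a schedule of expert counts over training stages. The evenly spaced schedule, the function name `progressive_expert_schedule`, and the counts 64 and 16 below are hypothetical; the summary does not state which schedule or sizes the paper uses.

```python
def progressive_expert_schedule(start: int, target: int, num_stages: int) -> list[int]:
    """Return the number of experts kept after each pruning stage.

    ASSUMPTION: experts are removed in evenly spaced steps; this is an
    illustrative schedule, not the one reported in the paper.
    With num_stages == 1 this reduces to one-shot pruning.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not 0 < target <= start:
        raise ValueError("target must be positive and no larger than start")
    total_removed = start - target
    # Integer arithmetic keeps the counts exact and ends precisely at target.
    return [
        start - (total_removed * stage) // num_stages
        for stage in range(1, num_stages + 1)
    ]


# Hypothetical sizes: compress 64 experts down to 16.
one_shot = progressive_expert_schedule(64, 16, num_stages=1)
progressive = progressive_expert_schedule(64, 16, num_stages=4)
print(one_shot)     # [16]
print(progressive)  # [52, 40, 28, 16]
```

In a progressive setup, the model would continue training between stages so that it can adapt to each smaller configuration before the next reduction.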
Abstract
Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual…
