SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Shengkun Tang; Zekun Wang; Bo Zheng; Liangyu Wang; Rui Men; Siqi Zhang; Xiulong Yuan; Zihan Qiu; Zhiqiang Shen; Dayiheng Liu

arXiv:2605.08738 · cs.LG · May 19, 2026

TL;DR

This paper systematically investigates techniques for compressing large-scale mixture-of-experts models during pretraining, demonstrating effective pruning and distillation strategies that outperform training from scratch and improve downstream performance.

Contribution

The paper presents a comprehensive study of MoE model compression methods and proposes a simple expert merging strategy and multi-token prediction distillation to improve efficiency and performance.
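
To make the idea of expert merging concrete, here is a minimal sketch. It is not the paper's method: this page does not describe the merging rule, so the sketch assumes the simplest possible variant, a (optionally weighted) parameter average within hand-picked groups of experts. The function name `merge_experts`, the grouping, and the weights are all hypothetical.

```python
def merge_experts(experts, groups, weights=None):
    """Merge experts by averaging their parameters within each group.

    ASSUMPTION: plain (weighted) averaging is used only for illustration;
    the paper's actual merging strategy is not specified on this page.

    experts: list of flat parameter lists, one per expert, all the same length.
    groups:  list of lists of expert indices; each group yields one merged expert.
    weights: optional per-expert weights (for example routing frequency);
             uniform weights are used when omitted.
    """
    if weights is None:
        weights = [1.0] * len(experts)
    merged = []
    for group in groups:
        total = sum(weights[i] for i in group)
        dim = len(experts[group[0]])
        merged.append([
            sum(weights[i] * experts[i][d] for i in group) / total
            for d in range(dim)
        ])
    return merged


# Toy example: four experts with two parameters each, merged pairwise into two.
experts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
merged = merge_experts(experts, groups=[[0, 1], [2, 3]])
print(merged)  # [[2.0, 3.0], [6.0, 7.0]]
```

In a real MoE layer the router would also need to be adjusted so that it addresses the reduced set of experts; that step is omitted here.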

Findings

01

Pruning pretrained MoE models outperforms training from scratch under the same budget.

02

Different expert compression methods converge to similar performance after large-scale pretraining.

03

Progressive pruning schedules outperform one-shot compression, leading to better optimization trajectories.
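
The difference between one-shot and progressive compression in finding 03 can be illustrated with a small schedule sketch. It assumes an evenly spaced reduction in the number of experts per stage; the expert counts, the number of stages, and the function name `expert_count_schedule` are hypothetical and are not taken from the paper.

```python
def expert_count_schedule(start, target, num_stages):
    """Return the expert count after each pruning stage.

    ASSUMPTION: the count shrinks by an equal amount at every stage.
    With num_stages == 1 this reduces to one-shot compression.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if target > start:
        raise ValueError("target must not exceed start")
    step = (start - target) / num_stages
    return [round(start - step * (k + 1)) for k in range(num_stages)]


# Hypothetical sizes: compress 128 experts down to 64.
one_shot = expert_count_schedule(128, 64, 1)
progressive = expert_count_schedule(128, 64, 4)
print(one_shot)     # [64]
print(progressive)  # [112, 96, 80, 64]
```

Under a progressive schedule, training would continue between stages so the model can adapt to each smaller configuration before the next reduction.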

Abstract

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual…
