Post-Trained MoE Can Skip Half Experts via Self-Distillation
Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

TL;DR
This paper presents ZEDA, a low-cost method to convert fully trained static MoE models into efficient dynamic ones, significantly reducing expert computation with minimal accuracy loss.
Contribution
Introduces ZEDA, a novel self-distillation framework that enables post-training conversion of static MoE models into dynamic models with expert skipping capabilities.
Findings
Over 50% expert FLOPs eliminated with minimal accuracy loss
Outperforms previous dynamic MoE baselines by 6.1 and 4.0 points
Achieves approximately 1.20× inference speedup
Abstract
Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the number of activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such conversion would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a…
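The idea of parameter-free zero-output experts can be illustrated with a small sketch. This is not the authors' implementation: the function name `moe_layer_with_zero_experts`, the softmax-then-top-k routing rule, and the choice to weight expert outputs by unnormalized router probabilities are all assumptions made for illustration. The sketch only shows the mechanism the abstract describes, namely that a top-k slot assigned to a zero expert adds nothing to the output and runs no expert computation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer_with_zero_experts(x, router_logits, real_experts,
                                num_zero_experts, top_k):
    """Route one token through top_k slots drawn from real + zero experts.

    Hypothetical sketch: router_logits has one entry per real expert,
    followed by one entry per zero-output expert.
    Returns (output vector, number of real experts actually executed).
    """
    num_real = len(real_experts)
    assert len(router_logits) == num_real + num_zero_experts
    probs = softmax(router_logits)

    # Select the top_k slots by router probability (assumed routing rule).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]

    output = [0.0] * len(x)
    executed = 0
    for i in chosen:
        if i >= num_real:
            # Zero-output expert: contributes nothing and runs no computation.
            continue
        y = real_experts[i](x)
        output = [o + probs[i] * v for o, v in zip(output, y)]
        executed += 1
    return output, executed


if __name__ == "__main__":
    # Two toy "experts" standing in for feed-forward blocks.
    experts = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]
    token = [1.0, 2.0]

    # "Hard" token: the router prefers both real experts.
    _, n_hard = moe_layer_with_zero_experts(token, [5.0, 4.0, -5.0, -5.0],
                                            experts, 2, 2)
    # "Easy" token: one of the two slots goes to a zero expert.
    _, n_easy = moe_layer_with_zero_experts(token, [5.0, -5.0, 4.0, -5.0],
                                            experts, 2, 2)
    print(n_hard, n_easy)
```

In this toy setup the "easy" token executes one real expert instead of two, which is the kind of per-token saving that expert skipping targets. How the router learns to send tokens to zero experts is the role of the self-distillation stages and is not modeled here.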
