Post-Trained MoE Can Skip Half Experts via Self-Distillation

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

arXiv:2605.18643 · cs.LG · May 19, 2026

1 Repo 2 Models 2 Datasets

TL;DR

This paper presents ZEDA, a low-cost method that converts fully trained static MoE models into dynamic ones, eliminating over 50% of expert FLOPs with minimal accuracy loss.

Contribution

Introduces ZEDA, a novel self-distillation framework that enables post-training conversion of static MoE models into dynamic models with expert skipping capabilities.

Findings

01 Over 50% of expert FLOPs eliminated with minimal accuracy loss

02 Outperforms previous dynamic MoE baselines by 6.1 and 4.0 points

03 Achieves approximately 1.20× inference speedup

Abstract

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the number of activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such conversion would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a…
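To make the zero-output expert idea concrete, here is a minimal pure-Python sketch of routing over real experts plus parameter-free zero experts. It is an illustration under stated assumptions, not the paper's implementation: the names `route_token` and `moe_forward`, the convention that zero experts occupy the highest router slots, and the use of unnormalized softmax probabilities as gate weights are all hypothetical choices made for this example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of router logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits, num_real, top_k):
    """Select the top_k router slots among real and zero experts.

    Assumption: slots with index >= num_real are zero-output experts.
    They contribute nothing to the output, so they are never executed.
    Returns the real expert indices to run and their gate weights.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    real = [i for i in chosen if i < num_real]
    weights = [probs[i] for i in real]
    return real, weights

def moe_forward(x, experts, logits, top_k):
    """Run one token through the layer; return (output, real experts executed)."""
    real, weights = route_token(logits, len(experts), top_k)
    out = [0.0] * len(x)
    for i, w in zip(real, weights):
        y = experts[i](x)  # only real experts cost compute
        out = [o + w * v for o, v in zip(out, y)]
    return out, len(real)

# Two toy real experts; the router has two extra slots for zero experts.
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]
output, executed = moe_forward([1.0, 2.0], experts, [0.0, 5.0, 4.0, -1.0], top_k=2)
print(executed)  # number of real experts that actually ran for this token
```

In this sketch, a token whose top-k slots include a zero expert runs fewer real experts, which is how input-dependent skipping can arise without changing the top-k routing rule itself.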

Code & Models

Repositories

tsinghuac3i/ZEDA (GitHub)
