Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression

Sieun Hyeon; Jaeyoung Do

arXiv:2603.02217·cs.LG·March 4, 2026

TL;DR

This paper investigates the importance of router calibration in retraining-free MoE compression, proposing a lightweight router distillation method to improve performance without updating expert parameters.

Contribution

The paper introduces Router Knowledge Distillation (Router KD), a lightweight router calibration technique that improves retraining-free MoE compression by correcting router-expert mismatch while leaving expert parameters untouched.

Findings

1. Router KD effectively recovers performance across compression paradigms.
2. Fine-grained MoEs benefit more from router calibration than coarse-grained ones.
3. Persistent degradation is mainly due to router-expert mismatch, not expert parameters.

Abstract

Mixture-of-Experts (MoE) models scale capacity efficiently, but their massive parameter footprint creates a deployment-time memory bottleneck. We organize retraining-free MoE compression into three paradigms - Expert Pruning, Expert Editing, and Expert Merging - and show that persistent post-compression degradation largely stems from a neglected factor: router-expert mismatch when experts are changed but the router is left untouched. We argue that effective retraining-free compression should avoid updating expert parameters while allowing lightweight router calibration. To this end, we propose Router Knowledge Distillation (Router KD), which updates only a tiny fraction of parameters (the router) by distilling the original model's next-token distribution on unlabeled calibration data. Experiments across representative methods in all three paradigms demonstrate consistent performance…
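The idea of calibrating only the router can be illustrated with a toy sketch. It is not the authors' implementation, and it makes several simplifying assumptions that do not come from the paper: routing is dense and soft (no top-k selection), there is a single MoE layer that maps an input vector directly to vocabulary logits, compression is modelled by simply dropping the last expert, and the gradient is estimated with finite differences so that the example needs only the standard library. All names (`next_token_probs`, `kd_loss`, `router_kd`) and sizes are hypothetical.

```python
import math
import random

random.seed(0)
DIM, VOCAB, N_EXPERTS = 4, 5, 4  # toy sizes, chosen only for illustration


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    total = sum(es)
    return [e / total for e in es]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def next_token_probs(x, router, experts):
    """Softly route x over the experts and return a next-token distribution."""
    gate = softmax(matvec(router, x))        # one routing weight per expert
    outs = [matvec(E, x) for E in experts]   # per-expert vocabulary logits
    logits = [sum(g * o[v] for g, o in zip(gate, outs)) for v in range(VOCAB)]
    return softmax(logits)


def kd_loss(router, experts, calib, teacher_probs):
    """Mean KL(teacher || student) over the unlabeled calibration inputs."""
    total = 0.0
    for x, p in zip(calib, teacher_probs):
        q = next_token_probs(x, router, experts)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(calib)


def router_kd(router, experts, calib, teacher_probs, steps=30, lr=0.5, eps=1e-5):
    """Gradient descent on the router weights only; experts are never modified."""
    router = [row[:] for row in router]      # work on a copy of the router
    loss = kd_loss(router, experts, calib, teacher_probs)
    for _ in range(steps):
        # Central finite differences stand in for backpropagation here.
        grad = [[0.0] * len(row) for row in router]
        for i, row in enumerate(router):
            for j in range(len(row)):
                old = row[j]
                row[j] = old + eps
                up = kd_loss(router, experts, calib, teacher_probs)
                row[j] = old - eps
                down = kd_loss(router, experts, calib, teacher_probs)
                row[j] = old
                grad[i][j] = (up - down) / (2 * eps)
        # Backtracking: halve the step until the KD loss actually decreases.
        step = lr
        while step > 1e-8:
            trial = [[w - step * g for w, g in zip(r, gr)]
                     for r, gr in zip(router, grad)]
            trial_loss = kd_loss(trial, experts, calib, teacher_probs)
            if trial_loss < loss:
                router, loss = trial, trial_loss
                break
            step /= 2
    return router, loss


# Teacher: the original, uncompressed toy model.
router_t = rand_matrix(N_EXPERTS, DIM)
experts_t = [rand_matrix(VOCAB, DIM) for _ in range(N_EXPERTS)]
calib = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]
teacher_probs = [next_token_probs(x, router_t, experts_t) for x in calib]

# Student: drop the last expert and keep the router rows of the survivors.
keep = [0, 1, 2]
experts_s = [[row[:] for row in experts_t[e]] for e in keep]  # frozen experts
router_s = [router_t[e][:] for e in keep]                     # stale router

loss_before = kd_loss(router_s, experts_s, calib, teacher_probs)
router_cal, loss_after = router_kd(router_s, experts_s, calib, teacher_probs)
print(f"KD loss before calibration: {loss_before:.4f}, after: {loss_after:.4f}")
```

In this sketch only the `len(keep) * DIM` router weights change, while the kept experts stay identical to the teacher's. In a real framework the same effect would typically come from freezing every non-router parameter and minimizing a KL-divergence loss against the original model's output distribution with an ordinary optimizer.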

Taxonomy

Topics: Network Traffic and Congestion Control · Image and Video Quality Assessment · Stochastic Gradient Optimization Techniques