Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models
Gyeongman Kim, Gyouk Chu, Eunho Yang

TL;DR
This paper introduces two novel knowledge distillation methods tailored for Mixture-of-Experts language models, effectively leveraging all experts' knowledge to improve model compression and performance.
Contribution
The paper proposes two MoE-specific KD techniques, Knowledge Augmentation (KA) and Student-Aware Router (SAR), which address the failure of existing KD methods to exploit the knowledge held by non-activated experts in an MoE teacher.
Findings
KA and SAR outperform traditional KD methods on MoE models
Both methods effectively utilize all experts' knowledge
Significant model compression achieved with maintained performance
Abstract
With the emergence of Mixture-of-Experts (MoE), the efficient scaling of model size has accelerated the development of large language models in recent years. However, their high memory requirements prevent their use in resource-constrained environments. While knowledge distillation (KD) has been a proven method for model compression, its application to MoE teacher models remains underexplored. Through our investigation, we discover that non-activated experts in MoE models possess valuable knowledge that benefits student models. We further demonstrate that existing KD methods are not optimal for compressing MoE models, as they fail to leverage this knowledge effectively. To address this, we propose two intuitive MoE-specific KD methods for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR), both designed to effectively extract knowledge from all experts.…
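To make the term "non-activated experts" concrete, the following is a minimal sketch of standard top-k routing together with a conventional temperature-scaled KD loss. It is not the paper's KA or SAR method, whose details the abstract does not give; the function names, the router logits, and the choice of k are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Map each of the k highest-scoring experts to its renormalized gate weight."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical router scores for one token over four experts.
router_logits = [2.0, 1.5, 0.3, -1.0]
gates = top_k_route(router_logits, k=2)

# Experts outside the top-k contribute nothing to the teacher's output
# for this token, so a conventional KD loss never sees what they encode.
inactive = [i for i in range(len(router_logits)) if i not in gates]
```

In this sketch, experts 2 and 3 are never evaluated for the token, so the teacher logits passed to `kd_loss` reflect only experts 0 and 1. The abstract's claim is that the skipped experts still hold knowledge useful to the student.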
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling · Speech and dialogue systems
Methods: Knowledge Distillation · Mixture of Experts
