One Student Knows All Experts Know: From Sparse to Dense
Fuzhao Xue, Xiaoxin He, Xiaozhe Ren, Yuxuan Lou, Yang You

TL;DR
This paper introduces a knowledge integration framework to convert sparse mixture-of-experts models into dense, efficient models that retain similar performance, making deployment easier and faster.
Contribution
The paper proposes a novel task and training framework for transforming sparse MoE models into dense models using knowledge gathering and distillation, with four new knowledge gathering methods.
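The distillation half of the framework can be illustrated with a standard soft-label distillation loss. This is a minimal sketch under assumptions: it uses the common temperature-scaled KL divergence between teacher (MoE) and student (dense) predictions, which is a generic technique and not necessarily the exact objective used in the paper. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis, with temperature scaling."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    Assumption: generic soft-label distillation, scaled by temperature**2
    as is conventional; the paper's actual loss is not given in this summary.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    eps = 1e-12  # avoids log(0)
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1)
    return float(np.mean(kl) * temperature ** 2)

# Usage: a batch of 2 examples over 3 classes.
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
student = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge.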
Findings
Achieves 78.4% top-1 accuracy on ImageNet with 15M parameters.
Outperforms baseline models by 51.7% on NLP datasets.
Provides 3.7x inference speedup compared to MoE models.
Abstract
The human education system trains one student with multiple experts. Mixture-of-experts (MoE) is a powerful sparse architecture that includes multiple experts. However, a sparse MoE model is prone to overfitting, hard to deploy, and not hardware-friendly for practitioners. In this work, inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) as knowledgeable as one sparse MoE. We investigate this task by proposing a general training framework including knowledge gathering and knowledge distillation. Specifically, to gather key knowledge from different pre-trained experts, we first investigate four different possible knowledge gathering methods, i.e., summation, averaging, Top-K Knowledge Gathering (Top-KG), and Singular Value Decomposition Knowledge Gathering (SVD-KG) proposed in this paper. We then refine the dense student model by…
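To make the four knowledge gathering options more concrete, here is a rough sketch of how several expert weight matrices could be merged into a single dense matrix. It is an interpretation based only on the method names: the paper's actual definitions of Top-KG and SVD-KG are not given in this excerpt, so the magnitude-based top-k selection and the per-expert truncated SVD below are assumptions, as are all function and parameter names.

```python
import numpy as np

def gather_sum(experts):
    """Summation: add the expert weight matrices elementwise."""
    return np.sum(experts, axis=0)

def gather_mean(experts):
    """Averaging: elementwise mean of the expert weight matrices."""
    return np.mean(experts, axis=0)

def gather_topk(experts, k):
    """Assumed Top-K gathering: keep each expert's k largest-magnitude
    entries, zero the rest, then sum the sparse results."""
    merged = np.zeros_like(experts[0])
    for w in experts:
        flat = np.abs(w).ravel()
        keep = np.argpartition(flat, -k)[-k:]  # indices of the k largest magnitudes
        mask = np.zeros(flat.shape, dtype=bool)
        mask[keep] = True
        merged += w * mask.reshape(w.shape)
    return merged

def gather_svd(experts, rank):
    """Assumed SVD gathering: keep each expert's top-`rank` singular
    components, then sum the low-rank approximations."""
    merged = np.zeros_like(experts[0])
    for w in experts:
        u, s, vt = np.linalg.svd(w, full_matrices=False)
        merged += (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return merged

# Usage: 4 hypothetical experts, each an 8x6 weight matrix.
rng = np.random.default_rng(0)
experts = rng.standard_normal((4, 8, 6))
dense_sum = gather_sum(experts)
dense_mean = gather_mean(experts)
dense_topk = gather_topk(experts, k=10)
dense_svd = gather_svd(experts, rank=2)
```

In this sketch every method returns a matrix with the shape of a single expert, so the result can initialize one dense layer. With `k` equal to the number of entries, or `rank` equal to the full rank, the last two methods reduce to plain summation.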
Taxonomy
Topics: COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Knowledge Distillation
