Improving the Downstream Performance of Mixture-of-Experts Transformers via Weak Vanilla Transformers
Xin Lu, Yanyan Zhao, Bing Qin, Ting Liu

TL;DR
This paper introduces transfer capability distillation, which uses vanilla Transformers as teachers to improve the downstream performance of MoE Transformers, whose weaker transfer capability otherwise holds them back.
Contribution
It proposes a novel transfer capability distillation method that guides MoE models using vanilla models to boost downstream task performance.
Findings
MoE models underperform vanilla models in transfer capability
Transfer capability distillation significantly improves MoE downstream performance
Experimental results on BERT show notable performance gains
Abstract
Recently, Mixture of Experts (MoE) Transformers have garnered increasing attention due to their advantages in model capacity and computational efficiency. However, studies have indicated that MoE Transformers underperform vanilla Transformers in many downstream tasks, significantly diminishing the practical value of MoE models. To explain this issue, we propose that the pre-training performance and transfer capability of a model are joint determinants of its downstream task performance. MoE models, in comparison to vanilla models, have poorer transfer capability, leading to their subpar performance in downstream tasks. To address this issue, we introduce the concept of transfer capability distillation, positing that although vanilla models have weaker performance, they are effective teachers of transfer capability. The MoE models guided by vanilla models can achieve both strong…
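The abstract describes an MoE student guided by a vanilla teacher but does not state the training objective. As a rough illustration only, the sketch below combines a pre-training loss with a term that pulls the student's hidden representation toward the teacher's. The function names, the mean-squared-error alignment term, and the weight `alpha` are all assumptions made for this example; they are not taken from the paper.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(
    pretrain_loss: float,
    student_hidden: Sequence[float],
    teacher_hidden: Sequence[float],
    alpha: float = 1.0,
) -> float:
    """Hypothetical objective: pre-training loss plus a weighted alignment term.

    pretrain_loss  -- the student's usual pre-training loss (e.g. masked LM)
    student_hidden -- a hidden representation from the MoE student
    teacher_hidden -- the matching representation from the vanilla teacher
    alpha          -- assumed weight on the alignment term
    """
    alignment = mse(student_hidden, teacher_hidden)
    return pretrain_loss + alpha * alignment


if __name__ == "__main__":
    # Toy values chosen for illustration, not experimental data.
    print(combined_loss(2.0, [1.0, 2.0], [1.0, 4.0], alpha=0.5))
```

In this toy form the teacher only supplies a target representation, so a teacher with weaker pre-training performance can still shape the student's features without replacing the student's own pre-training signal.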
Taxonomy
Topics: Biochemical and biochemical processes · Plant Surface Properties and Treatments
Methods: Attention Is All You Need · Linear Layer · WordPiece · Layer Normalization · Dropout · Multi-Head Attention · Attention Dropout · Linear Warmup With Linear Decay · Softmax
