Improving the Downstream Performance of Mixture-of-Experts Transformers via Weak Vanilla Transformers

Xin Lu; Yanyan Zhao; Bing Qin; Ting Liu

arXiv:2403.01994 · cs.CL · November 17, 2025 · 1 citation

TL;DR

This paper introduces transfer capability distillation, in which a vanilla Transformer serves as the teacher for an MoE Transformer. The approach targets the MoE model's weaker transfer capability, which the authors identify as the cause of its poorer downstream performance.

Contribution

It proposes a novel transfer capability distillation method that guides MoE models using vanilla models to boost downstream task performance.

Findings

01. MoE models underperform vanilla models in transfer capability.
02. Transfer capability distillation significantly improves MoE downstream performance.
03. Experimental results on BERT show notable performance gains.

Abstract

Recently, Mixture of Experts (MoE) Transformers have garnered increasing attention due to their advantages in model capacity and computational efficiency. However, studies have indicated that MoE Transformers underperform vanilla Transformers in many downstream tasks, significantly diminishing the practical value of MoE models. To explain this issue, we propose that the pre-training performance and transfer capability of a model are joint determinants of its downstream task performance. MoE models, in comparison to vanilla models, have poorer transfer capability, leading to their subpar performance in downstream tasks. To address this issue, we introduce the concept of transfer capability distillation, positing that although vanilla models have weaker performance, they are effective teachers of transfer capability. The MoE models guided by vanilla models can achieve both strong…

Taxonomy

Methods: Attention Is All You Need · Linear Layer · WordPiece · Layer Normalization · Dropout · Multi-Head Attention · Attention Dropout · Linear Warmup With Linear Decay · Softmax