Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

Tianlong Chen; Zhenyu Zhang; Ajay Jaiswal; Shiwei Liu; Zhangyang Wang

arXiv:2303.01610 · cs.LG · March 6, 2023 · 6 cites

TL;DR

This paper introduces SMoE-Dropout, a training framework for sparse Mixture-of-Experts transformers that improves scalability, reduces expert redundancy, and lets performance adapt to the resources available during inference and fine-tuning.

Contribution

The paper proposes SMoE-Dropout, a plug-and-play training method that improves transformer scalability and performance while avoiding expert collapse. It uses a fixed router and increases the number of active experts as training progresses.
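The two ingredients named above, a fixed router and a growing number of active experts, can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`make_fixed_router`, `active_experts`, `route`), the random Gaussian initialization, the linear schedule, and the dot-product scoring are all assumptions made for illustration.

```python
import random


def make_fixed_router(d_model, num_experts, seed=0):
    """Build router weights once; they are never updated afterwards.

    Assumption: random Gaussian initialization, one weight row per expert.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)]
            for _ in range(num_experts)]


def active_experts(step, total_steps, num_experts, k_start=1):
    """Hypothetical linear schedule: k grows from k_start to num_experts."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return k_start + int(frac * (num_experts - k_start))


def route(token, router, k):
    """Return the indices of the k experts with the highest router score."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    router = make_fixed_router(d_model=4, num_experts=8)
    token = [0.5, -1.0, 0.25, 2.0]
    for step in (0, 500, 1000):
        k = active_experts(step, total_steps=1000, num_experts=8)
        print(step, k, route(token, router, k))
```

Because the router is frozen, the expert ranking for a given token never changes; only the cutoff `k` moves. Under these assumptions, the experts selected with a small `k` are always a prefix of those selected with a larger `k`, which is one way to picture how the same model could be run with fewer or more experts at inference.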

Findings

01. Outperforms a densely trained BERT by 1.03%, 0.78%, and 1.09% on three reasoning tasks.

02. Achieves significant computation savings compared to dense baselines.

03. Enables resource-aware, self-slimmable transformers whose performance adapts to the available compute.

Abstract

Despite their remarkable achievement, gigantic transformers encounter significant drawbacks, including exorbitant computational and memory footprints during training, as well as severe collapse evidenced by a high degree of parameter redundancy. Sparsely-activated Mixture-of-Experts (SMoEs) have shown promise to mitigate the issue of training efficiency, yet they are prone to (1) redundant experts due to representational collapse; and (2) poor expert scalability for inference and downstream fine-tuning, primarily due to overfitting of the learned routing policy to the number of activated experts during training. As recent research efforts are predominantly focused on improving routing policies to encourage expert specializations, this work focuses on exploring the overlooked scalability bottleneck of SMoEs and leveraging it to effectively scale dense transformers. To this end, we…

Code & Models

Repositories

vita-group/random-moe-as-dropout (PyTorch, official)

Videos

Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and ELM · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Weight Decay · WordPiece · Dropout · Softmax · Adam