Efficient Language Modeling with Sparse all-MLP

Ping Yu; Mikel Artetxe; Myle Ott; Sam Shleifer; Hongyu Gong; Ves Stoyanov; Xian Li

arXiv:2203.06850·cs.CL·June 2, 2022

Open Access

TL;DR

This paper introduces sparse all-MLP models with mixture-of-experts that raise language-model capacity at constant compute, outperforming Transformers and dense MLPs in perplexity and on downstream tasks.

Contribution

It proposes a sparse all-MLP architecture that applies mixture-of-experts along both the feature and token dimensions, addressing the expressiveness limitations of dense MLPs and improving performance and training efficiency over existing models.

Findings

01. Up to 2× training efficiency improvement.

02. Outperforms Transformer-based MoEs in perplexity.

03. Surpasses dense Transformers in downstream tasks.

Abstract

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2× improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH Layers) as well as dense Transformers and…
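
The claim that sparse activation increases capacity while keeping compute constant can be illustrated with a generic top-1 mixture-of-experts feed-forward block. The sketch below is not the paper's architecture or either of its two routing strategies, which the abstract does not describe in detail. The class `Top1MoE`, its methods, and all sizes are hypothetical names chosen for illustration, and only the Python standard library is assumed.

```python
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init_matrix(rows, cols, rng):
    """Small random weights; the scale is an arbitrary choice."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class Top1MoE:
    """Hypothetical top-1 mixture-of-experts feed-forward block."""

    def __init__(self, d_model, d_hidden, num_experts, seed=0):
        rng = random.Random(seed)
        # Router: one logit per expert for each token.
        self.router = init_matrix(num_experts, d_model, rng)
        # Each expert is a two-layer MLP: d_model -> d_hidden -> d_model.
        self.experts = [
            (init_matrix(d_hidden, d_model, rng),
             init_matrix(d_model, d_hidden, rng))
            for _ in range(num_experts)
        ]

    def forward(self, x):
        """Route one token vector to its highest-scoring expert."""
        gates = softmax(matvec(self.router, x))
        k = max(range(len(gates)), key=gates.__getitem__)
        w_in, w_out = self.experts[k]
        h = [max(0.0, v) for v in matvec(w_in, x)]   # ReLU
        y = [gates[k] * v for v in matvec(w_out, h)]  # scale by gate value
        return y, k

    def expert_params_per_token(self):
        """Expert weights actually used for a single token."""
        return sum(len(w) * len(w[0]) for w in self.experts[0])

    def num_params(self):
        """Total weights stored: all experts plus the router."""
        router = len(self.router) * len(self.router[0])
        return len(self.experts) * self.expert_params_per_token() + router


if __name__ == "__main__":
    for n in (4, 16):
        moe = Top1MoE(d_model=4, d_hidden=8, num_experts=n)
        _, chosen = moe.forward([0.5, -1.0, 0.25, 2.0])
        print(n, chosen, moe.num_params(), moe.expert_params_per_token())
```

In this sketch, going from 4 to 16 experts multiplies the stored expert weights by four, while each token still passes through a single expert of the same size. Only the small router grows with the number of experts.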

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Dense Connections · Residual Connection · Spatial Gating Unit · Layer Normalization · Balanced Selection