Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT

James Lee-Thorp; Joshua Ainslie

arXiv:2205.12399 · cs.LG · October 14, 2022 · 1 citation

TL;DR

This paper introduces Sparse Mixer, an encoder that combines sparsely gated Mixture-of-Experts (MoE) with linear mixing transformations. It trains and runs inference faster than BERT while slightly outperforming it on GLUE and SuperGLUE.

Contribution

The paper presents Sparse Mixer, an encoder that integrates MoE with linear mixing transformations and addresses many of the latency and stability concerns of MoE models.

Findings

01

Sparse Mixer outperforms BERT on GLUE and SuperGLUE by less than 1%.

02

Sparse Mixer trains 65% faster and runs inference 61% faster than BERT.

03

Fast Sparse Mixer trains and runs nearly twice as fast as BERT while marginally underperforming it on SuperGLUE.

Abstract

We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. Sparse Mixer slightly outperforms (<1%) BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster. We also present a faster variant, prosaically named Fast Sparse Mixer, that marginally underperforms BERT on SuperGLUE, but trains and runs nearly twice as fast. We justify the design of these two models by carefully ablating through various mixing mechanisms, MoE configurations and hyperparameters. Sparse Mixer overcomes many of the latency and stability concerns of MoE models and offers the prospect of serving sparse student models, without resorting to distilling them to dense variants.
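
To make the two ingredients named in the abstract concrete, here is a minimal NumPy sketch of one encoder-style layer that applies a linear token-mixing sublayer followed by a sparsely gated MoE feed-forward sublayer. It is an illustration under stated assumptions, not the authors' implementation: the top-1 token routing, the ReLU experts, the random mixing matrix, the omission of layer normalization, and all function names (`linear_mix`, `top1_moe`, `sketch_layer`) are hypothetical choices made for this example.

```python
import numpy as np

def linear_mix(x, mix_matrix):
    """Token mixing: each output token is a fixed linear combination of all input tokens."""
    # x: (seq_len, d_model); mix_matrix: (seq_len, seq_len)
    return mix_matrix @ x

def top1_moe(x, router_w, w_in, w_out):
    """Sparsely gated feed-forward: each token is processed by exactly one expert (assumed top-1 routing)."""
    # router_w: (d_model, n_experts)
    # w_in: (n_experts, d_model, d_ff); w_out: (n_experts, d_ff, d_model)
    logits = x @ router_w
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    expert_idx = probs.argmax(axis=-1)                    # chosen expert per token
    gate = probs[np.arange(x.shape[0]), expert_idx]       # router probability of the chosen expert
    out = np.zeros_like(x)
    for e in range(router_w.shape[1]):
        mask = expert_idx == e
        if mask.any():
            hidden = np.maximum(x[mask] @ w_in[e], 0.0)   # ReLU expert (an assumption)
            out[mask] = gate[mask, None] * (hidden @ w_out[e])
    return out, expert_idx

def sketch_layer(x, mix_matrix, router_w, w_in, w_out):
    """Mixing sublayer, then MoE sublayer, each with a residual connection (layer norm omitted)."""
    x = x + linear_mix(x, mix_matrix)
    moe_out, expert_idx = top1_moe(x, router_w, w_in, w_out)
    return x + moe_out, expert_idx

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_experts = 8, 16, 32, 4
x = rng.normal(size=(seq_len, d_model))
mix_matrix = rng.normal(size=(seq_len, seq_len)) / seq_len
router_w = rng.normal(size=(d_model, n_experts))
w_in = rng.normal(size=(n_experts, d_model, d_ff)) * 0.1
w_out = rng.normal(size=(n_experts, d_ff, d_model)) * 0.1

y, expert_idx = sketch_layer(x, mix_matrix, router_w, w_in, w_out)
print(y.shape, expert_idx)
```

In this sketch the mixing step costs a single matrix product with no input-dependent attention weights, and each token passes through only one expert's feed-forward weights, so total parameter count grows with the number of experts while per-token compute does not.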

Code & Models

Repositories

google-research/google-research (tf, Official)

Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Layer Normalization · Weight Decay · Linear Warmup With Linear Decay · Dense Connections · Dropout · Adam · Attention Dropout