Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
James Lee-Thorp, Joshua Ainslie

TL;DR
Sparse Mixer is an encoder that combines sparsely gated Mixture-of-Experts (MoE) with linear mixing transformations. It trains 65% faster and runs inference 61% faster than BERT while slightly outperforming it (by less than 1%) on GLUE and SuperGLUE.
Contribution
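The paper presents Sparse Mixer, an encoder that integrates sparsely gated MoE with linear mixing transformations and addresses many of the latency and stability concerns of MoE models. The design is justified through ablations over mixing mechanisms, MoE configurations, and hyperparameters.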
Findings
Sparse Mixer slightly outperforms BERT (by less than 1%) on GLUE and SuperGLUE.
Sparse Mixer trains 65% faster and runs inference 61% faster than BERT.
Fast Sparse Mixer trains and runs nearly twice as fast as BERT, while marginally underperforming it on SuperGLUE.
Abstract
We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. Sparse Mixer slightly outperforms (<1%) BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster. We also present a faster variant, prosaically named Fast Sparse Mixer, that marginally underperforms BERT on SuperGLUE, but trains and runs nearly twice as fast. We justify the design of these two models by carefully ablating through various mixing mechanisms, MoE configurations and hyperparameters. Sparse Mixer overcomes many of the latency and stability concerns of MoE models and offers the prospect of serving sparse student models, without resorting to distilling them to dense variants.
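To make the two ingredients named in the abstract concrete, the following is a minimal sketch in plain Python of a linear mixing step followed by a sparsely gated MoE step. It is not the authors' implementation. The abstract does not specify the routing scheme, layer arrangement, normalization, or residual connections, so top-1 routing, the matrix shapes, and every function name here (`linear_mixing`, `top1_moe`, `random_matrix`) are assumptions chosen for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def linear_mixing(tokens, mix):
    """Mix information across positions with fixed weights: out = mix @ tokens.

    tokens: [seq_len][d_model]; mix: [seq_len][seq_len].
    Unlike attention, the mixing weights do not depend on the input.
    """
    return matmul(mix, tokens)

def top1_moe(tokens, router, experts):
    """Send each token to the one expert with the highest router probability.

    router: [d_model][num_experts]; experts: list of [d_model][d_model] matrices.
    Top-1 routing is an assumption of this sketch, not a detail from the abstract.
    Returns (outputs, chosen expert index per token).
    """
    outputs, choices = [], []
    for tok in tokens:
        logits = matmul([tok], router)[0]
        # Numerically stable softmax over the router logits.
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        k = max(range(len(probs)), key=probs.__getitem__)
        # Only the selected expert runs, so per-token compute stays constant
        # as more experts (more parameters) are added.
        expert_out = matmul([tok], experts[k])[0]
        outputs.append([probs[k] * v for v in expert_out])
        choices.append(k)
    return outputs, choices

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small Gaussian-initialized matrix."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, num_experts = 4, 8, 3
tokens = random_matrix(seq_len, d_model, rng)
mix = random_matrix(seq_len, seq_len, rng)
router = random_matrix(d_model, num_experts, rng)
experts = [random_matrix(d_model, d_model, rng) for _ in range(num_experts)]

mixed = linear_mixing(tokens, mix)
moe_out, choices = top1_moe(mixed, router, experts)
```

The sketch only shows where the efficiency argument could come from: the mixing step is a single input-independent matrix product, and the MoE step evaluates one expert per token regardless of how many experts exist. A real encoder would add details the abstract does not describe.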
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Layer Normalization · Weight Decay · Linear Warmup With Linear Decay · Dense Connections · Dropout · Adam · Attention Dropout
