SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber

TL;DR
SwitchHead introduces a novel Mixture-of-Experts attention mechanism for Transformers, significantly reducing computation and memory while maintaining performance, and enabling fully-MoE models with improved efficiency and downstream task results.
Contribution
We propose SwitchHead, a new MoE method for attention layers that reduces compute and memory needs without sacrificing language modeling performance.
Findings
SwitchHead computes up to 8 times fewer attention matrices.
SwitchHead matches the baseline's perplexity using only 44% of the compute and 27% of the memory.
SwitchAll models with SwitchHead outperform the baseline on downstream tasks.
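To make the "8 times fewer attention matrices" finding concrete, here is a rough counting sketch. It assumes a standard multi-head layer computes one seq_len × seq_len attention matrix per head, and it uses hypothetical head counts (16 versus 2), sequence length, and head size that are not taken from the paper. It counts only the attention-matrix work and ignores projections and expert routing, so it is not expected to reproduce the 44% compute and 27% memory figures above.

```python
def attention_matrix_cost(n_heads: int, seq_len: int, d_head: int) -> dict:
    """Rough per-layer, per-sequence cost of the attention-matrix work only."""
    # One seq_len x seq_len attention matrix is computed per head.
    matrices = n_heads
    # Multiply-accumulates for Q @ K^T plus attention @ V:
    # each is seq_len * seq_len * d_head per head.
    macs = 2 * n_heads * seq_len * seq_len * d_head
    # Number of attention weights that must be held in memory.
    stored = n_heads * seq_len * seq_len
    return {"matrices": matrices, "macs": macs, "stored_floats": stored}


# Hypothetical configurations chosen only to illustrate an 8x reduction.
dense = attention_matrix_cost(n_heads=16, seq_len=512, d_head=64)
fewer = attention_matrix_cost(n_heads=2, seq_len=512, d_head=64)

ratio = dense["matrices"] / fewer["matrices"]
print(f"attention matrices: {dense['matrices']} vs {fewer['matrices']} ({ratio:.0f}x fewer)")
print(f"attention MACs:     {dense['macs']} vs {fewer['macs']}")
```

With the head size held fixed, both the attention MACs and the stored attention weights scale linearly with the number of heads, which is why cutting the number of attention matrices is attractive for long sequences.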
Abstract
Despite many recent works on Mixture of Experts (MoEs) for resource-efficient Transformer language models, existing methods mostly focus on MoEs for feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that successfully reduces both the compute and memory requirements, achieving wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our novel MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% compute…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques · Topic Modeling
Methods: Focus · Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Dropout · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings
