SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Róbert Csordás; Piotr Piękos; Kazuki Irie; Jürgen Schmidhuber

arXiv:2312.07987 · cs.LG · October 2, 2024 · 2 cites

TL;DR

SwitchHead introduces a novel Mixture-of-Experts attention mechanism for Transformers, significantly reducing computation and memory while maintaining performance, and enabling fully-MoE models with improved efficiency and downstream task results.

Contribution

We propose SwitchHead, a new MoE method for attention layers that reduces compute and memory needs without sacrificing language modeling performance.

Findings

01

SwitchHead computes up to 8 times fewer attention matrices.

02

On a 262M-parameter model trained on C4, SwitchHead matches baseline perplexity with 44% of the compute and 27% of the memory.

03

Fully-MoE SwitchAll models, which combine SwitchHead with MoE feedforward layers, outperform the baseline on downstream tasks.

Abstract

Despite many recent works on Mixture of Experts (MoEs) for resource-efficient Transformer language models, existing methods mostly focus on MoEs for feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. Our novel SwitchHead is an effective MoE method for the attention layer that successfully reduces both the compute and memory requirements, achieving wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Our novel MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% compute…
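The abstract does not spell out the mechanism, so the following is only a minimal NumPy sketch of one way MoE attention can cut the number of attention matrices: each head still computes a single attention matrix, while its value and output projections are chosen per token from a small pool of experts. The sigmoid gate, the top-k routing, gating the output experts on the layer input, the omission of a causal mask, and every name here (`moe_project`, `moe_attention_head`, `v_experts`, `o_experts`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_project(inp, gate_inp, experts, gate_w, k):
    """Per-token top-k mixture of linear experts (hypothetical helper).

    inp:      (T, d_in)   vectors to project
    gate_inp: (T, d_gate) vectors used to score the experts
    experts:  (E, d_in, d_out) one weight matrix per expert
    gate_w:   (d_gate, E) gating weights
    """
    scores = 1.0 / (1.0 + np.exp(-(gate_inp @ gate_w)))  # (T, E); sigmoid gate is an assumption
    top = np.argsort(-scores, axis=-1)[:, :k]             # indices of the k best experts per token
    out = np.zeros((inp.shape[0], experts.shape[2]))
    for t in range(inp.shape[0]):
        for e in top[t]:
            out[t] += scores[t, e] * (inp[t] @ experts[e])
    return out

def moe_attention_head(x, w_q, w_k, v_experts, v_gate, o_experts, o_gate, k=2):
    """One attention head with expert-selected value and output projections."""
    q, key = x @ w_q, x @ w_k                           # dense query and key projections
    v = moe_project(x, x, v_experts, v_gate, k)         # expert-selected values, (T, d_head)
    attn = softmax(q @ key.T / np.sqrt(q.shape[-1]))    # the single (T, T) attention matrix
    ctx = attn @ v                                      # (T, d_head)
    out = moe_project(ctx, x, o_experts, o_gate, k)     # expert-selected output, (T, d_model)
    return out, attn

# Usage with arbitrary toy sizes (not taken from the paper).
rng = np.random.default_rng(0)
T, d_model, d_head, E = 5, 16, 4, 3
x = rng.normal(size=(T, d_model))
out, attn = moe_attention_head(
    x,
    rng.normal(size=(d_model, d_head)), rng.normal(size=(d_model, d_head)),
    rng.normal(size=(E, d_model, d_head)), rng.normal(size=(d_model, E)),
    rng.normal(size=(E, d_head, d_model)), rng.normal(size=(d_model, E)),
)
print(out.shape, attn.shape)
```

In this sketch the experts add capacity to a head without adding attention matrices, which is why a model built this way could use fewer heads than a dense Transformer of similar size.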

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques · Topic Modeling

Methods: Focus · Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Dropout · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings