Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller

TL;DR
Performers are a scalable Transformer architecture that approximates softmax attention in linear time and space using a novel random-feature kernel approximation (FAVOR+), which makes it practical to compare softmax with other attention kernels on large-scale tasks.
Contribution
The paper presents FAVOR+, a new linear attention mechanism that approximates softmax attention with provable accuracy and without relying on priors such as sparsity or low-rankness. The same mechanism also applies to other kernelizable attention mechanisms.
Findings
Performers achieve competitive results on diverse tasks.
FAVOR+ efficiently models kernelizable attention mechanisms.
The method offers provable accuracy with linear (rather than quadratic) space and time complexity.
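As a rough illustration of where the linear complexity comes from, the sketch below approximates softmax attention with positive random features and reorders the matrix products so that the n × n attention matrix is never formed. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the projections are plain i.i.d. Gaussian rather than orthogonal, attention is bidirectional (non-causal), and no numerical stabilization is applied.

```python
import numpy as np


def positive_random_features(x, w):
    """Map rows of x (n, d) to strictly positive features (n, m).

    Uses the identity exp(q.k) = E_w[exp(w.q - |q|^2/2) * exp(w.k - |k|^2/2)]
    for w ~ N(0, I). The rows of w (m, d) are the sampled projections.
    """
    m = w.shape[0]
    half_sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(m)


def linear_attention(q, k, v, w):
    """Approximate softmax(q k^T / sqrt(d)) v without building the n x n matrix."""
    d = q.shape[-1]
    scale = d ** -0.25  # splits the 1/sqrt(d) factor evenly between q and k
    q_feat = positive_random_features(q * scale, w)  # (n, m)
    k_feat = positive_random_features(k * scale, w)  # (n, m)
    kv = k_feat.T @ v                                # (m, d_v), cost linear in n
    normalizer = q_feat @ k_feat.sum(axis=0)         # (n,), approximate row sums
    return (q_feat @ kv) / normalizer[:, None]


def softmax_attention(q, k, v):
    """Exact quadratic-cost reference, used here only for comparison."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


# Illustrative sizes and input scale (assumptions chosen for this demo only).
rng = np.random.default_rng(0)
n, d, m = 16, 8, 4096
q, k = 0.3 * rng.standard_normal((2, n, d))
v = rng.standard_normal((n, d))
w = rng.standard_normal((m, d))
err = np.max(np.abs(linear_attention(q, k, v, w) - softmax_attention(q, k, v)))
print(f"max abs error vs. exact softmax attention: {err:.4f}")
```

Because the features are positive, the approximate attention weights are positive and each output row is a normalized weighted average of the value rows, as in exact softmax attention. The error shrinks as the number of random features m grows.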
Abstract
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong…
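The "orthogonal" part of FAVOR+ refers to drawing the random projection vectors so that they are exactly orthogonal to one another. One common way to do this, sketched below, is to orthogonalize square Gaussian blocks with a QR decomposition and then rescale each row so that it keeps the norm distribution of a Gaussian vector. This is a hedged sketch of that standard construction, not necessarily the exact procedure used in the paper; the function name `orthogonal_gaussian_projections` is hypothetical.

```python
import numpy as np


def orthogonal_gaussian_projections(m, d, rng):
    """Return an (m, d) matrix whose rows are orthogonal within each block of d rows.

    Each d x d block is a random orthogonal matrix obtained from the QR
    decomposition of a Gaussian matrix. Rows are then rescaled by norms of
    independent Gaussian vectors so that each row is marginally N(0, I).
    """
    blocks = []
    for _ in range(-(-m // d)):  # ceil(m / d) blocks
        g = rng.standard_normal((d, d))
        q, r = np.linalg.qr(g)
        # Fix column signs so the orthogonal factor is uniformly (Haar) distributed.
        q = q * np.sign(np.diag(r))
        blocks.append(q)
    w = np.concatenate(blocks, axis=0)[:m]
    # Restore Gaussian-like row lengths (chi-distributed with d degrees of freedom).
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
    return w * norms[:, None]


# Illustrative usage: 8 projection vectors in 4 dimensions, i.e. two orthogonal blocks.
w = orthogonal_gaussian_projections(8, 4, np.random.default_rng(0))
print(w.shape)
```

When m exceeds d, orthogonality holds only within each block of d rows, because at most d vectors in d dimensions can be mutually orthogonal.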
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Topic Modeling
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Fast Attention Via Positive Orthogonal Random Features · Performer · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Label Smoothing
