Rethinking Attention with Performers

Krzysztof Choromanski; Valerii Likhosherstov; David Dohan; Xingyou Song; Andreea Gane; Tamas Sarlos; Peter Hawkins; Jared Davis; Afroz Mohiuddin; Lukasz Kaiser; David Belanger; Lucy Colwell; Adrian Weller

arXiv:2009.14794 · cs.LG · November 22, 2022 · 122 cites

TL;DR

Performers are a scalable Transformer architecture that approximates softmax attention in linear, rather than quadratic, time and space using a novel random-feature kernel approximation (FAVOR+). This makes large-scale applications feasible and allows softmax to be compared with other attention kernels.

Contribution

The paper presents FAVOR+, a linear attention mechanism that approximates softmax attention with provable accuracy and without relying on priors such as sparsity or low-rankness. The same mechanism also applies to other kernelizable attention mechanisms.

Findings

1. Performers achieve competitive results on diverse tasks.

2. FAVOR+ efficiently models kernelizable attention mechanisms.

3. The method offers provable accuracy with linear complexity.
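
The linear-complexity finding can be made concrete with a short derivation. This is a sketch under assumed notation: L is the sequence length, d the head dimension, m the number of random features, and Q' and K' are the feature-mapped queries and keys. The symbols are chosen here for illustration and are not taken from this page.

```latex
% Regular softmax attention: the L x L matrix A is formed explicitly.
\mathrm{Att}(Q, K, V) = D^{-1} A V, \qquad
A = \exp\!\left( Q K^{\top} / \sqrt{d} \right), \qquad
D = \mathrm{diag}\!\left( A \mathbf{1}_L \right)
% Cost: O(L^2 d) time, O(L^2) space.

% Kernelized attention: A \approx Q' (K')^{\top} with Q', K' \in \mathbb{R}^{L \times m}.
% Reordering the matrix products avoids ever forming an L x L matrix.
\widehat{\mathrm{Att}}(Q, K, V) = \widehat{D}^{-1} \left( Q' \left( (K')^{\top} V \right) \right), \qquad
\widehat{D} = \mathrm{diag}\!\left( Q' \left( (K')^{\top} \mathbf{1}_L \right) \right)
% Cost: O(L m d) time, O(L m + L d + m d) space, i.e. linear in L for fixed m and d.
```

The saving comes entirely from associativity: computing (K')^T V first yields an m x d matrix, so the sequence length L only ever appears as a single factor.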

Abstract

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong…
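
The idea in the abstract can be illustrated with a minimal NumPy sketch of positive orthogonal random features for bidirectional attention. It relies on the identity exp(x·y) = E[exp(w·x − |x|²/2) · exp(w·y − |y|²/2)] for w drawn from a standard Gaussian. The function names, the block-wise QR construction, and the parameter choices are assumptions made for illustration; this is not the authors' reference implementation, and it omits details such as numerical stabilization, causal masking, and feature redrawing.

```python
import numpy as np


def orthogonal_gaussian_matrix(m, d, rng):
    """Return an (m, d) matrix whose rows are orthogonal within each d x d block
    and whose row norms are distributed like those of Gaussian vectors."""
    blocks = []
    for _ in range(-(-m // d)):  # ceil(m / d) blocks
        q, r = np.linalg.qr(rng.standard_normal((d, d)))
        q = q * np.sign(np.diag(r))  # fix column signs so q is uniformly distributed
        blocks.append(q.T)
    directions = np.concatenate(blocks, axis=0)[:m]
    # Rescale each unit row by the norm of an independent Gaussian vector.
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1, keepdims=True)
    return norms * directions


def positive_random_features(x, w):
    """Map x of shape (L, d) to strictly positive features of shape (L, m)."""
    m = w.shape[0]
    half_sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(m)


def linear_attention(q, k, v, w):
    """Approximate softmax attention without forming the L x L attention matrix."""
    scale = q.shape[-1] ** -0.25  # splits the usual 1/sqrt(d) between q and k
    q_prime = positive_random_features(q * scale, w)
    k_prime = positive_random_features(k * scale, w)
    kv = k_prime.T @ v                  # (m, d_v), cost linear in L
    normalizer = k_prime.sum(axis=0)    # (m,)
    return (q_prime @ kv) / (q_prime @ normalizer)[:, None]


def softmax_attention(q, k, v):
    """Exact quadratic-cost softmax attention, used only as a reference."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

For small inputs, comparing `linear_attention(q, k, v, w)` against `softmax_attention(q, k, v)` with `w = orthogonal_gaussian_matrix(m, d, rng)` gives a quick sense of how the approximation tightens as the number of features `m` grows. Because every feature is strictly positive, the approximate attention weights stay positive and the normalizer cannot cross zero, which is the motivation for using positive rather than trigonometric features.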

Code & Models

Videos

Rethinking Attention with Performers (Paper Explained) · YouTube

Rethinking Attention with Performers · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Topic Modeling

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Fast Attention Via Positive Orthogonal Random Features · Performer · Dense Connections · Dropout · Layer Normalization · Byte Pair Encoding · Label Smoothing