An Attention Free Transformer
Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, Josh Susskind

TL;DR
The paper presents Attention Free Transformer (AFT), a novel model that removes the need for dot product self-attention, reducing memory complexity and maintaining competitive performance across various tasks.
Contribution
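The paper introduces AFT, a Transformer variant whose memory complexity is linear in both context size and feature dimension, along with two variants, AFT-local and AFT-conv, that add locality and spatial weight sharing while maintaining global connectivity.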
Findings
AFT achieves competitive results on CIFAR10, Enwik8, and ImageNet-1K.
AFT demonstrates improved efficiency compared to traditional Transformers.
The AFT-local and AFT-conv variants incorporate locality and spatial weight sharing while maintaining global connectivity.
Abstract
We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot product self attention. In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible to both large input and model sizes. We also introduce AFT-local and AFT-conv, two model variants that take advantage of the idea of locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification). We show that AFT demonstrates competitive performance on all the benchmarks, while…
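The operation described in the abstract can be sketched in a few lines. The following is a minimal, unoptimized illustration in plain Python, not the authors' implementation. The abstract only says that keys and values are combined with learned position biases and the result is multiplied element-wise with the query; the sigmoid gate on the query and the softmax-style normalization over positions used here are assumptions, and the function name `aft_full` and its argument layout are hypothetical.

```python
import math


def aft_full(Q, K, V, w):
    """Sketch of an AFT-style layer (assumed form, not the reference code).

    Q, K, V: lists of T rows, each a list of d floats.
    w:       T x T list of position biases (learned in a real model).
    Returns a T x d list Y.
    """
    T, d = len(Q), len(Q[0])
    Y = [[0.0] * d for _ in range(T)]
    for t in range(T):
        for j in range(d):
            # Combine each key feature with the position bias for (t, s).
            logits = [K[s][j] + w[t][s] for s in range(T)]
            # Subtract the max before exponentiating for numerical stability.
            m = max(logits)
            weights = [math.exp(x - m) for x in logits]
            # Weighted average of the values over all positions s.
            num = sum(wt * V[s][j] for s, wt in enumerate(weights))
            den = sum(weights)
            # Assumption: the query acts as an element-wise sigmoid gate.
            gate = 1.0 / (1.0 + math.exp(-Q[t][j]))
            Y[t][j] = gate * num / den
    return Y


# Tiny example: zero queries, keys, and biases give half the mean of V.
Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
w = [[0.0, 0.0], [0.0, 0.0]]
Y = aft_full(Q, K, V, w)
```

Note that no query-key dot product is ever formed: each output feature is computed from per-feature weights over positions, which is the property the abstract highlights.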
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Free Transformer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Byte Pair Encoding · Residual Connection
