An Attention Free Transformer

Shuangfei Zhai; Walter Talbott; Nitish Srivastava; Chen Huang; Hanlin Goh; Ruixiang Zhang; Josh Susskind

arXiv:2105.14103·cs.LG·September 23, 2021·42 cites

Open Access 5 Repos 1 Models

TL;DR

The paper presents the Attention Free Transformer (AFT), a Transformer variant that replaces dot product self-attention with an element-wise operation whose memory cost is linear in both context size and feature dimension, while remaining competitive on the reported benchmarks.

Contribution

The paper introduces AFT, a Transformer variant with linear memory complexity, together with two variants, AFT-local and AFT-conv, that add locality and spatial weight sharing.
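As an illustration of the locality idea behind AFT-local, the sketch below restricts a table of learned pairwise position biases to a window around each position and sets entries outside it to zero. The function name `local_position_bias`, the window parameter `s`, and the `|t - u| < s` convention are assumptions made for this example; they are not details stated on this page.

```python
def local_position_bias(w, s):
    """Keep pairwise biases w[t][u] only where |t - u| < s; zero the rest.

    w: T x T list of lists of floats (hypothetical learned position biases).
    s: window size (assumed convention: strict inequality).
    """
    T = len(w)
    return [
        [w[t][u] if abs(t - u) < s else 0.0 for u in range(T)]
        for t in range(T)
    ]


# Example: a 3 x 3 table of ones with a window of 2 becomes tridiagonal.
w = [[1.0] * 3 for _ in range(3)]
print(local_position_bias(w, 2))
```

Note that zeroing a bias is not the same as masking a position. If the layer exponentiates the bias before weighting values (an assumption here), a zero bias still yields a weight of exp(0) = 1, so distant positions continue to contribute.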

Findings

01. AFT achieves competitive results on CIFAR10, Enwik8, and ImageNet-1K.

02. AFT demonstrates improved efficiency compared to traditional Transformers.

03. AFT variants effectively incorporate locality and spatial sharing.

Abstract

We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for dot product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, the result of which is multiplied with the query in an element-wise fashion. This new operation has a memory complexity linear w.r.t. both the context size and the dimension of features, making it compatible with both large input and model sizes. We also introduce AFT-local and AFT-conv, two model variants that take advantage of the idea of locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification). We show that AFT demonstrates competitive performance on all the benchmarks, while…
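The layer described in the abstract can be sketched in a few lines. The version below is a minimal, non-causal illustration in pure Python, written under stated assumptions: the position biases form a T x T table `w`, keys plus biases are exponentiated and used as normalized weights over the values, and the query passes through a sigmoid before the element-wise product. The function name `aft_full` and these specifics are assumptions for illustration and should be checked against the paper itself.

```python
import math


def aft_full(Q, K, V, w):
    """Sketch of an AFT-style layer (assumed formulation, non-causal).

    Q, K, V: T x d lists of lists of floats.
    w: T x T list of lists of pairwise position biases (hypothetical).
    Returns a T x d list of lists.
    """
    T, d = len(Q), len(Q[0])
    out = []
    for t in range(T):
        row = []
        for j in range(d):
            # Combine key and position bias for every source position u.
            logits = [K[u][j] + w[t][u] for u in range(T)]
            m = max(logits)  # subtract the max for numerical stability
            weights = [math.exp(x - m) for x in logits]
            # Weighted average of values, computed per feature dimension.
            num = sum(wt * V[u][j] for u, wt in enumerate(weights))
            den = sum(weights)
            # Element-wise gate from the query (sigmoid is an assumption).
            gate = 1.0 / (1.0 + math.exp(-Q[t][j]))
            row.append(gate * num / den)
        out.append(row)
    return out


# With zero keys and biases the weights are uniform, and a zero query
# gives a gate of 0.5, so each output row is half the mean of V.
zeros = [[0.0, 0.0], [0.0, 0.0]]
V = [[2.0, 4.0], [4.0, 8.0]]
print(aft_full(zeros, zeros, V, zeros))
```

No query-key dot product appears anywhere: every operation acts on one feature dimension at a time, so the weights for each target position can be accumulated as running sums over source positions, with no need to store a T x T attention matrix per head.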

Code & Models

Repositories

Models

🤗 AnshulRanjan2004/MicroRWKV (model · ♡ 3)

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Attention Free Transformer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Byte Pair Encoding · Residual Connection