Stand-Alone Self-Attention in Vision Models
Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens

TL;DR
This paper demonstrates that self-attention can serve as a stand-alone layer in vision models: a fully self-attentional model outperforms the convolutional baseline on ImageNet classification and matches it on COCO detection, at lower computational cost.
Contribution
It introduces a pure self-attention vision model in which self-attention replaces spatial convolutions outright, rather than serving as an augmentation on top of them, and shows that this model is both effective and more efficient than its convolutional baseline.
Findings
The fully self-attentional model outperforms the convolutional baseline on ImageNet classification.
Pure self-attention matches baseline performance on COCO detection with fewer FLOPS.
Self-attention is especially effective in later network layers.
Abstract
Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with…
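To make the idea of swapping a spatial convolution for self-attention concrete, here is a minimal NumPy sketch of attention restricted to a k x k neighborhood around each pixel. The abstract only says "a form of self-attention", so everything specific here is an assumption for illustration: a single head, zero padding at the borders, no positional information, and the hypothetical names `local_self_attention`, `w_q`, `w_k`, and `w_v`. It is not the authors' implementation.

```python
import numpy as np


def local_self_attention(x, w_q, w_k, w_v, k=3):
    """Single-head self-attention over a k x k window around each pixel.

    Illustrative sketch only (assumed form, not the paper's exact layer).
    x:             feature map of shape (H, W, C_in)
    w_q, w_k, w_v: projection matrices of shape (C_in, C_out)
    k:             odd window size, playing the role of a conv kernel size
    Returns an array of shape (H, W, C_out).
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the window is centered")
    h, w, _ = x.shape
    pad = k // 2

    # Per-pixel linear projections (equivalent to 1x1 convolutions).
    q = x @ w_q
    # Assumption: zero-pad keys and values so border pixels see a full window.
    keys = np.pad(x @ w_k, ((pad, pad), (pad, pad), (0, 0)))
    vals = np.pad(x @ w_v, ((pad, pad), (pad, pad), (0, 0)))

    out = np.empty_like(q)
    for i in range(h):
        for j in range(w):
            # Keys and values of the k*k neighbors centered on pixel (i, j).
            kn = keys[i:i + k, j:j + k].reshape(k * k, -1)
            vn = vals[i:i + k, j:j + k].reshape(k * k, -1)
            # Content-based weights: softmax over query-key dot products.
            logits = kn @ q[i, j]
            logits = logits - logits.max()  # numerical stability
            weights = np.exp(logits)
            weights = weights / weights.sum()
            # Output is the attention-weighted sum of neighbor values.
            out[i, j] = weights @ vn
    return out


# Usage: a random 8x8 feature map, 16 input and 32 output channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 32)) for _ in range(3))
y = local_self_attention(x, w_q, w_k, w_v, k=3)
```

Unlike a convolution, the weights applied to the neighbors are computed from the content of the query pixel and its neighborhood, and the number of learned parameters does not grow with the window size `k`. The explicit loops favor readability over speed.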
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Stand-Alone Self Attention · Average Pooling · 1x1 Convolution · Batch Normalization · Feature Pyramid Network · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization
