Stand-Alone Self-Attention in Vision Models

Prajit Ramachandran; Niki Parmar; Ashish Vaswani; Irwan Bello; Anselm Levskaya; Jonathon Shlens

arXiv:1906.05909 · cs.CV · June 17, 2019 · 222 cites

TL;DR

This paper demonstrates that self-attention can serve as a stand-alone layer in vision models: a fully self-attentional model outperforms its convolutional baseline on ImageNet classification and matches it on COCO detection, in both cases with fewer FLOPS.

Contribution

It introduces a pure self-attention vision model, built by replacing the spatial convolutions of a ResNet with self-attention layers, and shows that attention is effective as a stand-alone primitive rather than only as an augmentation on top of convolutions.

Findings

01. A fully self-attentional model outperforms its convolutional baseline on ImageNet classification.

02. A pure self-attention model matches the baseline on COCO detection while using fewer FLOPS.

03. Self-attention is especially effective in the later layers of the network.
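
To make the "fewer FLOPS" finding more concrete, the sketch below compares rough weight and multiply-add counts for a k x k convolution and for a local self-attention layer over a k x k window. It rests on assumptions that are not stated on this page: a single attention head, three 1x1 projections (query, key, value) computed once per pixel, and no biases, positional embeddings, or softmax cost. The function names are hypothetical, and the output is back-of-the-envelope arithmetic, not the paper's reported figures.

```python
def conv_cost(k, c_in, c_out):
    """Weights and per-pixel multiply-adds of a k x k convolution (no bias)."""
    params = k * k * c_in * c_out
    macs_per_pixel = k * k * c_in * c_out
    return params, macs_per_pixel


def local_attention_cost(k, c_in, c_out):
    """Same quantities for single-head attention over a k x k window.

    Assumption: query, key and value are 1x1 projections computed once per
    pixel; positional embeddings and the softmax itself are ignored.
    """
    params = 3 * c_in * c_out               # independent of the window size k
    projections = 3 * c_in * c_out          # q, k, v for one pixel
    logits_and_mixing = 2 * k * k * c_out   # q.k logits plus weighted sum of v
    return params, projections + logits_and_mixing


for k in (3, 7):
    print(k, conv_cost(k, 64, 64), local_attention_cost(k, 64, 64))
```

Under these assumptions the convolution's cost grows with k² · c_in · c_out, while the attention layer's weight count does not depend on k at all and its multiply-adds grow only with k² · c_out.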

Abstract

Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to a ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with…
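
As an illustration of what "replacing a spatial convolution with a form of self-attention" can look like, here is a minimal NumPy sketch of self-attention restricted to a k x k window around each pixel. It is a simplification under stated assumptions, not the authors' implementation: it uses a single head, an odd window size, zero padding at the borders, and no positional information. The function name `local_self_attention` and its parameters are hypothetical.

```python
import numpy as np


def local_self_attention(x, w_q, w_k, w_v, k=3):
    """Single-head self-attention over a k x k window around each pixel.

    x: (H, W, C_in) feature map; w_q, w_k, w_v: (C_in, C_out) projections.
    Returns an (H, W, C_out) map, like a stride-1 'same' convolution.
    Assumes k is odd.
    """
    h, w, _ = x.shape
    pad = k // 2
    q = x @ w_q                                   # (H, W, C_out)
    # Zero-pad so border pixels also get a full k x k window. Padded
    # positions have zero keys and values but still receive some weight.
    x_pad = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    keys = x_pad @ w_k                            # (H + 2*pad, W + 2*pad, C_out)
    values = x_pad @ w_v

    out = np.empty_like(q)
    for i in range(h):
        for j in range(w):
            k_win = keys[i:i + k, j:j + k].reshape(k * k, -1)
            v_win = values[i:i + k, j:j + k].reshape(k * k, -1)
            logits = k_win @ q[i, j]              # one score per window position
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()              # softmax over the window
            out[i, j] = weights @ v_win           # content-weighted average
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 8)) for _ in range(3))
print(local_self_attention(x, w_q, w_k, w_v).shape)
```

Unlike a convolution, whose mixing weights are fixed after training, the weights here are recomputed from the content of each window, which is the "content-based interaction" the abstract refers to. The explicit loops favor readability over speed.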

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Stand-Alone Self Attention · Average Pooling · 1x1 Convolution · Batch Normalization · Feature Pyramid Network · Bottleneck Residual Block · Global Average Pooling · Residual Block · Kaiming Initialization