An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy; Lucas Beyer; Alexander Kolesnikov; Dirk Weissenborn; Xiaohua Zhai; Thomas Unterthiner; Mostafa Dehghani; Matthias Minderer; Georg Heigold; Sylvain Gelly; Jakob Uszkoreit; Neil Houlsby

arXiv:2010.11929 · cs.CV · June 4, 2021 · 21k cites

TL;DR

This paper demonstrates that pure transformer models, applied directly to image patches, can outperform traditional convolutional neural networks in image classification tasks when trained on large datasets.

Contribution

It introduces the Vision Transformer (ViT), showing that transformers can be effectively used for image recognition without convolutional components, achieving state-of-the-art results.

Findings

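01. When pre-trained on large datasets, ViT matches or outperforms state-of-the-art CNNs on multiple benchmarks (ImageNet, CIFAR-100, VTAB).

02. ViT requires substantially fewer computational resources to train than state-of-the-art convolutional networks.

03. Large-scale pre-training is crucial for ViT performance.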

Abstract

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
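The phrase "sequences of image patches" can be made concrete with a small sketch. The pure-Python function below (a hypothetical helper, not the authors' code) cuts an image into non-overlapping 16x16 patches and flattens each one into a vector, which is the kind of token sequence a transformer would consume. The 224x224 RGB input size and the row-major patch order are assumptions made for illustration. The sketch covers only the patch-splitting step and leaves out everything that follows it in the model.

```python
def image_to_patches(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened square patches.

    Returns a list of patches in row-major order; each patch is a flat list
    of patch_size * patch_size * C values.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    patch.extend(pixel)  # append this pixel's channel values
            patches.append(patch)
    return patches


# Dummy 224 x 224 RGB image filled with zeros (assumed input size).
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = image_to_patches(image)
print(len(patches), len(patches[0]))  # 196 768
```

Under these assumptions the image becomes a sequence of 196 tokens (14 x 14 patches), each of length 768 (16 x 16 pixels x 3 channels), which is where the "16x16 words" in the title comes from.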

Videos

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained) · YouTube

Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained · YouTube

Sora - Full Analysis (with new details) · YouTube

How AI Vision Evolved | Merve Noyan · YouTube

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · FixRes · Vision Transformer · Byte Pair Encoding · Softmax · Adam