An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

TL;DR
This paper demonstrates that pure transformer models, applied directly to image patches, can outperform traditional convolutional neural networks in image classification tasks when trained on large datasets.
Contribution
It introduces the Vision Transformer (ViT), showing that transformers can be effectively used for image recognition without convolutional components, achieving state-of-the-art results.
Findings
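ViT attains excellent results compared to state-of-the-art CNNs on multiple benchmarks (ImageNet, CIFAR-100, VTAB) when pre-trained on large amounts of data.
ViT requires substantially fewer computational resources to train than state-of-the-art convolutional networks.
Large-scale pre-training is crucial for ViT performance.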
Abstract
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
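The phrase "sequences of image patches" can be made concrete with a small sketch. The snippet below is an illustration, not the authors' implementation: it splits an image into non-overlapping 16x16 patches and flattens each one into a vector, which is the token sequence a transformer would consume. The function name `patchify`, the nested-list image representation, and the 224x224 RGB input size are assumptions chosen for the example; the learned linear projection and position embeddings that would follow are omitted.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into a list of flattened patches.

    Hypothetical helper for illustration only; patches are returned in
    row-major order over the patch grid.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append every channel value
            patches.append(flat)
    return patches


# Assumed input: a 224 x 224 RGB image filled with zeros.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image)
print(len(patches), len(patches[0]))  # 196 768
```

Under these assumptions the image becomes a sequence of (224 / 16)^2 = 196 tokens, each of dimension 16 * 16 * 3 = 768, so the sequence length grows quadratically as the patch size shrinks.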
Code & Models
- 🤗 Falconsai/nsfw_image_detection · model · 40.1M dl · ♡ 1024
- 🤗 google/vit-base-patch16-224 · model · 4.3M dl · ♡ 947
- 🤗 Camais03/camie-tagger-v2 · model · 82 dl · ♡ 54
- 🤗 google/vit-base-patch16-224-in21k · model · 4.3M dl · ♡ 404
- 🤗 google/paligemma-3b-pt-224 · model · 86k dl · ♡ 426
- 🤗 google/paligemma-3b-mix-448 · model · 2.9k dl · ♡ 116
- 🤗 facebook/dinov2-with-registers-large · model · 113k dl · ♡ 12
- 🤗 timm/vit_base_patch16_dinov3.lvd1689m · model · 97k dl · ♡ 6
- 🤗 timm/vit_small_patch16_dinov3.lvd1689m · model · 71k dl · ♡ 3
- 🤗 Falconsai/nsfw_image_detection_26 · model · 3.4k dl · ♡ 3
Videos
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)· youtube
Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained· youtube
Sora - Full Analysis (with new details)· youtube
How AI Vision Evolved | Merve Noyan· youtube
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · FixRes · Vision Transformer · Byte Pair Encoding · Softmax · Adam
