Visual Transformers: Token-based Image Representation and Processing for Computer Vision
Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, Peter Vajda

TL;DR
This paper introduces Visual Transformers that operate on semantic tokens instead of pixels, enabling more efficient and accurate image understanding by modeling relationships across image parts with less computation.
Contribution
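The paper proposes the Visual Transformer (VT), which represents an image as semantic visual tokens and runs a transformer over those tokens rather than over pixels. The resulting models outperform their convolutional counterparts in accuracy and efficiency.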
Findings
ResNet top-1 accuracy on ImageNet improves by 4.6 to 7 points.
VT-based FPN segmentation achieves 0.35 points higher mIoU.
FLOPs of the FPN module are reduced by 6.5x.
Abstract
Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships. Critically, our Visual Transformer operates in a semantic token space, judiciously attending to different image parts based on context. This is in sharp contrast to pixel-space transformers that require orders-of-magnitude more compute. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts, raising ResNet accuracy on ImageNet top-1…
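The two steps the abstract describes, forming a few semantic tokens from a pixel feature map and then modeling token relationships with attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the attention-weighted pooling used to form tokens, the function names `tokenize` and `self_attention`, the random weights, and the sizes (a 14x14 feature map, 64 channels, 16 tokens) are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(X, W_A):
    """Pool pixel features into a few tokens (illustrative assumption).

    X:   (HW, C) flattened pixel features.
    W_A: (C, L) hypothetical projection giving one spatial attention map per token.
    Returns (L, C): each token is a weighted average of pixel features.
    """
    A = softmax(X @ W_A, axis=0)  # (HW, L); each column sums to 1 over pixels
    return A.T @ X                # (L, C)

def self_attention(T, W_q, W_k, W_v):
    """Single-head self-attention over tokens T of shape (L, C), with a residual."""
    Q, K, V = T @ W_q, T @ W_k, T @ W_v
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)  # (L, L)
    return T + scores @ V

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
H, W, C, L = 14, 14, 64, 16
X = rng.standard_normal((H * W, C))
W_A = rng.standard_normal((C, L)) * 0.1
W_q, W_k, W_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

tokens = tokenize(X, W_A)
out = self_attention(tokens, W_q, W_k, W_v)
print(tokens.shape, out.shape)   # (16, 64) (16, 64)

# Number of pairwise attention entries in pixel space vs. token space.
pixel_pairs = (H * W) ** 2
token_pairs = L ** 2
print(pixel_pairs, token_pairs)  # 38416 256
```

With these assumed sizes, attention over 16 tokens touches 256 pairs, while attention over all 196 pixels would touch 38,416. The gap widens quadratically as the feature map grows, which is the intuition behind the abstract's contrast with pixel-space transformers.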
Code & Models
- 🤗 google/vit-base-patch16-224 · model · 4.3M dl · ♡ 947
- 🤗 google/vit-base-patch16-224-in21k · model · 4.3M dl · ♡ 404
- 🤗 facebook/deit-base-distilled-patch16-224 · model · 7.4k dl · ♡ 33
- 🤗 facebook/deit-base-distilled-patch16-384 · model · 76k dl · ♡ 8
- 🤗 facebook/deit-base-patch16-224 · model · 24k dl · ♡ 15
- 🤗 facebook/deit-base-patch16-384 · model · 219 dl · ♡ 3
- 🤗 facebook/deit-small-distilled-patch16-224 · model · 1.2k dl · ♡ 7
- 🤗 facebook/deit-small-patch16-224 · model · 14k dl · ♡ 11
- 🤗 facebook/deit-tiny-distilled-patch16-224 · model · 548 dl · ♡ 9
- 🤗 facebook/deit-tiny-patch16-224 · model · 126k dl · ♡ 12
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Multi-Head Attention · Byte Pair Encoding · Softmax · Adam · Attention Is All You Need · Dropout · Layer Normalization
