Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Bichen Wu; Chenfeng Xu; Xiaoliang Dai; Alvin Wan; Peizhao Zhang; Zhicheng Yan; Masayoshi Tomizuka; Joseph Gonzalez; Kurt Keutzer; Peter Vajda

arXiv:2006.03677 · cs.CV · November 23, 2020 · 378 cites

TL;DR

This paper introduces Visual Transformers that operate on semantic tokens instead of pixels, enabling more efficient and accurate image understanding by modeling relationships across image parts with less computation.

Contribution

It proposes a novel semantic token-based transformer approach for computer vision, outperforming convolutional models in accuracy and efficiency.

Findings

1. ResNet top-1 accuracy on ImageNet improves by 4.6 to 7 points.

2. FPN-based semantic segmentation achieves 0.35 points higher mIoU.

3. The FPN module's FLOPs are reduced by 6.5x.

Abstract

Computer vision has achieved remarkable success by (a) representing images as uniformly-arranged pixel arrays and (b) convolving highly-localized features. However, convolutions treat all image pixels equally regardless of importance; explicitly model all concepts across all images, regardless of content; and struggle to relate spatially-distant concepts. In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships. Critically, our Visual Transformer operates in a semantic token space, judiciously attending to different image parts based on context. This is in sharp contrast to pixel-space transformers that require orders-of-magnitude more compute. Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts, raising ResNet accuracy on ImageNet top-1…
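The two steps named in the abstract, pooling pixels into a small set of semantic tokens and then relating those tokens with attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`tokenize`, `token_self_attention`), the shapes, the softmax-over-pixels pooling, and the random weights are all assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(pixels, w_a):
    # Hypothetical tokenizer: pixels is (HW, C), w_a is (C, L).
    # Each of the L tokens is a weighted average over all pixels,
    # with weights given by a softmax over the spatial dimension.
    attn = softmax(pixels @ w_a, axis=0)   # (HW, L), columns sum to 1
    return attn.T @ pixels                 # (L, C)

def token_self_attention(tokens, w_q, w_k, w_v):
    # Single-head self-attention among tokens with a residual connection.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (L, L)
    return tokens + scores @ v

# Assumed sizes: a 14x14 feature map with 32 channels, pooled into 8 tokens.
rng = np.random.default_rng(0)
H, W, C, L = 14, 14, 32, 8
feature_map = rng.standard_normal((H, W, C))
pixels = feature_map.reshape(H * W, C)

w_a = rng.standard_normal((C, L)) * 0.1
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

tokens = tokenize(pixels, w_a)
out = token_self_attention(tokens, w_q, w_k, w_v)
print(tokens.shape, out.shape)  # (8, 32) (8, 32)
```

With these assumed sizes, the token-space attention matrix has 8 × 8 = 64 entries, whereas attention over all 196 pixel positions would need 196 × 196 = 38,416. That gap is the intuition behind the abstract's contrast with pixel-space transformers.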

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Multi-Head Attention · Byte Pair Encoding · Softmax · Adam · Attention Is All You Need · Dropout · Layer Normalization