A Survey on Visual Transformer

Kai Han; Yunhe Wang; Hanting Chen; Xinghao Chen; Jianyuan Guo; Zhenhua Liu; Yehui Tang; An Xiao; Chunjing Xu; Yixing Xu; Zhaohui Yang; Yiman Zhang; Dacheng Tao

arXiv:2012.12556 · cs.CV · July 11, 2023 · 220 cites

TL;DR

This survey reviews the development and application of vision transformers in computer vision, covering their advantages, challenges, and future research directions across tasks, along with methods for improving efficiency.

Contribution

It categorizes vision transformer models by task, analyzes their strengths and weaknesses, and discusses recent advances and challenges in applying transformers to computer vision.

Findings

01. Transformer-based models perform similarly to or better than convolutional and recurrent networks on a variety of visual benchmarks.

02. Efficient transformer methods enable real-device applications.

03. The survey analyzes self-attention mechanisms in vision tasks.

Abstract

The transformer, first applied in natural language processing, is a type of deep neural network based mainly on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformers to computer vision tasks. On a variety of visual benchmarks, transformer-based models perform similarly to or better than other types of networks, such as convolutional and recurrent neural networks. Given their high performance and reduced need for vision-specific inductive bias, transformers are receiving more and more attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them by task and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include…
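
As an illustration of the self-attention mechanism the abstract refers to, the following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is not taken from the survey: the function names, the toy "patch" features, and the identity projection matrices are all assumptions chosen to keep the example small. Real vision transformers use learned projections, multiple heads, and tensor libraries.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative only).

    x: n tokens (e.g. image patches), each with d features.
    w_q, w_k, w_v: hypothetical d x d_k projection matrices.
    Returns (output, attention_weights).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]            # transpose of K
    scores = matmul(q, k_t)                          # pairwise token similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights               # weighted sum of values

# Toy input: three "patches" with two features each (made-up numbers).
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumption: identity instead of learned weights
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(patches, identity, identity, identity)
```

Each row of `attn` is a probability distribution over all tokens, so every output token is a weighted mix of every input token. This global mixing is what separates self-attention from the local receptive field of a convolution.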

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer