A Survey of Visual Transformers
Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, Zhiqiang He

TL;DR
This survey comprehensively reviews over one hundred visual Transformers, analyzing their structures, applications, and performance across fundamental computer vision tasks and data types, highlighting their advantages over CNNs.
Contribution
It provides a detailed taxonomy, comparative evaluation, and insights into unexploited aspects of visual Transformers, guiding future research directions in the field.
Findings
Visual Transformers outperform CNNs on multiple benchmarks.
A taxonomy organizes methods by motivation, structure, and application.
The survey identifies key unexploited aspects that could enhance Transformer performance.
Abstract
Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently employed Transformer-like architectures in the computer vision (CV) field, demonstrating their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as on multiple sensory data streams (images, point clouds, and vision-language data). Because of their competitive modeling capabilities, visual Transformers have achieved impressive performance improvements over multiple benchmarks compared with modern Convolutional Neural Networks (CNNs). In this survey, we comprehensively review over one hundred different visual Transformers according to three fundamental CV tasks and different data stream types, where a…
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Byte Pair Encoding · Dense Connections
