A Survey of Visual Transformers

Yang Liu; Yao Zhang; Yixin Wang; Feng Hou; Jin Yuan; Jiang Tian; Yang Zhang; Zhongchao Shi; Jianping Fan; Zhiqiang He

arXiv:2111.06091·cs.CV·December 7, 2022·42 cites

TL;DR

This survey comprehensively reviews over one hundred visual Transformers, analyzing their structures, applications, and performance across fundamental computer vision tasks and data types, highlighting their advantages over CNNs.

Contribution

It provides a detailed taxonomy, comparative evaluation, and insights into unexploited aspects of visual Transformers, guiding future research directions in the field.

Findings

01. Visual Transformers outperform CNNs on multiple benchmarks.
02. A taxonomy organizes methods by motivation, structure, and application.
03. The survey identifies key unexploited aspects that could enhance Transformer performance.

Abstract

Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by these achievements, pioneering works have recently applied Transformer-like architectures to the computer vision (CV) field, demonstrating their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as on multiple sensory data streams (images, point clouds, and vision-language data). Because of their competitive modeling capabilities, visual Transformers have achieved impressive performance improvements over modern Convolutional Neural Networks (CNNs) on multiple benchmarks. In this survey, we comprehensively review over one hundred different visual Transformers according to three fundamental CV tasks and different data stream types, where a…

Code & Models

Repositories

liuyang-ict/awesome-visual-transformers (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Byte Pair Encoding · Dense Connections