Transformers in Vision: A Survey
Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah

TL;DR
This survey reviews Transformer models in computer vision: their advantages over convolutional and recurrent networks, the vision tasks they have been applied to, and open research directions, with emphasis on their scalability and their ability to handle multiple modalities.
Contribution
It provides a comprehensive overview of Transformer-based methods in vision, covering fundamental concepts, applications, comparisons, and future research directions.
Findings
Transformers enable modeling long-range dependencies in vision tasks.
They support multi-modal processing with minimal inductive biases.
Transformers demonstrate scalability to large datasets and models.
Abstract
Astounding results from Transformer models on natural language tasks have prompted the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, unlike recurrent networks such as long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey…
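The two properties named in the abstract, long-range dependencies and suitability as set-functions, can be illustrated with a small sketch of scaled dot-product self-attention. This is not code from the survey. It assumes, purely for illustration, that queries, keys and values are the input tokens themselves (no learned projection weights, no positional encoding), and the names `self_attention`, `tokens` and `perm` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention.

    Illustrative assumption: Q, K and V are the tokens themselves
    (identity projections), and no positional encoding is added.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Score the query against every token, however far apart they are.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                        for i in range(d)])
    return outputs, weights

# Four toy 2-dimensional tokens (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out, attn = self_attention(tokens)

# Reorder the inputs and run attention again.
perm = [2, 0, 3, 1]
out_perm, _ = self_attention([tokens[i] for i in perm])
```

Every row of `attn` assigns a nonzero weight to every token, so the first token draws on the last one in a single step rather than through a chain of recurrent updates. Without positional encodings, reordering the inputs only reorders the outputs (`out_perm[j]` matches `out[perm[j]]` up to floating-point rounding), which is the sense in which the operation treats its input as a set.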
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dropout · Byte Pair Encoding · Dense Connections · Label Smoothing · Multi-Head Attention · Attention Is All You Need
