ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain
Md Sohag Mia, Abu Bakor Hayat Arnob, Abdu Naim, Abdullah Al Bary, Voban, Md Shariful Islam

TL;DR
This paper provides a comprehensive survey of Vision Transformers (ViTs), analyzing their applications, benefits, drawbacks, and potential for advancing various computer vision tasks compared to traditional CNNs.
Contribution
It is the first survey to categorize and evaluate ViTs across multiple CV applications, highlighting their advantages and outlining future research directions.
Findings
ViTs outperform CNNs in several vision benchmarks.
ViTs are applicable to diverse CV tasks such as classification and segmentation.
The survey identifies open challenges and future research opportunities in ViT development.
Abstract
The transformer architecture is the de facto standard for natural language processing tasks, and its success there has recently drawn the interest of computer vision researchers. Compared with Convolutional Neural Networks (CNNs), Vision Transformers (ViTs) are becoming an increasingly popular and dominant solution for many vision problems. Transformer-based models outperform other types of networks, such as convolutional and recurrent neural networks, on a range of visual benchmarks. In this work, we evaluate various vision transformer models by grouping them by task and examining their benefits and drawbacks. ViTs can overcome several of the difficulties that CNNs face. The goal of this survey is to show the first use of ViTs in CV. In the first phase, we categorize various CV applications where ViTs…
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Softmax · Dense Connections · Vision Transformer
