Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work
Khawar Islam

TL;DR
This survey reviews recent developments in Vision Transformers, comparing their performance, strengths, and limitations, and discusses future research directions in the field of computer vision.
Contribution
It provides a comprehensive overview of recent ViT methods, analyzing their strengths, weaknesses, computational costs, and benchmarking performance against CNNs.
Findings
ViTs outperform CNNs on several vision tasks.
Current ViTs face limitations in computational efficiency.
Future research should address scalability and robustness.
Abstract
Vision Transformers (ViTs) are becoming an increasingly popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs). As an in-demand technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships. In this paper, we begin by introducing the fundamental concepts and background of the self-attention mechanism. Next, we provide a comprehensive overview of recent top-performing ViT methods, described in terms of their strengths and weaknesses, computational cost, and training and testing datasets. We thoroughly compare the performance of various ViT algorithms and the most representative CNN methods on popular benchmark datasets. Finally, we explore some limitations with insightful observations and provide further research directions. The project page along with the collections of papers are…
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Generative Adversarial Networks and Image Synthesis
