Vision Transformer: ViT and its Derivatives
Zujun Fu

TL;DR
This paper reviews the Vision Transformer (ViT), a model that applies attention mechanisms from NLP to computer vision, achieving high performance on benchmarks and inspiring derivatives and cross-disciplinary applications.
Contribution
It provides a comprehensive review of ViT and its derivatives, highlighting its impact and adaptations in computer vision and other fields.
Findings
ViT achieves very good performance on benchmarks such as ImageNet, COCO, and ADE20k.
ViT's self-attention mechanism effectively models visual data.
Multiple derivatives and cross-application areas of ViT are explored.
Abstract
The Transformer, an attention-based encoder-decoder architecture, has not only revolutionized natural language processing (NLP) but has also driven pioneering work in computer vision (CV). Compared with convolutional neural networks (CNNs), the Vision Transformer (ViT) relies on its strong modeling capacity to achieve very good performance on several benchmarks, including ImageNet, COCO, and ADE20k. ViT is inspired by the self-attention mechanism used in NLP, with word embeddings replaced by patch embeddings. This paper reviews the derivatives of ViT and the cross-applications of ViT with other fields.
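The idea of replacing word embeddings with patch embeddings can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the image size, patch size, embedding width, random weights, and the helper names `patchify` and `self_attention` are all assumptions chosen for illustration, and it omits the class token, position embeddings, multiple heads, and layer normalization that a full ViT would use.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes: a 32x32 RGB image, 8x8 patches, 16-dimensional tokens.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patch_size, dim = 8, 16

patches = patchify(image, patch_size)  # one row per patch
# Linear projection: plays the role that the word-embedding lookup plays in NLP.
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], dim))
tokens = patches @ w_embed

# Random (untrained) projection matrices, for shape illustration only.
w_q, w_k, w_v = (rng.normal(scale=0.02, size=(dim, dim)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each patch becomes one token, so the attention matrix `weights` relates every patch to every other patch in the image, which is the sense in which self-attention models the visual data globally.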
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Label Smoothing
