Vision Transformer: Vit and its Derivatives

Zujun Fu

arXiv:2205.11239 · cs.CV · May 25, 2022 · 5 cites

Open Access

TL;DR

This paper reviews the Vision Transformer (ViT), a model that applies attention mechanisms from NLP to computer vision, achieving high performance on benchmarks and inspiring derivatives and cross-disciplinary applications.

Contribution

It provides a comprehensive review of ViT and its derivatives, highlighting its impact and adaptations in computer vision and other fields.

Findings

01. ViT achieves very good performance on several benchmarks, including ImageNet, COCO, and ADE20k.

02. ViT adapts the self-attention mechanism from NLP to images by replacing word embeddings with patch embeddings.

03. The review covers derivatives of ViT and cross-applications of ViT with other fields.

Abstract

The Transformer, an attention-based encoder-decoder architecture, has not only revolutionized natural language processing (NLP) but has also enabled pioneering work in computer vision (CV). Compared to convolutional neural networks (CNNs), the Vision Transformer (ViT) relies on its strong modeling capability to achieve very good performance on several benchmarks, such as ImageNet, COCO, and ADE20k. ViT is inspired by the self-attention mechanism in NLP, with word embeddings replaced by patch embeddings. This paper reviews the derivatives of ViT and the cross-applications of ViT with other fields.
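The replacement of word embeddings with patch embeddings can be illustrated with a minimal sketch. It is not taken from the paper: the function names `patchify` and `embed`, the toy 4x4 single-channel image, and the fixed weight matrix are all assumptions chosen for illustration. In an actual ViT the projection weights are learned, and a class token and position embeddings are added to the token sequence; both are omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    # Walk the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append every channel of this pixel
            patches.append(patch)
    return patches


def embed(patches, weights):
    """Project each flattened patch with a (patch_dim x embed_dim) weight matrix."""
    return [
        [sum(p * w for p, w in zip(patch, column)) for column in zip(*weights)]
        for patch in patches
    ]


# Toy 4x4 image with one channel; each pixel value encodes its position.
image = [[[row * 4 + col] for col in range(4)] for row in range(4)]

# Four 2x2 patches, each flattened to 2 * 2 * 1 = 4 values.
patches = patchify(image, patch_size=2)

# Hypothetical fixed 4 x 2 projection (learned in a real model).
weights = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# One embedding vector ("token") per patch, analogous to one per word in NLP.
tokens = embed(patches, weights)
```

Each entry of `tokens` plays the role that a word embedding plays in an NLP Transformer, so the resulting sequence can be passed to a standard self-attention encoder.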

Taxonomy

Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Label Smoothing