Rethinking Spatial Dimensions of Vision Transformers

Byeongho Heo; Sangdoo Yun; Dongyoon Han; Sanghyuk Chun; Junsuk Choe; Seong Joon Oh

arXiv:2103.16302 · cs.CV · August 19, 2021 · 67 cites

TL;DR

This paper explores the importance of spatial dimension reduction in Vision Transformers, proposing PiT, a pooling-based model that improves performance across multiple vision tasks.

Contribution

It introduces PiT, a novel pooling-based Vision Transformer that leverages spatial dimension reduction principles inspired by CNNs, enhancing performance and generalization.

Findings

01 PiT outperforms ViT on image classification tasks.

02 PiT demonstrates improved object detection capabilities.

03 PiT shows increased robustness in evaluations.

Abstract

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as being an alternative architecture against the existing convolutional neural networks (CNN). Since the transformer-based architecture has been innovative for computer vision modeling, the design convention towards an effective architecture has been less studied yet. From the successful design principles of CNN, we investigate the role of spatial dimension conversion and its effectiveness on transformer-based architecture. We particularly attend to the dimension reduction principle of CNNs; as the depth increases, a conventional CNN increases channel dimension and decreases spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) upon…
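The dimension reduction principle described in the abstract (spatial size shrinks while channel width grows with depth) can be illustrated with a small shape calculation. This is only a sketch under stated assumptions: the helper names `Stage`, `pooled_schedule`, and `constant_schedule`, the halving and doubling factors, and the example sizes are all hypothetical and are not taken from the paper's PiT configurations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """Shape of the token grid at one stage (hypothetical illustration type)."""
    index: int
    grid: int      # tokens per side of the square spatial grid
    channels: int  # embedding width of each token

    @property
    def tokens(self) -> int:
        # Number of spatial tokens the attention layers operate on.
        return self.grid * self.grid


def pooled_schedule(grid: int, channels: int, num_stages: int,
                    stride: int = 2, widen: int = 2) -> List[Stage]:
    """CNN-style schedule: between stages, divide the grid side by `stride`
    and multiply the channel width by `widen` (both factors are assumptions)."""
    stages = []
    for i in range(num_stages):
        stages.append(Stage(i, grid, channels))
        grid = max(1, grid // stride)
        channels *= widen
    return stages


def constant_schedule(grid: int, channels: int, num_stages: int) -> List[Stage]:
    """ViT-style schedule: the same grid and width at every stage."""
    return [Stage(i, grid, channels) for i in range(num_stages)]


if __name__ == "__main__":
    # Illustrative sizes only, not a configuration from the paper.
    for name, schedule in (("constant", constant_schedule(16, 64, 3)),
                           ("pooled", pooled_schedule(16, 64, 3))):
        for s in schedule:
            print(name, "stage", s.index, "grid", s.grid,
                  "tokens", s.tokens, "channels", s.channels)
```

With these illustrative factors, each pooling step cuts the token count by four while doubling the width, whereas the constant schedule keeps both fixed across depth.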

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Attention Is All You Need · Dropout · Residual Connection · Byte Pair Encoding · Layer Normalization