Rethinking Spatial Dimensions of Vision Transformers
Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh

TL;DR
This paper explores the importance of spatial dimension reduction in Vision Transformers, proposing PiT, a pooling-based model that improves performance across multiple vision tasks.
Contribution
It introduces PiT, a novel pooling-based Vision Transformer that leverages spatial dimension reduction principles inspired by CNNs, enhancing performance and generalization.
Findings
PiT outperforms ViT on image classification tasks.
PiT demonstrates improved object detection capabilities.
PiT shows increased robustness in evaluations.
Abstract
Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as being an alternative architecture against the existing convolutional neural networks (CNN). Since the transformer-based architecture has been innovative for computer vision modeling, the design convention towards an effective architecture has been less studied yet. From the successful design principles of CNN, we investigate the role of spatial dimension conversion and its effectiveness on transformer-based architecture. We particularly attend to the dimension reduction principle of CNNs; as the depth increases, a conventional CNN increases channel dimension and decreases spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) upon…
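The dimension reduction principle described in the abstract (spatial size shrinks while channel width grows with depth) can be illustrated with a small shape trace. This is a minimal sketch, not the authors' implementation: the function name `trace_stage_shapes`, the stride of 2, the channel factor of 2, and the example sizes are all assumptions chosen for illustration.

```python
def trace_stage_shapes(side, channels, num_stages, stride=2, channel_factor=2):
    """Return the (height, width, channels) of the token grid at each stage.

    Assumption (not taken from the source): between consecutive stages a
    pooling step divides each spatial side by `stride` (rounding up) and
    multiplies the channel width by `channel_factor`.
    """
    shapes = []
    for _ in range(num_stages):
        shapes.append((side, side, channels))
        side = -(-side // stride)   # ceiling division keeps odd sides valid
        channels *= channel_factor  # widen channels as the grid shrinks
    return shapes


if __name__ == "__main__":
    # Hypothetical starting point: a 32x32 token grid with 64 channels.
    for height, width, ch in trace_stage_shapes(32, 64, 3):
        print(f"{height}x{width} grid, {height * width} tokens, {ch} channels")
```

Under these assumed settings, the number of spatial tokens drops by a factor of four per stage while the channel width doubles, which is the CNN-style trade-off the abstract refers to. A plain ViT, by contrast, would keep the same grid and width at every depth.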
Code & Models
- 🤗 kadirnar/timm_model_list · model · ♡ 1
- 🤗 timm/pit_b_224.in1k · model · 8.0k dl · ♡ 1
- 🤗 timm/pit_b_distilled_224.in1k · model · 106 dl
- 🤗 timm/pit_s_224.in1k · model · 907 dl
- 🤗 timm/pit_s_distilled_224.in1k · model · 1.3k dl
- 🤗 timm/pit_ti_224.in1k · model · 495 dl
- 🤗 timm/pit_ti_distilled_224.in1k · model · 75 dl
- 🤗 timm/pit_xs_224.in1k · model · 76 dl
- 🤗 timm/pit_xs_distilled_224.in1k · model · 71 dl · ♡ 1
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Attention Is All You Need · Dropout · Residual Connection · Byte Pair Encoding · Layer Normalization
