Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

TL;DR
This paper introduces Pyramid Vision Transformer (PVT), a convolution-free backbone for dense prediction tasks. It pairs a Transformer encoder with high-resolution outputs and reduced computation, and reports higher COCO AP than ResNet50-based models.
Contribution
The paper proposes PVT, a novel Transformer-based backbone that supports high-resolution dense predictions and reduces computation, bridging the gap between CNNs and Transformers for vision tasks.
Findings
PVT achieves higher AP on COCO compared to ResNet50-based models.
PVT effectively boosts performance in object detection and segmentation tasks.
PVT demonstrates competitive results with fewer parameters.
Abstract
Although using convolutional neural networks (CNNs) as backbones achieves great successes in computer vision, this work investigates a simple backbone network useful for many dense prediction tasks without convolutions. Unlike the recently proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to prior arts. (1) Different from ViT, which typically has low-resolution outputs and high computational and memory cost, PVT can not only be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computation of large feature maps. (2) PVT inherits the advantages from both CNN and…
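To make the "progressive shrinking pyramid" concrete, here is a minimal sketch that counts tokens per stage and estimates how many query-key pairs attention has to score. The stage strides (4, 8, 16, 32), the per-side reduction ratio of 8 in the first stage, and the helper names `pyramid_token_counts` and `attention_pairs` are assumptions for illustration; none of them is stated in this summary, so check them against the paper or the released code before relying on them.

```python
def pyramid_token_counts(image_size=224, strides=(4, 8, 16, 32)):
    """Return (side_length, token_count) for each stage of a square input.

    The strides are an assumed four-stage layout, not taken from this page.
    """
    stages = []
    for stride in strides:
        side = image_size // stride          # feature-map side at this stage
        stages.append((side, side * side))   # one token per spatial location
    return stages


def attention_pairs(tokens, reduction=1):
    """Count query-key pairs when keys/values are shrunk by `reduction` per side.

    With reduction=1 this is plain self-attention (tokens * tokens).
    A per-side reduction R leaves tokens // R**2 keys, which is the idea
    behind spatial-reduction attention.
    """
    reduced_keys = tokens // (reduction * reduction)
    return tokens * reduced_keys


if __name__ == "__main__":
    for index, (side, tokens) in enumerate(pyramid_token_counts(), start=1):
        print(f"stage {index}: {side}x{side} = {tokens} tokens")
    first_stage_tokens = pyramid_token_counts()[0][1]
    print("full attention pairs:   ", attention_pairs(first_stage_tokens))
    print("reduced attention pairs:", attention_pairs(first_stage_tokens, reduction=8))
```

Under these assumptions, the first stage keeps a 56x56 map (far denser than a 14x14 map from 16x16 patches), while the key/value reduction cuts its attention cost by a factor of 64.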
Code & Models
- 🤗 Xrenya/pvt-tiny-224 (model) · 273 dl
- 🤗 Xrenya/pvt-small-224 (model) · 25 dl
- 🤗 Xrenya/pvt-medium-224 (model) · 25 dl
- 🤗 Xrenya/pvt-large-224 (model) · 9 dl · ♡ 1
- 🤗 Zetatech/pvt-large-224 (model) · 10 dl
- 🤗 Zetatech/pvt-medium-224 (model) · 12 dl
- 🤗 Zetatech/pvt-tiny-224 (model) · 2.8k dl
- 🤗 Zetatech/pvt-small-224 (model) · 9 dl
- 🤗 mccaly/test2 (model) · 12 dl · ♡ 1
- 🤗 qninhdt/det (model)
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Spatial-Reduction Attention · Pyramid Vision Transformer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Dense Connections · Label Smoothing · Dropout
