Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Wenhai Wang; Enze Xie; Xiang Li; Deng-Ping Fan; Kaitao Song; Ding Liang; Tong Lu; Ping Luo; Ling Shao

arXiv:2102.12122 · cs.CV · August 12, 2021 · 58 cites

TL;DR

This paper introduces Pyramid Vision Transformer (PVT), a convolution-free backbone for dense prediction tasks. It pairs the Transformer's strengths with high-resolution outputs and reduced computation, and reports higher COCO AP than ResNet50-based models.

Contribution

The paper proposes PVT, a novel Transformer-based backbone that supports high-resolution dense predictions and reduces computation, bridging the gap between CNNs and Transformers for vision tasks.

Findings

01

PVT achieves higher AP on COCO compared to ResNet50-based models.

02

PVT effectively boosts performance in object detection and segmentation tasks.

03

PVT demonstrates competitive results with fewer parameters.

Abstract

Although convolutional neural networks (CNNs) have achieved great success as backbones in computer vision, this work investigates a simple backbone network, free of convolutions, that is useful for many dense prediction tasks. Unlike the recently proposed Transformer model (e.g., ViT), which is designed specifically for image classification, we propose Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to prior art. (1) Unlike ViT, which typically has low-resolution outputs and high computational and memory costs, PVT can not only be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computation of large feature maps. (2) PVT inherits the advantages from both CNN and…
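The effect of a progressive shrinking pyramid on computation can be illustrated with simple arithmetic. The sketch below is not taken from the paper's code. It assumes four stages with strides 4, 8, 16, and 32 relative to the input (values the abstract excerpt above does not state), and the helper names `pyramid_token_counts` and `attention_pairs` are hypothetical.

```python
def pyramid_token_counts(height, width, strides=(4, 8, 16, 32)):
    """Return a list of (stride, token_count) pairs, one per pyramid stage.

    The strides are an assumption made for illustration; each stage is
    treated as a feature map of size (height // stride, width // stride).
    """
    counts = []
    for stride in strides:
        tokens = (height // stride) * (width // stride)
        counts.append((stride, tokens))
    return counts


def attention_pairs(tokens):
    """Number of query-key pairs in full self-attention over `tokens` tokens."""
    return tokens * tokens


if __name__ == "__main__":
    # A 224x224 input: the token count shrinks by 4x at every stage,
    # so the number of attention pairs shrinks by 16x.
    for stride, tokens in pyramid_token_counts(224, 224):
        print(f"stride {stride:>2}: {tokens:>5} tokens, "
              f"{attention_pairs(tokens):>8} attention pairs")
```

Under these assumptions, the first stage keeps a fine 56x56 grid for dense prediction, while later stages work on far fewer tokens, which is where the reduction in computation comes from.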

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Spatial-Reduction Attention · Pyramid Vision Transformer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Dense Connections · Label Smoothing · Dropout