PVT v2: Improved Baselines with Pyramid Vision Transformer

Wenhai Wang; Enze Xie; Xiang Li; Deng-Ping Fan; Kaitao Song; Ding Liang; Tong Lu; Ping Luo; Ling Shao

arXiv:2106.13797·cs.CV·April 18, 2023

TL;DR

PVT v2 improves the original Pyramid Vision Transformer (PVT v1), reducing its computational complexity to linear and improving results on classification, detection, and segmentation.

Contribution

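The paper presents PVT v2, which adds three designs to PVT v1: a linear complexity attention layer, overlapping patch embedding, and a convolutional feed-forward network. Together they reduce the computational complexity of PVT v1 to linear and give performance comparable to or better than recent works such as Swin Transformer.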

Findings

01 Achieves linear computational complexity.

02 Outperforms or matches recent transformer models.

03 Enhances performance on classification, detection, and segmentation.
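
The linear-complexity finding can be illustrated with a rough count of query-key pairs. The sketch below is not the authors' implementation. It assumes that the linear attention layer keeps a fixed number of key/value tokens regardless of input size, modeled here as a hypothetical 7 x 7 pooled grid, and it compares that against attention in which every token attends to every other token. The function names and the grid size are illustrative assumptions.

```python
def attention_pairs_full(height, width):
    """Query-key pairs when every token attends to every token."""
    tokens = height * width
    return tokens * tokens


def attention_pairs_pooled(height, width, pool=7):
    """Query-key pairs when keys/values are reduced to a fixed pool x pool grid.

    The fixed grid size (pool=7) is an assumption made for illustration.
    """
    queries = height * width
    keys = pool * pool
    return queries * keys


# Doubling the side length quadruples the number of tokens.
for side in (56, 112, 224):
    full = attention_pairs_full(side, side)
    pooled = attention_pairs_pooled(side, side)
    print(f"{side}x{side}: full={full} pooled={pooled}")
```

Under these assumptions, quadrupling the token count multiplies the full-attention pair count by 16 but the pooled count by only 4, which is the sense in which the cost grows linearly with the number of tokens.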

Abstract

Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves comparable or better performance than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
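
To make "overlapping patch embedding" concrete, the following minimal sketch lists which input positions each patch covers along one axis. It is an illustration under stated assumptions rather than the authors' code: the helper name `patch_windows` is hypothetical, and the kernel, stride, and padding values (4/4/0 for the non-overlapping case, 7/4/3 for the overlapping case) are chosen for the example and are not taken from the text above. The output length uses the standard convolution size formula.

```python
def patch_windows(length, kernel, stride, padding):
    """Return the (start, end) input range covered by each patch along one axis.

    Ranges are half-open and clipped to the unpadded input [0, length).
    """
    n_out = (length + 2 * padding - kernel) // stride + 1
    windows = []
    for i in range(n_out):
        start = i * stride - padding
        windows.append((max(start, 0), min(start + kernel, length)))
    return windows


# Illustrative settings (assumptions, not values stated in the abstract).
non_overlap = patch_windows(16, kernel=4, stride=4, padding=0)
overlap = patch_windows(16, kernel=7, stride=4, padding=3)

print("non-overlapping:", non_overlap)
print("overlapping:    ", overlap)
```

Both settings produce the same number of patches for this input, but in the overlapping case neighbouring windows share input positions, so information at patch borders is seen by more than one token.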

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Depthwise Convolution · Stochastic Depth · Pyramid Vision Transformer v2 · Swin Transformer