TL;DR
PVT v2 introduces key improvements to the Pyramid Vision Transformer, achieving linear complexity and superior performance on vision tasks, thus advancing transformer-based methods in computer vision.
Contribution
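The paper presents PVT v2, which adds three designs to PVT v1: a linear-complexity attention layer, overlapping patch embedding, and a convolutional feed-forward network. Together they reduce the computational complexity of PVT v1 to linear and improve performance on classification, detection, and segmentation.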
Findings
Achieves linear computational complexity.
Matches or outperforms recent transformer models such as Swin Transformer.
Enhances performance on classification, detection, and segmentation.
Abstract
Transformer recently has presented encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs, including (1) linear complexity attention layer, (2) overlapping patch embedding, and (3) convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves comparable or better performances than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer researches in computer vision. Code is available at https://github.com/whai362/PVT.
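The abstract's claim of linear complexity can be made concrete with a rough cost count. The sketch below is illustrative only and is not taken from the authors' code: the function names are hypothetical, the cost model counts only the two attention matrix products, and the specific numbers (a 7x7 pooled key/value grid, a reduction ratio of 8, 64 channels, and a 7x7 stride-4 patch embedding with padding 3) are assumptions chosen for the example.

```python
def conv_out_size(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one axis (standard floor formula)."""
    return (n + 2 * padding - kernel) // stride + 1


def attention_cost(num_queries: int, num_keys: int, channels: int) -> int:
    """Rough multiply-add count for Q @ K^T plus the weighted sum over V."""
    return 2 * num_queries * num_keys * channels


def sra_cost(h: int, w: int, channels: int, reduction: int) -> int:
    """Keys/values shrink by `reduction` per side, so their count still grows with h * w."""
    num_keys = (h // reduction) * (w // reduction)
    return attention_cost(h * w, num_keys, channels)


def linear_sra_cost(h: int, w: int, channels: int, pool_size: int = 7) -> int:
    """Keys/values are pooled to a fixed grid, so their count does not depend on h * w."""
    return attention_cost(h * w, pool_size * pool_size, channels)


if __name__ == "__main__":
    # Overlapping patch embedding (assumed 7x7 kernel, stride 4, padding 3):
    # neighbouring windows overlap, yet the token grid is still input_size / 4.
    print("tokens per side:", conv_out_size(224, kernel=7, stride=4, padding=3))

    # Doubling the feature-map side length multiplies the token count by 4.
    for side in (56, 112):
        print(
            side,
            "reduction-based:", sra_cost(side, side, channels=64, reduction=8),
            "fixed-pool:", linear_sra_cost(side, side, channels=64),
        )
```

Under these assumptions, doubling the side length multiplies the fixed-pool cost by 4 (linear in the number of tokens), while the reduction-based cost grows by 16 (quadratic in the number of tokens).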
Code & Models
- 🤗 kadirnar/timm_model_list · model · ♡ 1
- 🤗 timm/pvt_v2_b0.in1k · model · 2.3k dl · ♡ 1
- 🤗 timm/pvt_v2_b1.in1k · model · 920 dl
- 🤗 timm/pvt_v2_b2.in1k · model · 8.1k dl · ♡ 1
- 🤗 timm/pvt_v2_b2_li.in1k · model · 1.2k dl
- 🤗 timm/pvt_v2_b3.in1k · model · 362 dl
- 🤗 timm/pvt_v2_b4.in1k · model · 460 dl · ♡ 1
- 🤗 timm/pvt_v2_b5.in1k · model · 475 dl · ♡ 1
- 🤗 FoamoftheSea/pvt_v2_b0 · model · 15 dl
- 🤗 OpenGVLab/pvt_v2_b0 · model · 3.6k dl · ♡ 3
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Depthwise Convolution · Stochastic Depth · Pyramid Vision Transformer v2 · Swin Transformer
