VPNeXt -- Rethinking Dense Decoding for Plain Vision Transformer
Xikai Tang, Ye Huang, Guangqiang Yin, Lixin Duan

TL;DR
VPNeXt introduces a simplified yet effective dense decoding approach for the Plain Vision Transformer (ViT). Its two modules, Visual Context Replay (VCR) and ViTUp, improve semantic segmentation performance and reach state-of-the-art results.
Contribution
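The paper proposes VPNeXt, a model that simplifies dense decoding for the Plain ViT. It uses the Visual Context Replay (VCR) module in place of a complex Transformer Mask Decoder and the ViTUp module in place of mock pyramid features for upsampling, achieving state-of-the-art results.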
Findings
Achieved state-of-the-art performance on semantic segmentation tasks.
Significantly surpassed the previous mIoU barrier on the VOC2012 dataset.
Validated effectiveness through ablation studies and visualizations.
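The findings above are reported in mIoU (mean intersection over union), the standard semantic segmentation metric. As a reminder of what that number measures, here is a minimal NumPy sketch. It is not the authors' evaluation code: the function name `mean_iou`, its parameters, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration. The default `ignore_index=255` follows the common convention for the VOC void label.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    Hypothetical helper for illustration; not taken from the VPNeXt paper.
    Assumes every non-ignored label lies in [0, num_classes).
    """
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()

    # Drop pixels marked as "ignore" in the ground truth.
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # confusion[i, j] = number of pixels with true class i predicted as class j.
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection

    # Average only over classes with at least one predicted or true pixel.
    present = union > 0
    return float((intersection[present] / union[present]).mean())


if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 255]]  # the last pixel is ignored
    print(mean_iou(pred, target, num_classes=3))
```

In practice the confusion matrix is accumulated over the whole validation set before the per-class IoU is computed, rather than averaging per-image scores.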
Abstract
We present VPNeXt, a new and simple model for the Plain Vision Transformer (ViT). Unlike the many related studies that share the same homogeneous paradigms, VPNeXt offers a fresh perspective on dense representation based on ViT. In more detail, the proposed VPNeXt addresses two concerns about the existing paradigm: (1) Is it necessary to use a complex Transformer Mask Decoder architecture to obtain good representations? (2) Does the Plain ViT really need to depend on the mock pyramid feature for upsampling? For (1), we investigated the potential underlying reasons that contributed to the effectiveness of the Transformer Decoder and introduced the Visual Context Replay (VCR) to achieve similar effects efficiently. For (2), we introduced the ViTUp module. This module fully utilizes the previously overlooked ViT real pyramid feature to achieve better upsampling results compared to the…
Taxonomy
Topics: Image Processing Techniques and Applications · Retinal Imaging and Analysis · CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer
