VPNeXt -- Rethinking Dense Decoding for Plain Vision Transformer

Xikai Tang; Ye Huang; Guangqiang Yin; Lixin Duan

arXiv:2502.16654 · cs.CV · September 30, 2025

Open Access

TL;DR

VPNeXt is a simplified dense decoding approach for the Plain Vision Transformer (ViT). Its two modules, Visual Context Replay (VCR) and ViTUp, improve semantic segmentation performance, and the paper reports state-of-the-art results.

Contribution

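The paper proposes VPNeXt, a model that simplifies dense decoding for the Plain ViT. Visual Context Replay (VCR) achieves effects similar to a complex Transformer Mask Decoder more efficiently, and the ViTUp module upsamples from the ViT's real pyramid feature rather than a mock pyramid feature. The paper reports state-of-the-art results.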

Findings

01. Achieved state-of-the-art performance on semantic segmentation tasks.

02. Significantly surpassed the mIoU barrier on the VOC2012 dataset.

03. Validated the effectiveness of the approach through ablation studies and visualizations.

Abstract

We present VPNeXt, a new and simple model for the Plain Vision Transformer (ViT). Unlike the many related studies that share the same homogeneous paradigms, VPNeXt offers a fresh perspective on dense representation based on ViT. In more detail, the proposed VPNeXt addresses two concerns about the existing paradigm: (1) Is it necessary to use a complex Transformer Mask Decoder architecture to obtain good representations? (2) Does the Plain ViT really need to depend on the mock pyramid feature for upsampling? For (1), we investigated the potential underlying reasons that contributed to the effectiveness of the Transformer Decoder and introduced the Visual Context Replay (VCR) to achieve similar effects efficiently. For (2), we introduced the ViTUp module. This module fully utilizes the previously overlooked ViT real pyramid feature to achieve better upsampling results compared to the…

Taxonomy

Topics: Image Processing Techniques and Applications · Retinal Imaging and Analysis · CCD and CMOS Imaging Sensors

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Byte Pair Encoding · Dense Connections · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer