SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers
Bowen Zhang, Liyang Liu, Minh Hieu Phan, Zhi Tian, Chunhua Shen, Yifan Liu

TL;DR
SegViTv2 introduces a lightweight, efficient Vision Transformer-based framework for semantic segmentation, featuring a novel Attention-to-Mask decoder and a cost-effective encoder structure, with strong performance and continual learning capabilities.
Contribution
The paper presents SegViTv2, a novel semantic segmentation model that significantly reduces computational cost while improving accuracy, and extends it for continual learning with minimal forgetting.
Findings
Outperforms UPerNet with various ViT backbones
Reduces encoder computation by up to 50%
Achieves state-of-the-art results on ADE20K, COCO-Stuff-10k, PASCAL-Context
Abstract
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework and introduces SegViTv2. In this study, we introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only a fraction of the computational cost. For the encoder, we address the concern of the relatively high computational cost in ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based upsampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining…
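The idea of turning attention into masks can be illustrated with a small sketch. The abstract does not give the exact formulation, so the code below rests on stated assumptions: each class has one learnable query vector, similarities are scaled dot products between queries and encoder patch tokens, and a sigmoid maps each similarity to a per-patch mask value. The function name `attention_to_mask` and all parameter names are hypothetical and chosen for illustration only.

```python
import math


def attention_to_mask(class_queries, patch_tokens, height, width):
    """Turn query-to-patch similarities into one soft mask per class query.

    class_queries: list of N vectors (length d), one per candidate class.
    patch_tokens:  list of height*width vectors (length d) from the encoder.
    Returns N masks, each a height x width grid of values in (0, 1).
    """
    dim = len(class_queries[0])
    scale = 1.0 / math.sqrt(dim)
    masks = []
    for query in class_queries:
        # Scaled dot-product similarity between this query and every patch.
        sims = [scale * sum(q * k for q, k in zip(query, token))
                for token in patch_tokens]
        # Assumption: a sigmoid (rather than a softmax over patches) gives
        # each patch an independent score for this class.
        flat = [1.0 / (1.0 + math.exp(-s)) for s in sims]
        # Reshape the flat patch sequence back into the 2-D patch grid.
        masks.append([flat[r * width:(r + 1) * width] for r in range(height)])
    return masks


# Toy example: two class queries over a 2 x 2 grid of patch tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
masks = attention_to_mask(queries, tokens, height=2, width=2)
```

In this toy input, patches aligned with a query score close to 1 in that query's mask, while orthogonal patches sit at 0.5. A real decoder would operate on batched tensors and combine the masks with per-query class predictions; that part is outside what the abstract describes.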
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
