SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers

Bowen Zhang; Liyang Liu; Minh Hieu Phan; Zhi Tian; Chunhua Shen; Yifan Liu

arXiv:2306.06289·cs.CV·August 31, 2023·1 cites

TL;DR

SegViTv2 introduces a lightweight, efficient Vision Transformer-based framework for semantic segmentation, featuring a novel Attention-to-Mask decoder and a cost-effective encoder structure, with strong performance and continual learning capabilities.

Contribution

The paper presents SegViTv2, a novel semantic segmentation model that significantly reduces computational cost while improving accuracy, and extends it for continual learning with minimal forgetting.

Findings

01 Outperforms UPerNet with various ViT backbones

02 Reduces encoder computation by up to 50%

03 Achieves state-of-the-art results on ADE20K, COCO-Stuff-10K, PASCAL-Context

Abstract

This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework and introduces SegViTv2. In this study, we introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms the popular decoder UPerNet using various ViT backbones while consuming only about 5% of the computational cost. For the encoder, we address the concern of the relatively high computational cost in the ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based upsampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining…
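
The abstract says that ATM converts a global attention map into semantic masks. The following is a minimal sketch of that idea, not the authors' implementation. It assumes one query vector per class, a scaled dot-product similarity between each query and every patch token, and a sigmoid to turn similarities into soft masks. The function name `attention_to_mask` and its arguments are hypothetical and do not come from the SegViT repository.

```python
import math


def attention_to_mask(class_queries, patch_tokens, grid_hw):
    """Turn query-to-patch similarity into one soft mask per class query.

    class_queries: list of N vectors (length d), one per class (assumed).
    patch_tokens:  list of H*W vectors (length d) from a ViT encoder.
    grid_hw:       (H, W) patch grid used to reshape each flat row.
    Returns a list of N masks, each an H x W nested list with values in (0, 1).
    """
    h, w = grid_hw
    if len(patch_tokens) != h * w:
        raise ValueError("patch_tokens must contain H*W vectors")
    d = len(class_queries[0])
    scale = 1.0 / math.sqrt(d)  # standard attention scaling
    masks = []
    for q in class_queries:
        # Scaled dot-product similarity between this query and every patch.
        sims = [scale * sum(qi * ki for qi, ki in zip(q, k))
                for k in patch_tokens]
        # Assumption: sigmoid instead of softmax, so masks stay independent.
        probs = [1.0 / (1.0 + math.exp(-s)) for s in sims]
        # Reshape the flat row of H*W scores back onto the patch grid.
        masks.append([probs[r * w:(r + 1) * w] for r in range(h)])
    return masks


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    patches = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0], [-4.0, 0.0]]
    for mask in attention_to_mask(queries, patches, (2, 2)):
        print(mask)
```

In this sketch each mask is read directly off the attention similarities, so no separate per-pixel classifier is needed. That is one plausible reading of why such a decoder can be lightweight, but the actual ATM design should be checked against the paper and the official code.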

Code & Models

Repositories

zbwxp/SegVit (PyTorch, official)

Models

Akide/SegViTv1 (Hugging Face model)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications