SegViT: Semantic Segmentation with Plain Vision Transformers

Bowen Zhang; Zhi Tian; Quan Tang; Xiangxiang Chu; Xiaolin Wei; Chunhua Shen; Yifan Liu

arXiv:2210.05844 · cs.CV · December 13, 2022 · 75 cites

TL;DR

SegViT generates semantic segmentation masks directly from the attention maps of a plain Vision Transformer, reaching state-of-the-art results on COCO-Stuff-10K and PASCAL-Context, and adds a Shrunk structure that reduces computational cost.

Contribution

The paper presents the Attention-to-Mask (ATM) module for segmentation and a Shrunk structure that lowers computational cost in Vision Transformer-based models.

Findings

01

SegViT outperforms counterparts built on the plain ViT backbone on ADE20K and sets a new state of the art on COCO-Stuff-10K and PASCAL-Context.

02

The ATM module effectively generates segmentation masks from attention maps.

03

The Shrunk structure reduces computation by up to 40% while maintaining performance.

Abstract

We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegViT. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. In contrast, we use the fundamental component of the ViT, the attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and…
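
The Attention-to-Mask idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `attention_to_mask`, the scaled dot-product similarity, the sigmoid used to turn similarities into masks, and the random stand-ins for learned class tokens and ViT features are all assumptions made for illustration.

```python
import numpy as np


def attention_to_mask(class_tokens, features, height, width):
    """Turn token-to-feature similarity maps into per-class masks.

    class_tokens: (num_classes, dim) array, one token per class.
    features:     (height * width, dim) array of spatial features.
    Returns an array of shape (num_classes, height, width).
    """
    dim = class_tokens.shape[1]
    # Similarity between every class token and every spatial location
    # (assumed here to be a scaled dot product).
    similarity = class_tokens @ features.T / np.sqrt(dim)
    # Assumption: a sigmoid squashes each similarity map into a soft mask.
    masks = 1.0 / (1.0 + np.exp(-similarity))
    return masks.reshape(-1, height, width)


rng = np.random.default_rng(0)
num_classes, dim, height, width = 3, 8, 4, 4
class_tokens = rng.normal(size=(num_classes, dim))  # stand-in for learnable tokens
features = rng.normal(size=(height * width, dim))   # stand-in for ViT patch features

masks = attention_to_mask(class_tokens, features, height, width)
label_map = masks.argmax(axis=0)  # most confident class at each location
```

In this sketch each class token yields one similarity map over the spatial grid, and taking the per-location argmax over the resulting masks gives a coarse label map at feature resolution.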

Code & Models

Repositories

zbwxp/SegVit
PyTorch · Official

Models

Akide/SegViTv1
Hugging Face model

Videos

SegViT: Semantic Segmentation with Plain Vision Transformers · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning