Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers

Yuanduo Hong, Jue Wang, Weichao Sun, and Huihui Pan

arXiv:2310.12755 · cs.CV · October 20, 2023 · 2 cites

TL;DR

This paper introduces PlainSeg, a minimalist yet high-performance semantic segmentation model using plain Vision Transformers, emphasizing simplicity, high-resolution features, and hierarchical features for improved efficiency and effectiveness.

Contribution

The paper presents PlainSeg, a simple, efficient baseline for semantic segmentation with plain ViTs, and provides insights into high-resolution features and hierarchical feature utilization.

Findings

01. PlainSeg achieves competitive performance on multiple benchmarks.

02. High-resolution features are key to high performance with simple up-sampling.

03. Hierarchical features further improve segmentation accuracy.

Abstract

In the wake of Masked Image Modeling (MIM), a diverse range of plain, non-hierarchical Vision Transformer (ViT) models have been pre-trained on extensive datasets, offering new paradigms and significant potential for semantic segmentation. Current state-of-the-art systems incorporate numerous inductive biases and employ cumbersome decoders. Building upon the original motivations of plain ViTs, which are simplicity and generality, we explore high-performance "minimalist" systems for this task. Our primary purpose is to provide simple and efficient baselines for practical semantic segmentation with plain ViTs. Specifically, we first explore the feasibility and methodology for achieving high-performance semantic segmentation using the last feature map. As a result, we introduce PlainSeg, a model comprising only three 3×3 convolutions in addition to the transformer layers (either…
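To give a rough sense of what "only three 3×3 convolutions" on top of the last feature map could mean in practice, the following back-of-the-envelope sketch counts the parameters of one hypothetical arrangement of such a head and computes the size of a plain ViT's token grid. The channel widths, class count, and input size are illustrative assumptions (a ViT-B/16-style backbone with 768-dimensional tokens, a 256-channel head, 150 classes, a 512×512 input); none of them are taken from the abstract, and the actual PlainSeg layout may differ.

```python
def conv2d_params(c_in, c_out, kernel=3, bias=True):
    """Number of weights (plus optional biases) in one k x k 2D convolution."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def last_feature_map_size(height, width, patch=16):
    """Spatial size of a plain ViT's token grid for a given input size."""
    if height % patch or width % patch:
        raise ValueError("input size must be divisible by the patch size")
    return height // patch, width // patch


# Illustrative assumptions, not values from the paper.
EMBED_DIM = 768     # token width of a ViT-B-style backbone
HEAD_DIM = 256      # hypothetical width of the convolutional head
NUM_CLASSES = 150   # hypothetical number of segmentation classes

# One hypothetical way to chain three 3x3 convolutions:
# backbone tokens -> head width -> head width -> class logits.
head_channels = [
    (EMBED_DIM, HEAD_DIM),
    (HEAD_DIM, HEAD_DIM),
    (HEAD_DIM, NUM_CLASSES),
]

head_params = sum(conv2d_params(c_in, c_out) for c_in, c_out in head_channels)
grid = last_feature_map_size(512, 512)

print("token grid:", grid)
print("head parameters:", head_params)
```

Under these assumptions the head has a few million parameters, and the last feature map is 1/16 of the input resolution along each side, which is why the handling of feature resolution and up-sampling matters for dense prediction.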

Code & Models

Repositories

ydhonghit/plainseg (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Residual Connection · Absolute Position Encodings · Adam · Byte Pair Encoding