Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers
Yuanduo Hong, Jue Wang, Weichao Sun, and Huihui Pan

TL;DR
This paper introduces PlainSeg, a minimalist yet high-performance semantic segmentation model using plain Vision Transformers, emphasizing simplicity, high-resolution features, and hierarchical features for improved efficiency and effectiveness.
Contribution
The paper presents PlainSeg, a simple, efficient baseline for semantic segmentation with plain ViTs, and provides insights into high-resolution features and hierarchical feature utilization.
Findings
PlainSeg achieves competitive performance on multiple benchmarks.
High-resolution features are key to high performance, even when they are obtained with simple up-sampling.
Hierarchical features further improve segmentation accuracy.
Abstract
In the wake of Masked Image Modeling (MIM), a diverse range of plain, non-hierarchical Vision Transformer (ViT) models have been pre-trained with extensive datasets, offering new paradigms and significant potential for semantic segmentation. Current state-of-the-art systems incorporate numerous inductive biases and employ cumbersome decoders. Building upon the original motivations of plain ViTs, which are simplicity and generality, we explore high-performance "minimalist" systems to this end. Our primary purpose is to provide simple and efficient baselines for practical semantic segmentation with plain ViTs. Specifically, we first explore the feasibility and methodology for achieving high-performance semantic segmentation using the last feature map. As a result, we introduce the PlainSeg, a model comprising only three 3×3 convolutions in addition to the transformer layers (either…
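To make the resolution and model-size claims in the abstract concrete, the following is a rough shape and parameter calculation, not the authors' implementation. The input size, patch size, up-sampling factor, channel widths, class count, and the way the three 3×3 convolutions are chained are all assumptions chosen for illustration; none of them is stated in this summary.

```python
def token_grid(image_hw, patch_size=16):
    """Spatial size of the last feature map of a plain ViT."""
    h, w = image_hw
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return h // patch_size, w // patch_size


def upsampled(grid_hw, factor):
    """Spatial size after up-sampling a feature map by an integer factor."""
    return grid_hw[0] * factor, grid_hw[1] * factor


def conv3x3_params(c_in, c_out, bias=True):
    """Number of learnable parameters in one 3x3 convolution."""
    return 3 * 3 * c_in * c_out + (c_out if bias else 0)


# Assumed values for illustration only (not taken from the paper).
image_hw = (512, 512)
embed_dim = 768     # width of a ViT-Base backbone
head_dim = 256      # hypothetical width of the extra convolutions
num_classes = 150   # hypothetical number of segmentation classes
up_factor = 4       # hypothetical up-sampling factor

grid = token_grid(image_hw)            # last feature map: 1/16 of the input
high_res = upsampled(grid, up_factor)  # high-resolution map: 1/4 of the input

# One hypothetical way to chain three 3x3 convolutions on top of the ViT.
extra_params = (
    conv3x3_params(embed_dim, head_dim)
    + conv3x3_params(head_dim, head_dim)
    + conv3x3_params(head_dim, num_classes)
)

print(grid, high_res, extra_params)
```

Under these assumptions the three convolutions add roughly 2.7 million parameters, a small fraction of a ViT-Base backbone, which is consistent with the "minimalist" framing. The real figures depend on the configuration described in the paper.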
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Residual Connection · Absolute Position Encodings · Adam · Byte Pair Encoding
