Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Sixiao Zheng; Jiachen Lu; Hengshuang Zhao; Xiatian Zhu; Zekun Luo; Yabiao Wang; Yanwei Fu; Jianfeng Feng; Tao Xiang; Philip H.S. Torr; Li Zhang

arXiv:2012.15840·cs.CV·July 27, 2021·198 cites

Open Access 5 Repos 1 Models

TL;DR

This paper treats semantic segmentation as a sequence-to-sequence prediction task and introduces SETR, a pure transformer model that reports 50.28% mIoU on ADE20K and 55.83% mIoU on Pascal Context.

Contribution

Proposes SETR, a pure transformer model for semantic segmentation that encodes an image as a sequence of patches, offering an alternative to the FCN-based encoder-decoder design and ranking first on the ADE20K test leaderboard at submission time.

Findings

01 SETR achieves 50.28% mIoU on ADE20K.

02 SETR attains 55.83% mIoU on Pascal Context.

03 SETR ranked first on the ADE20K test leaderboard at submission time.
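
The findings above are reported in mIoU (mean intersection over union). For readers unfamiliar with the metric, here is a minimal sketch of how it can be computed on flat label lists. The function name `mean_iou` and the toy labels are illustrative assumptions, not taken from the paper; benchmark implementations typically accumulate intersections and unions over a whole dataset rather than a single sample.

```python
def mean_iou(predicted, target, num_classes):
    """Average per-class IoU over classes present in the prediction or the target."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    ious = []
    for cls in range(num_classes):
        # Pixels where both prediction and ground truth are this class.
        intersection = sum(p == cls and t == cls for p, t in zip(predicted, target))
        # Pixels where either prediction or ground truth is this class.
        union = sum(p == cls or t == cls for p, t in zip(predicted, target))
        if union > 0:  # skip classes that appear in neither label list
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in either label list")
    return sum(ious) / len(ious)


# Toy example with three hypothetical classes and six pixels.
predicted = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(predicted, target, num_classes=3)
print(f"mIoU = {score:.2%}")
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.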

Abstract

Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be…
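
The idea of encoding an image as a sequence of patches can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function name `image_to_patch_sequence`, the 32x32 toy image, and the patch size of 16 are assumptions chosen for illustration, and the sketch covers only the flattening step, not the learned embedding or the transformer layers.

```python
import numpy as np


def image_to_patch_sequence(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Flatten an (H, W, C) image into a (num_patches, patch_size * patch_size * C) sequence."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Split each spatial axis into (grid index, offset within the patch).
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Move both grid axes to the front so each patch's pixels are grouped together.
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in raster (row-major) order.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


# Toy input: a 32x32 RGB image split into 16x16 patches (both sizes are assumptions).
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
sequence = image_to_patch_sequence(image, patch_size=16)
print(sequence.shape)  # (4, 768)
```

Each row of `sequence` is one flattened patch, so a transformer encoder would see this image as a sequence of length 4. Because no downsampling is involved, the sequence length depends only on the image size and the patch size.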

Code & Models

Repositories

Models

🤗 mccaly/test2 · model · 12 downloads · 1 like

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Residual Connection · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Segmentation Transformer · Max Pooling