Self-attention on Multi-Shifted Windows for Scene Segmentation

Litao Yu; Zhibin Li; Jian Zhang; Qiang Wu

arXiv:2207.04403·cs.CV·July 12, 2022

TL;DR

This paper applies self-attention over multi-shifted windows in Swin Transformer models to capture multi-scale features for scene segmentation, and reports promising results on multiple datasets.

Contribution

The paper proposes three strategies for aggregating multi-scale self-attention features in Swin Transformer-based scene segmentation models, which use no convolution operations.

Findings

1. Achieves state-of-the-art performance on four public datasets.

2. Demonstrates the effectiveness of multi-scale self-attention in dense prediction.

3. Outperforms existing methods with simple multi-scale feature aggregation.

Abstract

Scene segmentation in images is a fundamental yet challenging problem in visual content understanding: the goal is to learn a model that assigns every image pixel a categorical label. One of the challenges of this learning task is to account for spatial and semantic relationships in order to obtain descriptive feature representations, so learning feature maps at multiple scales is common practice in scene segmentation. In this paper, we explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features, then propose three different strategies to aggregate these feature maps and decode the feature representation for dense prediction. Our design is based on the recently proposed Swin Transformer models, which discard convolution operations entirely. With the simple yet effective multi-scale feature learning and aggregation, our models achieve…
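The idea of computing self-attention inside image windows of several sizes and then fusing the results can be illustrated with a small sketch. This is not the authors' implementation: it uses plain non-overlapping windows without shifting, identity projections in place of learned query/key/value weights, and simple averaging as the aggregation step. All function names (`attend`, `window_attention`, `multi_scale_attention`) and the window sizes are assumptions chosen for illustration, and the page does not describe the paper's three aggregation strategies in enough detail to reproduce them here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Assumption: no learned projections, so this only shows the mixing step.
    """
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        mixed.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return mixed


def window_attention(fmap, window):
    """Run self-attention independently inside each window x window tile."""
    height, width = len(fmap), len(fmap[0])
    assert height % window == 0 and width % window == 0, "map must tile evenly"
    out = [[None] * width for _ in range(height)]
    for top in range(0, height, window):
        for left in range(0, width, window):
            coords = [(top + i, left + j) for i in range(window) for j in range(window)]
            attended = attend([fmap[r][c] for r, c in coords])
            for (r, c), vec in zip(coords, attended):
                out[r][c] = vec
    return out


def multi_scale_attention(fmap, windows=(2, 4)):
    """Average the outputs of several window sizes (one simple fusion choice)."""
    branches = [window_attention(fmap, w) for w in windows]
    height, width, dim = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(b[r][c][d] for b in branches) / len(branches) for d in range(dim)]
            for c in range(width)
        ]
        for r in range(height)
    ]


# Toy 4x4 feature map with 2 channels per pixel (hypothetical data).
fmap = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
fused = multi_scale_attention(fmap)
```

The fused map keeps the input's spatial size, so a per-pixel classifier could be applied on top of it for dense prediction. Larger windows let each pixel mix information from a wider neighbourhood, while smaller windows keep the mixing local; averaging is only one way to combine the two.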

Code & Models

Repositories

yutao1008/mswin (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Label Smoothing · Convolution