Self-attention on Multi-Shifted Windows for Scene Segmentation
Litao Yu, Zhibin Li, Jian Zhang, Qiang Wu

TL;DR
This paper applies self-attention over multi-shifted image windows of different scales in Swin Transformer models, so that multi-scale features are captured for scene segmentation, and reports promising results on four public datasets.
Contribution
It proposes three strategies for aggregating multi-scale self-attention features for scene segmentation. The design builds on Swin Transformer models, which discard convolution operations entirely.
Findings
Achieves state-of-the-art performance on four public datasets.
Demonstrates the effectiveness of multi-scale self-attention in dense prediction.
Outperforms existing methods with simple multi-scale feature aggregation.
Abstract
Scene segmentation in images is a fundamental yet challenging problem in visual content understanding, where the goal is to learn a model that assigns every image pixel a categorical label. One of the challenges for this learning task is to consider the spatial and semantic relationships to obtain descriptive feature representations, so learning the feature maps from multiple scales is a common practice in scene segmentation. In this paper, we explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features, then propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction. Our design is based on the recently proposed Swin Transformer models, which discard convolution operations entirely. With the simple yet effective multi-scale feature learning and aggregation, our models achieve…
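The idea of computing self-attention inside image windows of several sizes and then merging the results can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a single attention head, uses the raw features as queries, keys, and values (no learned projections), omits the attention mask that Swin Transformer applies to tokens wrapped around by the cyclic shift, and fuses the scales by plain averaging, which is a stand-in and not one of the paper's three aggregation strategies. The function names and the window sizes (2, 4, 8) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_self_attention(feat, window, shift=0):
    """Self-attention restricted to non-overlapping window x window tiles.

    feat:   (H, W, C) feature map; H and W must be divisible by `window`.
    shift:  cyclic shift applied before partitioning and undone afterwards.
    Simplifications (assumptions): one head, no learned Q/K/V projections,
    no mask for tokens that wrap around after the shift.
    """
    H, W, C = feat.shape
    assert H % window == 0 and W % window == 0
    x = np.roll(feat, shift=(-shift, -shift), axis=(0, 1))

    # Partition: (H, W, C) -> (num_windows, window * window, C)
    x = x.reshape(H // window, window, W // window, window, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)

    # Scaled dot-product attention among the tokens of each window.
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(C)
    out = softmax(scores, axis=-1) @ x

    # Merge windows back into an (H, W, C) map and undo the shift.
    out = out.reshape(H // window, W // window, window, window, C)
    out = out.transpose(0, 2, 1, 3, 4).reshape(H, W, C)
    return np.roll(out, shift=(shift, shift), axis=(0, 1))


def multi_window_features(feat, windows=(2, 4, 8)):
    # One attention pass per window size, each shifted by half a window,
    # fused here by simple averaging (a placeholder aggregation).
    outs = [window_self_attention(feat, w, shift=w // 2) for w in windows]
    return np.mean(outs, axis=0)


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))
fused = multi_window_features(feat)
print(fused.shape)
```

Because each window size sees a different spatial extent, the fused map mixes local and wider context for every pixel while keeping the input resolution, which is the property dense prediction needs.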
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Label Smoothing · Convolution
