Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention
Haotian Yan, Chuang Zhang, Ming Wu

TL;DR
The Lawin Transformer introduces a multi-scale window attention mechanism combined with spatial pyramid pooling to enhance semantic segmentation performance and efficiency in vision transformers.
Contribution
It proposes large window attention and the LawinASPP decoder, which let semantic segmentation ViTs capture multi-scale context with minimal computational overhead.
Findings
Achieves state-of-the-art results on Cityscapes, ADE20K, and COCO-Stuff datasets.
Demonstrates improved efficiency over existing methods.
Sets new benchmarks in semantic segmentation performance.
Abstract
Multi-scale representations are crucial for semantic segmentation. The community has witnessed a flourishing of semantic segmentation convolutional neural networks (CNNs) that exploit multi-scale contextual information. Motivated by the power of the vision transformer (ViT) in image classification, several semantic segmentation ViTs have recently been proposed, most of them attaining impressive results but at the cost of computational economy. In this paper, we succeed in introducing multi-scale representations into semantic segmentation ViTs via a window attention mechanism, further improving both performance and efficiency. To this end, we introduce large window attention, which allows the local window to query a larger area of context window at only a small computational overhead. By regulating the ratio of the context area to the query area, we enable the to…
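The core idea in the abstract, a local query window attending to a larger context window around it, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function name `large_window_attention`, the parameters `patch` and `ratio`, the zero padding at image borders, and the absence of learned projections and multiple heads are all assumptions made for illustration.

```python
import numpy as np

def large_window_attention(feat, patch=4, ratio=2):
    """Illustrative single-head attention (hypothetical helper, not the paper's code).

    feat:  (H, W, C) feature map, with H and W divisible by `patch`.
    patch: side length of each query window.
    ratio: context side length divided by query side length.
    """
    feat = np.asarray(feat, dtype=float)
    H, W, C = feat.shape
    ctx = ratio * patch                    # side length of the context window
    lo = (ctx - patch) // 2                # padding before each spatial axis
    hi = ctx - patch - lo                  # padding after each spatial axis
    # Assumption: zero-pad so border windows still get a full-size context.
    padded = np.pad(feat, ((lo, hi), (lo, hi), (0, 0)))
    out = np.empty_like(feat)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            q = feat[i:i + patch, j:j + patch].reshape(-1, C)   # (P*P, C) queries
            kv = padded[i:i + ctx, j:j + ctx].reshape(-1, C)    # (ctx*ctx, C) keys/values
            scores = q @ kv.T / np.sqrt(C)                      # scaled dot product
            scores -= scores.max(axis=1, keepdims=True)         # stabilise the softmax
            weights = np.exp(scores)
            weights /= weights.sum(axis=1, keepdims=True)
            out[i:i + patch, j:j + patch] = (weights @ kv).reshape(patch, patch, C)
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16, 8))
for r in (1, 2, 4):                        # several context-to-query ratios
    out = large_window_attention(feat, patch=4, ratio=r)
    print(r, out.shape, "keys per query:", (r * 4) ** 2)
```

With `ratio=1` this reduces to plain window attention; larger ratios widen the context each query sees. In this naive form every query is compared against `ratio**2` times as many keys, so the sketch does not reproduce whatever the paper does to keep the overhead small, which this excerpt does not describe.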
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Softmax · Adam · Position-Wise Feed-Forward Layer · Dropout · Multi-Head Attention
