Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Haotian Yan; Chuang Zhang; Ming Wu

arXiv:2201.01615 · cs.CV · August 10, 2023 · 48 cites

TL;DR

The Lawin Transformer introduces a multi-scale window attention mechanism combined with spatial pyramid pooling to enhance semantic segmentation performance and efficiency in vision transformers.

Contribution

It proposes large window attention and the LawinASPP decoder, enabling multi-scale context capture with minimal computational overhead in semantic segmentation ViTs.

Findings

01. Achieves state-of-the-art results on Cityscapes, ADE20K, and COCO-Stuff datasets.

02. Demonstrates improved efficiency over existing methods.

03. Sets new benchmarks in semantic segmentation performance.

Abstract

Multi-scale representations are crucial for semantic segmentation. The community has witnessed a flourishing of semantic segmentation convolutional neural networks (CNNs) that exploit multi-scale contextual information. Motivated by the vision transformer's (ViT) strength in image classification, several semantic segmentation ViTs have recently been proposed, most of them attaining impressive results but at the cost of computational economy. In this paper, we introduce multi-scale representations into semantic segmentation ViTs via a window attention mechanism and further improve both performance and efficiency. To this end, we introduce large window attention, which allows the local window to query a larger area of context window at only a little computational overhead. By regulating the ratio of the context area to the query area, we enable large window attention to…
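
The idea of a local query window attending to a larger context window can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`crop_context`, `avg_pool`, `attend`, `large_window_attention`), the parameters `p` (query window side) and `r` (context-to-query side ratio), the zero padding at image borders, and the average pooling of the context back to the query size are all assumptions made for illustration. The excerpt above does not say how the overhead is kept small, and learned projections and multiple heads are left out.

```python
import math

def crop_context(fmap, top, left, p, r):
    """Return the (r*p) x (r*p) context window centred on the p x p query
    window whose top-left corner is (top, left). Cells outside the feature
    map are zero-filled (an assumption of this sketch)."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    pad = (r * p - p) // 2
    out = []
    for i in range(top - pad, top - pad + r * p):
        row = []
        for j in range(left - pad, left - pad + r * p):
            inside = 0 <= i < h and 0 <= j < w
            row.append(list(fmap[i][j]) if inside else [0.0] * c)
        out.append(row)
    return out

def avg_pool(window, r):
    """Average-pool non-overlapping r x r blocks of a square window."""
    n, c = len(window) // r, len(window[0][0])
    pooled = []
    for i in range(n):
        row = []
        for j in range(n):
            cell = [0.0] * c
            for di in range(r):
                for dj in range(r):
                    for k in range(c):
                        cell[k] += window[i * r + di][j * r + dj][k]
            row.append([v / (r * r) for v in cell])
        pooled.append(row)
    return pooled

def attend(queries, keys):
    """Scaled dot-product attention; keys double as values, no projections."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(wt * k[d] for wt, k in zip(weights, keys)) / z
                    for d in range(len(q))])
    return out

def large_window_attention(fmap, top, left, p, r):
    """Tokens of one p x p query window attend to a pooled context window."""
    query = [fmap[i][j] for i in range(top, top + p)
             for j in range(left, left + p)]
    context = avg_pool(crop_context(fmap, top, left, p, r), r)
    keys = [cell for row in context for cell in row]
    return attend(query, keys)

# Toy 8 x 8 feature map with 2 channels per position.
fmap = [[[float(i), float(j)] for j in range(8)] for i in range(8)]
out = large_window_attention(fmap, top=2, left=2, p=2, r=2)
print(len(out), len(out[0]))  # 4 query tokens, 2 channels each
```

In this sketch, raising `r` widens the area each query window sees, while pooling keeps the number of keys per query token at `p * p` rather than `(r * p) ** 2`. Running it with several values of `r` would give one output per context scale.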

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Softmax · Adam · Position-Wise Feed-Forward Layer · Dropout · Multi-Head Attention