Multi-Scale Representations by Varying Window Attention for Semantic Segmentation

Haotian Yan; Ming Wu; Chuang Zhang

arXiv:2404.16573 · cs.CV · April 29, 2024

TL;DR

This paper introduces VWA, a novel multi-scale attention mechanism that effectively captures multi-scale features in semantic segmentation without extra computational cost, and proposes VWFormer, a new decoder that outperforms existing methods.

Contribution

The paper presents VWA, a scale-varying window attention method, and VWFormer, a multi-scale decoder, advancing multi-scale learning in semantic segmentation with efficiency and improved accuracy.

Findings

01

VWA effectively captures multi-scale features without increasing computational cost.

02

VWFormer outperforms existing decoders like FPN and MLP with less computation.

03

The approach achieves 1.0%-2.5% higher mIoU on ADE20K compared to UPerNet.
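For readers unfamiliar with the metric behind the third finding, the following is a minimal sketch of mean intersection-over-union (mIoU) on flat label lists. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; benchmark toolkits such as those used for ADE20K accumulate a confusion matrix over the whole dataset, which this sketch does not do.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class labels (one per pixel).
    Hypothetical helper for illustration, not taken from the paper's code.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both (intersection) or in either one (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both (assumed convention)
            ious.append(inter / union)

    return sum(ious) / len(ious) if ious else 0.0


# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

A gain of 1.0 to 2.5 mIoU points therefore means the per-class IoU, averaged over all classes, rises by that amount.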

Abstract

Multi-scale learning is central to semantic segmentation. We visualize the effective receptive field (ERF) of canonical multi-scale representations and point out two risks in learning them: scale inadequacy and field inactivation. A novel multi-scale learner, varying window attention (VWA), is presented to address these issues. VWA leverages the local window attention (LWA) and disentangles LWA into the query window and context window, allowing the context's scale to vary for the query to learn representations at multiple scales. However, varying the context to large-scale windows (enlarging ratio R) can significantly increase the memory footprint and computation cost (R^2 times larger than LWA). We propose a simple but professional re-scaling strategy to zero the extra induced cost without compromising performance. Consequently, VWA uses the same cost as LWA to overcome the receptive…
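The R^2 cost factor mentioned in the abstract can be checked with a small counting sketch. It assumes a square query window of side `window`, a context window of side `ratio * window`, and counts only the two matrix products of attention (query-key scores and the weighted sum over values). The "rescaled" case assumes the context is reduced back to `window x window` tokens before attention; that is one plausible reading of the re-scaling strategy, not a description of the authors' implementation, and all function names here are hypothetical.

```python
def attention_macs(num_query_tokens, num_context_tokens, dim):
    """Multiply-accumulates for QK^T plus the attention-weighted sum over V."""
    return 2 * num_query_tokens * num_context_tokens * dim


def window_costs(window=8, ratio=2, dim=64):
    """Per-window attention cost for three settings.

    Returns (lwa, vwa_naive, vwa_rescaled):
      lwa          - query and context are the same window (local window attention)
      vwa_naive    - context window enlarged by `ratio` along each side
      vwa_rescaled - enlarged context reduced back to window*window tokens
                     (assumed form of re-scaling, for illustration only)
    """
    q_tokens = window * window
    ctx_tokens = (ratio * window) ** 2

    lwa = attention_macs(q_tokens, q_tokens, dim)
    vwa_naive = attention_macs(q_tokens, ctx_tokens, dim)
    vwa_rescaled = attention_macs(q_tokens, q_tokens, dim)
    return lwa, vwa_naive, vwa_rescaled


lwa, naive, rescaled = window_costs(window=8, ratio=2, dim=64)
# naive == ratio**2 * lwa, and rescaled == lwa
```

The sketch ignores the cost of the reduction step itself and of the linear projections, so it only illustrates why the attention term grows with R^2 and how shrinking the context token count removes that growth.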

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Semantic Web and Ontologies · Image Retrieval and Classification Techniques

Methods: 1x1 Convolution · Convolution · Feature Pyramid Network