Context-Aware Semantic Segmentation via Stage-Wise Attention

Antoine Carreaud; Elias Naha; Arthur Chansel; Nina Lahellec; Jan Skaloud; Adrien Gressin

arXiv:2601.11310·cs.CV·April 14, 2026

TL;DR

CASWiT is a novel dual-branch transformer architecture that enhances ultra-high-resolution semantic segmentation by integrating multi-scale context through stage-wise cross-attention and a masked reconstruction pretraining strategy.
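
As a rough illustration of what cross-attention between the two branches involves, the Python sketch below lets each high-resolution token attend to a small set of low-resolution context tokens and adds the result back residually. It is a simplified picture, not the authors' implementation: the names `cross_attend` and `stage_wise_fusion` are hypothetical, there is a single head, and identity projections stand in for learned query, key, and value weights.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(hr_tokens: List[Vector], lr_tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product cross-attention with a residual connection.

    Queries come from the high-resolution tokens; keys and values come from
    the low-resolution context tokens. Assumption: identity projections and
    one head, purely for illustration.
    """
    dim = len(hr_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in hr_tokens:
        # One score per context token, so the score matrix is |HR| x |LR|.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in lr_tokens]
        weights = softmax(scores)
        context = [
            sum(w * value[d] for w, value in zip(weights, lr_tokens))
            for d in range(dim)
        ]
        # Residual add: the high-resolution feature is kept and enriched.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


def stage_wise_fusion(
    hr_stages: List[List[Vector]], lr_stages: List[List[Vector]]
) -> List[List[Vector]]:
    """Apply cross-attention once per encoder stage (hypothetical stage lists)."""
    return [cross_attend(hr, lr) for hr, lr in zip(hr_stages, lr_stages)]
```

In this sketch the attention scores have one entry per (high-resolution token, context token) pair, so their count scales with the product of the two token counts rather than with the square of the high-resolution count.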

Contribution

The paper introduces CASWiT, a stage-wise attention-based transformer model with a new pretraining method, improving UHR segmentation accuracy and boundary quality in remote sensing images.

Findings

1. CASWiT achieves 66.37% mIoU on FLAIR-HUB with RGB-only input.

2. CASWiT outperforms strong RGB baselines in boundary quality.

3. CASWiT transfers effectively to medical UHR segmentation benchmarks.
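
For reference, mIoU is conventionally the per-class intersection-over-union averaged over classes. The sketch below computes it from a confusion matrix in plain Python. The function name and the toy matrix are illustrative, and the paper's exact evaluation protocol (class list, ignored labels) is not described in this summary.

```python
from typing import List


def mean_iou(confusion: List[List[int]]) -> float:
    """Mean IoU from a square confusion matrix.

    Rows index the ground-truth class, columns the predicted class.
    Classes that appear in neither ground truth nor prediction are skipped.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        if union > 0:
            ious.append(true_pos / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example (made-up pixel counts, not data from the paper).
toy_confusion = [
    [50, 10],
    [5, 35],
]
print(f"mIoU = {100 * mean_iou(toy_confusion):.2f}%")
```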

Abstract

Semantic ultra-high-resolution (UHR) image segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models remain challenging in this setting because memory grows quadratically with the number of tokens, limiting either spatial resolution or contextual scope. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch Swin-based architecture that injects low-resolution contextual information into fine-grained high-resolution features through lightweight stage-wise cross-attention. To strengthen cross-scale learning, we also propose a SimMIM-style pretraining strategy based on masked reconstruction of the high-resolution image. Extensive experiments on the large-scale FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Under our RGB-only UHR protocol, CASWiT reaches 66.37% mIoU with a…
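
To make the quadratic-growth remark concrete, the short sketch below counts tokens and pairwise attention entries for square images under global self-attention. The patch size of 4 and the image sizes are illustrative assumptions, not values taken from the paper, and `attention_cost` is a hypothetical helper name.

```python
from typing import Tuple


def attention_cost(height: int, width: int, patch: int = 4) -> Tuple[int, int]:
    """Return (token count, pairwise attention entries) for global self-attention.

    Assumes non-overlapping square patches of side `patch` (illustrative value).
    """
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens


for side in (512, 1024, 2048):
    tokens, entries = attention_cost(side, side)
    print(f"{side}x{side}: {tokens:,} tokens, {entries:,} attention entries")
```

Doubling the side length quadruples the token count and multiplies the number of attention entries by 16, which is why a model must usually trade spatial resolution against contextual scope.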

Code & Models

Repositories

https://huggingface.co/collections/heig-vd-geo/caswit

Models

heig-vd-geo/CASWiT (Hugging Face model)
