TL;DR
CASWiT is a dual-branch, Swin-based transformer for ultra-high-resolution (UHR) semantic segmentation. It injects low-resolution context into high-resolution features through stage-wise cross-attention and is pretrained with SimMIM-style masked reconstruction.
Contribution
The paper introduces CASWiT, a stage-wise attention-based transformer model with a new pretraining method, improving UHR segmentation accuracy and boundary quality in remote sensing images.
Findings
CASWiT achieves 66.37% mIoU on FLAIR-HUB with RGB-only input.
CASWiT outperforms strong RGB baselines in boundary quality.
CASWiT transfers effectively to medical UHR segmentation benchmarks.
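The headline number above is a mean Intersection-over-Union (mIoU) score. As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration. The paper's exact evaluation protocol (class set, dataset-level aggregation) is not described in this summary and may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred, target: flat sequences of integer class labels of equal length.
    Classes absent from both are skipped (an assumption of this sketch).
    """
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both pred and target.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either pred or target.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so their mean is 0.5.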
Abstract
Semantic ultra-high-resolution (UHR) image segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models remain challenging in this setting because memory grows quadratically with the number of tokens, limiting either spatial resolution or contextual scope. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch Swin-based architecture that injects low-resolution contextual information into fine-grained high-resolution features through lightweight stage-wise cross-attention. To strengthen cross-scale learning, we also propose a SimMIM-style pretraining strategy based on masked reconstruction of the high-resolution image. Extensive experiments on the large-scale FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Under our RGB-only UHR protocol, CASWiT reaches 66.37% mIoU with a…
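To make the cross-attention idea in the abstract concrete, the following is a minimal, dependency-free sketch of how high-resolution tokens could query a small set of low-resolution context tokens. It is not the authors' implementation: the function names, the single attention head, the identity Q/K/V projections, and the residual addition are all simplifying assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(hr_tokens, lr_tokens):
    """Single-head cross-attention with a residual connection.

    hr_tokens: N_hr feature vectors from the high-resolution branch (queries).
    lr_tokens: N_lr feature vectors from the low-resolution branch (keys, values).
    Learned projections are omitted (identity) to keep the sketch short.
    Returns (fused_tokens, attention_weights).
    """
    dim = len(hr_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for q in hr_tokens:
        # Scaled dot-product score of this query against every context token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in lr_tokens]
        attn = softmax(scores)
        # Attention-weighted average of the context (value) vectors.
        context = [sum(a * v[d] for a, v in zip(attn, lr_tokens)) for d in range(dim)]
        # Residual: high-resolution feature plus injected context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(attn)
    return fused, weights


# Toy inputs: 4 high-resolution tokens, 2 low-resolution context tokens (hypothetical).
hr = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
lr = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = cross_attention(hr, lr)
print(len(weights), "x", len(weights[0]), "attention matrix")
```

The attention matrix in this sketch has N_hr × N_lr entries rather than N_hr × N_hr, which is one plausible reading of why the abstract calls the cross-attention lightweight: the context branch contributes far fewer tokens than the high-resolution branch.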
