SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation
Guoan Xu, Jiaming Chen, Wenfeng Huang, Wenjing Jia, Guangwei Gao, and Guo-Jun Qi

TL;DR
SCASeg is a decoder head for semantic segmentation that uses strip cross-attention and hierarchical feature integration to improve both segmentation performance and computational efficiency.
Contribution
The paper proposes SCASeg, a decoder head for semantic segmentation built from strip cross-attention and Cross-Layer Blocks (CLB), and reports that it outperforms existing segmentation architectures.
Findings
Outperforms leading segmentation architectures on multiple benchmarks.
Reduces memory usage and increases inference speed compared to vanilla cross-attention.
Effectively captures global and local context dependencies across layers.
Abstract
The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants widely validated across various downstream tasks, including semantic segmentation. However, as general-purpose visual encoders, ViT backbones often do not fully address the specific requirements of task decoders, highlighting opportunities for designing decoders optimized for efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head specifically designed for semantic segmentation. Instead of relying on the conventional skip connections, we utilize lateral connections between encoder and decoder stages, leveraging encoder features as Queries in cross-attention modules. Additionally, we introduce a Cross-Layer Block (CLB) that integrates hierarchical feature maps from various encoder and decoder stages to form a unified representation for…
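The lateral-connection idea in the abstract, where encoder features act as Queries in cross-attention, can be illustrated with a minimal sketch. This is plain (vanilla) single-head cross-attention without learned projections, not the paper's strip variant, whose details are not given in this excerpt. The function name `cross_attention`, the toy inputs, and the choice to take Keys and Values from decoder features are assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(encoder_tokens, decoder_tokens):
    """Vanilla single-head cross-attention (illustrative, no learned weights).

    encoder_tokens: list of N_q feature vectors, used as Queries.
    decoder_tokens: list of N_k feature vectors, used as Keys and Values
                    (an assumption; the excerpt only says encoder
                    features serve as Queries).
    Returns N_q vectors, each a softmax-weighted mix of decoder_tokens.
    """
    dim = len(encoder_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in encoder_tokens:
        # Scaled dot-product score between this query and every key.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in decoder_tokens
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one channel at a time.
        attended = [
            sum(w * value[c] for w, value in zip(weights, decoder_tokens))
            for c in range(dim)
        ]
        output.append(attended)
    return output


# Toy example: 3 encoder tokens query 2 decoder tokens, 2 channels each.
encoder_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_tokens = [[2.0, 0.0], [0.0, 2.0]]
out = cross_attention(encoder_tokens, decoder_tokens)
```

The score matrix here has one entry per query-key pair, so its size grows with the product of the two token counts. That cost is what the summary above contrasts SCASeg against when it reports lower memory use and faster inference than vanilla cross-attention.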
