SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation

Guoan Xu, Jiaming Chen, Wenfeng Huang, Wenjing Jia, Guangwei Gao, and Guo-Jun Qi

arXiv:2411.17061·cs.CV·April 24, 2026

TL;DR

SCASeg is an efficient decoder head for semantic segmentation that uses strip cross-attention and hierarchical feature integration to improve segmentation performance and computational efficiency.

Contribution

The paper proposes SCASeg, a decoder head for semantic segmentation built from strip cross-attention and Cross-Layer Blocks, and reports that it outperforms leading segmentation architectures.

Findings

01. Outperforms leading segmentation architectures on multiple benchmarks.

02. Reduces memory usage and increases inference speed compared to vanilla cross-attention.

03. Effectively captures global and local context dependencies across layers.

Abstract

The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants widely validated across downstream tasks, including semantic segmentation. However, as general-purpose visual encoders, ViT backbones often do not fully address the specific requirements of task decoders, highlighting opportunities for designing decoders optimized for efficient semantic segmentation. This paper proposes Strip Cross-Attention (SCASeg), an innovative decoder head specifically designed for semantic segmentation. Instead of relying on conventional skip connections, we utilize lateral connections between encoder and decoder stages, leveraging encoder features as Queries in cross-attention modules. Additionally, we introduce a Cross-Layer Block (CLB) that integrates hierarchical feature maps from various encoder and decoder stages to form a unified representation for…
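To make the role of encoder features as Queries concrete, here is a minimal sketch of plain single-head cross-attention in NumPy, in which Queries are projected from a flattened encoder feature map and Keys and Values from a flattened decoder feature map. This is an illustration under stated assumptions, not the authors' implementation: the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are hypothetical, and the strip compression and the Cross-Layer Block are not shown because this excerpt does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(encoder_feat, decoder_feat, w_q, w_k, w_v):
    """Vanilla single-head cross-attention (illustrative only).

    encoder_feat: (N_enc, C) flattened encoder feature map, source of Queries.
    decoder_feat: (N_dec, C) flattened decoder feature map, source of Keys/Values.
    w_q, w_k, w_v: (C, D) hypothetical projection matrices.
    """
    q = encoder_feat @ w_q                    # (N_enc, D)
    k = decoder_feat @ w_k                    # (N_dec, D)
    v = decoder_feat @ w_v                    # (N_dec, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N_enc, N_dec) scaled dot products
    attn = softmax(scores, axis=-1)           # each Query's weights sum to 1
    return attn @ v, attn                     # output has one row per encoder token


rng = np.random.default_rng(0)
C, D = 32, 16
enc = rng.standard_normal((64, C))  # e.g. an 8x8 encoder map, flattened
dec = rng.standard_normal((16, C))  # e.g. a 4x4 decoder map, flattened
w_q, w_k, w_v = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))

out, attn = cross_attention(enc, dec, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Because the Queries come from the encoder, the output has one row per encoder position, so it keeps the encoder map's spatial resolution. The attention matrix here has N_enc × N_dec entries, which is the cost of the vanilla form that the findings above compare against.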
