Semantic-Aware Local-Global Vision Transformer
Jiatong Zhang, Zengwei Yao, Fanglin Chen, Guangming Lu, and Wenjie Pei

TL;DR
The paper introduces SALG, a vision transformer that incorporates unsupervised semantic segmentation and local-global attention mechanisms, improving feature learning especially in small-scale models.
Contribution
SALG advances vision transformers by integrating semantic priors through unsupervised segmentation and combining local and global attention for enhanced feature representation.
Findings
Outperforms other vision Transformers on various tasks.
Excels particularly in small-scale model scenarios.
Demonstrates the effectiveness of semantic-aware local-global modeling.
Abstract
Vision Transformers have achieved remarkable progress, and Swin Transformer in particular has demonstrated the tremendous potential of Transformers for vision tasks. It surmounts the key challenge of high computational complexity by performing local self-attention within shifted windows. In this work we propose the Semantic-Aware Local-Global Vision Transformer (SALG) to investigate two potential improvements over Swin Transformer. First, unlike Swin Transformer, which partitions the image uniformly into regular windows of equal size for local self-attention, SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image. As a result, each segmented region can correspond to a semantically meaningful part of the image, potentially leading to more effective features within each segmented region. Second, instead of only…
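The first idea in the abstract, local self-attention restricted to segmented regions rather than fixed square windows, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the region labels are already given (the paper obtains them from an unsupervised segmentation step that is not reproduced here), uses a single head with no learned projections, and the function name `region_local_attention` is hypothetical.

```python
import numpy as np


def region_local_attention(x, labels):
    """Single-head self-attention where each token attends only to
    tokens that share its region label.

    x:      (N, d) array of token features.
    labels: (N,) integer array, one region id per token (assumed given).

    Queries, keys and values are the raw features; a real model would
    apply learned linear projections first.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # (N, N) similarity
    same_region = labels[:, None] == labels[None, :]   # (N, N) boolean mask
    scores = np.where(same_region, scores, -np.inf)    # block cross-region pairs
    # Every row keeps its diagonal entry, so the row maximum is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


# Toy example: 6 tokens, 3 irregular "regions" of sizes 2, 3 and 1.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
labels = np.array([0, 0, 1, 1, 1, 2])
out, weights = region_local_attention(x, labels)
```

With uniform windows, `labels` would simply index a regular grid; the mask is what lets regions take arbitrary shapes and sizes. A token that is alone in its region (the last one above) attends only to itself, so its output equals its input.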
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Stochastic Depth · Softmax · Adam · Dropout · Byte Pair Encoding · Swin Transformer · Position-Wise Feed-Forward Layer · Label Smoothing
