ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer
Rui Yang, Hailong Ma, Jie Wu, Yansong Tang, Xuefeng Xiao, Min Zheng, Xiu Li

TL;DR
ScalableViT introduces a scalable self-attention mechanism and an interactive window-based approach to improve global context understanding and object sensitivity in vision transformers, achieving state-of-the-art results.
Contribution
It proposes two attention modules, Scalable Self-Attention (SSA) and Interactive Window-based Self-Attention (IWSA), to improve context-oriented generalization in vision transformers.
Findings
Outperforms Twins-SVT-S by 1.4% on ImageNet-1K
Outperforms Swin-T by 1.8% on ImageNet-1K
Achieves state-of-the-art performance in vision tasks
Abstract
The vanilla self-attention mechanism inherently relies on pre-defined and steadfast computational dimensions. Such inflexibility restricts it from possessing context-oriented generalization that can bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release dimensions of query, key, and value matrices while unbinding them with the input. This scalability fetches context-oriented generalization and enhances object sensitivity, which pushes the whole network into a more effective trade-off state between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking the SSA…
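The following is a minimal sketch of one way to read the SSA idea, not the authors' implementation. The abstract only says that two scaling factors decouple the query, key, and value dimensions from the input. The sketch assumes that one factor (called `r_n` here) shrinks the number of key/value tokens and the other (`r_c`) shrinks the query/key channel width. The function name, both parameter names, the linear token-reduction map, and the random weights standing in for learned projections are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scalable_self_attention(x, r_n, r_c, rng):
    """Single-head attention with decoupled token and channel sizes.

    x   : (N, C) array of input tokens.
    r_n : assumed spatial factor; keys/values use about N * r_n tokens.
    r_c : assumed channel factor; queries/keys use about C * r_c channels.
    rng : numpy Generator; random weights stand in for learned ones.
    """
    n, c = x.shape
    n_kv = max(1, int(n * r_n))   # number of key/value tokens
    c_qk = max(1, int(c * r_c))   # query/key channel width

    # Hypothetical projections (learned in a real model).
    w_q = rng.standard_normal((c, c_qk)) / np.sqrt(c)
    w_k = rng.standard_normal((c, c_qk)) / np.sqrt(c)
    w_v = rng.standard_normal((c, c)) / np.sqrt(c)
    # Assumption: tokens are reduced by a linear map over the token axis.
    w_n = rng.standard_normal((n_kv, n)) / np.sqrt(n)

    x_kv = w_n @ x                # (n_kv, C)
    q = x @ w_q                   # (N, c_qk)
    k = x_kv @ w_k                # (n_kv, c_qk)
    v = x_kv @ w_v                # (n_kv, C)

    attn = softmax(q @ k.T / np.sqrt(c_qk), axis=-1)  # (N, n_kv)
    return attn @ v, attn         # output keeps the input shape (N, C)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
out, attn = scalable_self_attention(tokens, r_n=0.25, r_c=0.5, rng=rng)
print(out.shape)   # (16, 8)
print(attn.shape)  # (16, 4)
```

Under these assumptions the attention map has shape N × (N · r_n) rather than N × N, so its cost is set by the two factors rather than fixed by the input size, while the output keeps the input's shape.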
Taxonomy
Topics: Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Dropout · Layer Normalization
