ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer

Rui Yang; Hailong Ma; Jie Wu; Yansong Tang; Xuefeng Xiao; Min Zheng; Xiu Li

arXiv:2203.10790·cs.CV·July 19, 2022

TL;DR

ScalableViT introduces a scalable self-attention mechanism and an interactive window-based approach to improve global context understanding and object sensitivity in vision transformers, achieving state-of-the-art results.

Contribution

It proposes a novel scalable self-attention and an interactive window-based module, enhancing context-oriented generalization in vision transformers.

Findings

01 Outperforms Twins-SVT-S by 1.4% on ImageNet-1K

02 Outperforms Swin-T by 1.8% on ImageNet-1K

03 Achieves state-of-the-art performance in vision tasks

Abstract

The vanilla self-attention mechanism inherently relies on pre-defined and steadfast computational dimensions. Such inflexibility restricts it from possessing context-oriented generalization that can bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release dimensions of query, key, and value matrices while unbinding them with the input. This scalability fetches context-oriented generalization and enhances object sensitivity, which pushes the whole network into a more effective trade-off state between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking the SSA…
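
The effect of the two scaling factors on the attention shapes can be illustrated with a small bookkeeping sketch. The abstract only says that two factors release the dimensions of the query, key, and value matrices; the specific assignment used here (a spatial factor `r_n` applied to the token dimension of K and V, and a channel factor `r_c` applied to the channel dimension of Q and K) is an assumption made for illustration, as are the function names `ssa_shapes` and `attention_macs`. The sketch counts only the multiply-accumulates of the two attention matrix products and ignores the projections that produce Q, K, and V.

```python
def ssa_shapes(n, c, r_n=1.0, r_c=1.0):
    """Shapes of Q, K, V, the attention map, and the output.

    n   : number of input tokens
    c   : number of input channels
    r_n : assumed spatial scaling factor (token dimension of K and V)
    r_c : assumed channel scaling factor (channel dimension of Q and K)
    """
    n_scaled = round(n * r_n)
    c_scaled = round(c * r_c)
    return {
        "Q": (n, c_scaled),
        "K": (n_scaled, c_scaled),
        "V": (n_scaled, c),
        "attn": (n, n_scaled),  # softmax(Q K^T) has one row per query token
        "out": (n, c),          # attn @ V restores the input shape
    }


def attention_macs(n, c, r_n=1.0, r_c=1.0):
    """Multiply-accumulate count of Q K^T plus attn @ V (projections excluded)."""
    s = ssa_shapes(n, c, r_n, r_c)
    qk = s["Q"][0] * s["Q"][1] * s["K"][0]
    av = s["attn"][0] * s["attn"][1] * s["V"][1]
    return qk + av


if __name__ == "__main__":
    # Hypothetical example: a 56x56 feature map (3136 tokens) with 64 channels.
    print(ssa_shapes(3136, 64, r_n=0.25, r_c=1.0))
    print(attention_macs(3136, 64), attention_macs(3136, 64, r_n=0.25))
```

With `r_n = r_c = 1` the shapes reduce to those of vanilla self-attention. Under these assumptions, shrinking `r_n` reduces the cost of both matrix products linearly, while `r_c` affects only the `Q K^T` product, and the output keeps the input shape in either case.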

Taxonomy

Topics: Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Dropout · Layer Normalization