SimViT: Exploring a Simple Vision Transformer with sliding windows
Gang Li, Di Xu, Xing Cheng, Lingyu Si, Changwen Zheng

TL;DR
SimViT is a simple vision Transformer that incorporates local spatial structure using sliding windows and Multi-head Central Self-Attention (MCSA); its smallest variant, SimViT-Micro, reaches 71.1% top-1 accuracy on ImageNet-1k with only 3.3M parameters.
Contribution
Introduces SimViT, a vision Transformer with Multi-head Central Self-Attention and sliding windows to better capture local relations and spatial structure.
Findings
SimViT-Micro achieves 71.1% top-1 accuracy on ImageNet-1k with only 3.3M parameters.
SimViT outperforms some existing models in efficiency and effectiveness.
The model is suitable as a general-purpose backbone for various vision tasks.
Abstract
Although vision Transformers have achieved excellent performance as backbone models in many vision tasks, most of them aim to capture global relations among all tokens in an image or a window, which disrupts the inherent spatial and local correlations between patches in the 2D structure. In this paper, we introduce a simple vision Transformer named SimViT to incorporate spatial structure and local information into vision Transformers. Specifically, we introduce Multi-head Central Self-Attention (MCSA) instead of conventional Multi-head Self-Attention to capture highly local relations. The introduction of sliding windows facilitates the capture of spatial structure. Meanwhile, SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks. Extensive experiments show that SimViT is effective and efficient as a general-purpose backbone model for various…
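The abstract's idea of attention restricted to a sliding window can be illustrated with a minimal sketch. It assumes that each position's query attends only to the keys and values inside a small window centred on that position. This is a reading of the abstract, not the authors' implementation: it uses a single head, omits the learned query/key/value projections, and simply skips out-of-bounds neighbours at the image border. The function name `central_window_attention` and its parameters are hypothetical.

```python
import math


def central_window_attention(q, k, v, window=3):
    """Single-head attention where each grid position attends only to its
    window x window neighbourhood (illustrative sketch, not the paper's code).

    q, k, v: H x W grids (nested lists) of equal-length feature vectors.
    Border handling is an assumption: out-of-bounds neighbours are skipped.
    """
    height, width = len(q), len(q[0])
    dim = len(q[0][0])
    radius = window // 2
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor

    out = [[None] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # In-bounds positions of the window centred on (i, j).
            neighbours = [
                (ni, nj)
                for ni in range(i - radius, i + radius + 1)
                for nj in range(j - radius, j + radius + 1)
                if 0 <= ni < height and 0 <= nj < width
            ]
            # Scaled dot product between the central query and each local key.
            scores = [
                scale * sum(a * b for a, b in zip(q[i][j], k[ni][nj]))
                for ni, nj in neighbours
            ]
            # Numerically stable softmax over the window only.
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            # Weighted sum of the local values.
            out[i][j] = [
                sum(w * v[ni][nj][d] for w, (ni, nj) in zip(weights, neighbours))
                for d in range(dim)
            ]
    return out
```

With `window=1` each position attends only to itself, so the output equals `v`. Larger windows mix in more neighbours, but the cost per position depends on the window size rather than on the total number of tokens, which is the practical appeal of keeping attention local.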
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Brain Tumor Detection and Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Label Smoothing · Byte Pair Encoding · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Adam
