SimViT: Exploring a Simple Vision Transformer with sliding windows

Gang Li; Di Xu; Xing Cheng; Lingyu Si; Changwen Zheng

arXiv:2112.13085·cs.CV·December 28, 2021


TL;DR

SimViT is a simple vision Transformer that incorporates local spatial structure through sliding windows and Multi-head Central Self-Attention. Its SimViT-Micro variant reaches 71.1% top-1 accuracy on ImageNet-1k with only 3.3M parameters.

Contribution

Introduces SimViT, a vision Transformer with Multi-head Central Self-Attention and sliding windows to better capture local relations and spatial structure.

Findings

1. SimViT-Micro achieves 71.1% top-1 accuracy on ImageNet-1k with only 3.3M parameters.

2. SimViT outperforms some existing models in efficiency and effectiveness.

3. The model is suitable as a general-purpose backbone for various vision tasks.

Abstract

Although vision Transformers have achieved excellent performance as backbone models in many vision tasks, most of them aim to capture global relations among all tokens in an image or a window, which disrupts the inherent spatial and local correlations between patches in the 2D structure. In this paper, we introduce a simple vision Transformer named SimViT, which incorporates spatial structure and local information into vision Transformers. Specifically, we introduce Multi-head Central Self-Attention (MCSA) in place of conventional Multi-head Self-Attention to capture highly local relations. The introduction of sliding windows facilitates the capture of spatial structure. Meanwhile, SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks. Extensive experiments show that SimViT is effective and efficient as a general-purpose backbone model for various…
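The idea of attending only within a sliding window centred on each token can be illustrated with a minimal, single-head sketch in plain Python. This is not the authors' implementation: the function name `central_self_attention`, the 3x3 default window, the omission of learned query/key/value projections and multiple heads, and the choice to skip out-of-bounds neighbours (rather than zero-pad) are all assumptions made for illustration.

```python
import math


def central_self_attention(q, k, v, height, width, window=3):
    """Hypothetical single-head sketch of window-centred attention.

    q, k, v are row-major lists of height*width token vectors.
    Each token's query attends only to the keys/values inside the
    window x window neighbourhood centred on that token. Neighbours
    that fall outside the grid are skipped (an assumption).
    """
    dim = len(q[0])
    radius = window // 2
    out = []
    for i in range(height):
        for j in range(width):
            query = q[i * width + j]
            # Flat indices of the in-bounds neighbours of token (i, j).
            idx = [
                y * width + x
                for y in range(i - radius, i + radius + 1)
                for x in range(j - radius, j + radius + 1)
                if 0 <= y < height and 0 <= x < width
            ]
            # Scaled dot-product scores against the local keys only.
            scores = [
                sum(a * b for a, b in zip(query, k[n])) / math.sqrt(dim)
                for n in idx
            ]
            # Numerically stable softmax over the local window.
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            # Weighted sum of the local values.
            out.append([
                sum(w * v[n][c] for w, n in zip(weights, idx)) / total
                for c in range(len(v[0]))
            ])
    return out
```

Under these assumptions, each query scores at most `window * window` keys, so the cost grows linearly with the number of tokens rather than quadratically as in global self-attention. With `window=1` every token attends only to itself and the output equals `v`.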


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Brain Tumor Detection and Classification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Label Smoothing · Byte Pair Encoding · Softmax · Dense Connections · Position-Wise Feed-Forward Layer · Adam