CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Xiaoyi Dong; Jianmin Bao; Dongdong Chen; Weiming Zhang; Nenghai Yu; Lu Yuan; Dong Chen; Baining Guo

arXiv:2107.00652 · cs.CV · January 11, 2022 · 101 cites

TL;DR

CSWin Transformer introduces a cross-shaped window self-attention mechanism and a hierarchical structure, achieving state-of-the-art performance on multiple vision tasks with efficient computation and flexible input resolution handling.

Contribution

The paper proposes cross-shaped window self-attention and a hierarchical design, improving on previous vision Transformers in both performance and efficiency.

Findings

01. Achieves 85.4% Top-1 accuracy on ImageNet-1K.
02. Surpasses the previous state-of-the-art Swin Transformer on COCO detection and ADE20K segmentation.
03. Demonstrates strong performance with larger pretraining datasets.

Abstract

We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles the local…
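The stripe construction described in the abstract can be illustrated with a small index-level sketch. It is not the authors' implementation: the function name `cross_shaped_window`, the `(row, col)` token indexing, and the requirement that the stripe width `sw` divides both sides of the feature map are assumptions made for illustration. The abstract does not spell out how the horizontal and vertical results are combined, so the sketch returns the two stripes separately and only counts the positions each one covers.

```python
def cross_shaped_window(row, col, height, width, sw):
    """Return the horizontal and vertical stripes that contain token (row, col).

    Assumption: the feature map is split into non-overlapping stripes of equal
    width `sw`, so `height` and `width` must both be divisible by `sw`.
    """
    if sw <= 0 or height % sw or width % sw:
        raise ValueError("sw must be positive and divide both height and width")
    if not (0 <= row < height and 0 <= col < width):
        raise ValueError("token position lies outside the feature map")

    top = (row // sw) * sw    # first row of the horizontal stripe
    left = (col // sw) * sw   # first column of the vertical stripe

    # Horizontal stripe: `sw` full rows. Vertical stripe: `sw` full columns.
    horizontal = {(r, c) for r in range(top, top + sw) for c in range(width)}
    vertical = {(r, c) for r in range(height) for c in range(left, left + sw)}
    return horizontal, vertical


if __name__ == "__main__":
    H, W = 8, 8  # hypothetical feature-map size
    for sw in (1, 2, 4, 8):
        h, v = cross_shaped_window(3, 5, H, W, sw)
        print(f"sw={sw}: horizontal={len(h)}, vertical={len(v)}, "
              f"union={len(h | v)}, global={H * W}")
```

On an 8×8 map with `sw=2`, the token at (3, 5) reaches 16 positions through each stripe and 28 distinct positions through the cross, compared with 64 under global attention. With `sw=8` each stripe already spans the whole map, which shows how the stripe width trades the attention area against cost.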

Code & Models

Models

birder-project/cswin_transformer_s_eu-common (model, 39 downloads)

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Advanced Memory and Neural Computing

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Stochastic Depth · Swin Transformer · Adam · Byte Pair Encoding · Layer Normalization · Dropout