Swin Transformer V2: Scaling Up Capacity and Resolution
Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo

TL;DR
This paper introduces Swin Transformer V2, a large-scale vision model whose training techniques improve stability, ease transfer across image resolutions, and reduce the need for labelled data, achieving state-of-the-art results efficiently.
Contribution
It proposes new methods for training stability, resolution transfer, and self-supervised pre-training, enabling the training of the largest dense vision model to date with high efficiency.
Findings
Trained a 3 billion-parameter Swin Transformer V2 model.
Achieved new records on four major vision benchmarks.
Required 40 times less labelled data and 40 times less training time than earlier billion-parameter vision models.
Abstract
Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger for labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast labelled images. Through these techniques, this paper…
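The first two techniques named in the abstract can be illustrated with a minimal pure-Python sketch. The abstract gives no formulas, so the forms used here are assumptions: attention logits computed as cosine similarity divided by a temperature `tau` plus an optional bias, and relative offsets mapped to log space as sign(d) * log(1 + |d|). The function names, the fixed `tau` value, and the precomputed `bias` argument are hypothetical simplifications; in a trained model the temperature and the bias would be learned.

```python
import math


def log_spaced_coord(delta):
    """Map a linear relative offset to log space: sign(d) * log(1 + |d|).

    Assumed form of the log-spaced coordinate; not stated in the abstract.
    """
    return math.copysign(math.log1p(abs(delta)), delta)


def _unit(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cosine_attention_logits(queries, keys, tau=0.1, bias=None):
    """Return logits[i][j] = cos(q_i, k_j) / tau + bias[i][j].

    `tau` is a plain float here (hypothetical default); `bias` is an
    optional precomputed matrix standing in for a learned position bias.
    """
    q_hat = [_unit(q) for q in queries]
    k_hat = [_unit(k) for k in keys]
    logits = []
    for i, q in enumerate(q_hat):
        row = []
        for j, k in enumerate(k_hat):
            cos = sum(a * b for a, b in zip(q, k))
            b_ij = bias[i][j] if bias is not None else 0.0
            row.append(cos / tau + b_ij)
        logits.append(row)
    return logits


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    q = [[1.0, 2.0], [0.5, -1.0]]
    k = [[2.0, 1.0], [-1.0, 3.0]]
    for row in cosine_attention_logits(q, k):
        print(softmax(row))
    print(log_spaced_coord(7), log_spaced_coord(15))
```

Because queries and keys are normalized, multiplying a query by any positive factor leaves the logits unchanged, which gives one intuition for why this form of attention may be less prone to extreme values. For the log-spaced offsets, moving from a window of 8 (offsets up to 7) to a window of 16 (offsets up to 15) grows the largest linear offset by about 1.14 times, while the largest log-spaced offset grows from log 8 to log 16, an increase of one third.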
Code & Models
- 🤗 laion/CLIP-ViT-L-14-laion2B-s32B-b82K · model · 275k dl · ♡ 63
- 🤗 microsoft/swinv2-tiny-patch4-window8-256 · model · 14k dl · ♡ 11
- 🤗 microsoft/swinv2-tiny-patch4-window16-256 · model · 382k dl · ♡ 13
- 🤗 microsoft/swinv2-small-patch4-window8-256 · model · 445 dl
- 🤗 microsoft/swinv2-small-patch4-window16-256 · model · 1.4k dl · ♡ 1
- 🤗 microsoft/swinv2-base-patch4-window8-256 · model · 3.2k dl · ♡ 7
- 🤗 microsoft/swinv2-base-patch4-window16-256 · model · 7.6k dl · ♡ 5
- 🤗 microsoft/swinv2-base-patch4-window12-192-22k · model · 6.2k dl · ♡ 4
- 🤗 microsoft/swinv2-large-patch4-window12-192-22k · model · 3.1k dl · ♡ 10
- 🤗 microsoft/swinv2-base-patch4-window12to16-192to256-22kto1k-ft · model · 167 dl · ♡ 1
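The `swinv2-*` identifiers in the list above appear to pack several settings into the name. The sketch below splits such a name into its parts. The field meanings (model size, patch size, window size, input resolution, dataset tag, and `XtoY` marking a change between pre-training and fine-tuning) are an assumed reading of the naming pattern and are not confirmed by this page. `parse_swinv2_id` is a hypothetical helper, not part of any library.

```python
import re

# Assumed naming pattern, inferred only from the identifiers listed above.
_SWINV2_NAME = re.compile(
    r"swinv2-(?P<size>[a-z]+)"
    r"-patch(?P<patch>\d+)"
    r"-window(?P<win>\d+)(?:to(?P<win_ft>\d+))?"
    r"-(?P<res>\d+)(?:to(?P<res_ft>\d+))?"
    r"(?:-(?P<data>\d+k)(?:to(?P<data_ft>\d+k))?)?"
    r"(?P<ft>-ft)?"
)


def _maybe_int(text):
    """Convert an optional regex group to int, keeping None as None."""
    return int(text) if text is not None else None


def parse_swinv2_id(model_id):
    """Split an identifier such as 'microsoft/swinv2-tiny-patch4-window8-256'.

    Raises ValueError for names that do not follow the assumed pattern.
    """
    name = model_id.split("/")[-1]  # drop the organization prefix, if any
    match = _SWINV2_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognized swinv2 identifier: {model_id!r}")
    return {
        "size": match["size"],
        "patch_size": int(match["patch"]),
        "window": int(match["win"]),
        "window_finetune": _maybe_int(match["win_ft"]),
        "resolution": int(match["res"]),
        "resolution_finetune": _maybe_int(match["res_ft"]),
        "dataset_tag": match["data"],          # raw tag, e.g. "22k"
        "dataset_tag_finetune": match["data_ft"],
        "fine_tuned": match["ft"] is not None,
    }


if __name__ == "__main__":
    print(parse_swinv2_id(
        "microsoft/swinv2-base-patch4-window12to16-192to256-22kto1k-ft"
    ))
```

The first entry in the list, `laion/CLIP-ViT-L-14-laion2B-s32B-b82K`, does not follow this pattern, so the helper raises `ValueError` for it rather than guessing.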
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding
