Swin Transformer V2: Scaling Up Capacity and Resolution

Ze Liu; Han Hu; Yutong Lin; Zhuliang Yao; Zhenda Xie; Yixuan Wei; Jia Ning; Yue Cao; Zheng Zhang; Li Dong; Furu Wei; and Baining Guo

arXiv:2111.09883 · cs.CV · April 12, 2022 · 127 cites

Open Access · 5 Repos · 10 Models

TL;DR

This paper introduces Swin Transformer V2, a large-scale vision model trained with techniques that improve training stability, ease transfer from low-resolution pre-training to high-resolution downstream tasks, and reduce the need for labeled data, setting new records on four major vision benchmarks.

Contribution

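The paper proposes three techniques: residual-post-norm combined with cosine attention for training stability, a log-spaced continuous position bias for resolution transfer, and SimMIM self-supervised pre-training to reduce reliance on labeled images. Together they enable efficient training of the largest dense vision model to date.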

Findings

01. Trained a 3 billion-parameter Swin Transformer V2 model.

02. Set new records on four major vision benchmarks.

03. Used 40 times less labeled data and 40 times less training time than Google's billion-level visual models.

Abstract

Large-scale NLP models have been shown to significantly improve performance on language tasks with no signs of saturation. They also demonstrate remarkable few-shot capabilities similar to those of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in the training and application of large vision models: training instability, resolution gaps between pre-training and fine-tuning, and hunger for labeled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained on low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images. Through these techniques, this paper…
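
To make two of the techniques named in the abstract more concrete, here is a minimal sketch in plain Python. It assumes the commonly cited formulation in which the attention logit is a cosine similarity divided by a temperature `tau` (clamped from below) plus a position bias, and in which relative offsets are mapped to `sign(d) * log(1 + |d|)` before being fed to a small network that produces the bias. The function names, the default `tau`, and the 0.01 floor are illustrative assumptions, not the authors' released code, and the bias network itself is omitted.

```python
import math


def cosine_attention_logit(q, k, tau=0.1, bias=0.0):
    """Attention logit for one query/key pair using scaled cosine similarity.

    Assumed form: cos(q, k) / tau + bias, with tau clamped to at least 0.01.
    Unlike a raw dot product, the result does not grow with vector magnitude.
    """
    dot = sum(a * b for a, b in zip(q, k))
    q_norm = math.sqrt(sum(a * a for a in q))
    k_norm = math.sqrt(sum(b * b for b in k))
    # Guard against division by zero for all-zero vectors.
    cos = dot / max(q_norm * k_norm, 1e-12)
    return cos / max(tau, 0.01) + bias


def log_spaced_coord(delta):
    """Map a relative offset to log space: sign(delta) * log(1 + |delta|)."""
    return math.copysign(math.log1p(abs(delta)), delta)


# Scaling the key by 100 leaves the cosine logit essentially unchanged.
small = cosine_attention_logit([1.0, 2.0], [3.0, 4.0])
large = cosine_attention_logit([1.0, 2.0], [300.0, 400.0])

# Growing a window from 8x8 to 16x16 widens the offset range from 7 to 15.
linear_ratio = 15 / 7
log_ratio = log_spaced_coord(15) / log_spaced_coord(7)
```

In this sketch the linear offset range grows by a factor of about 2.14 when the window doubles, while the log-spaced range grows by only about 1.33, so a bias network trained on the smaller window has to extrapolate far less. The residual-post-norm part of technique 1 concerns where normalization sits in each residual block and is not shown here.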

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding