Self-Supervised Learning with Swin Transformers
Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, Han Hu

TL;DR
This paper introduces MoBY, a self-supervised learning method using Swin Transformers, achieving high accuracy on ImageNet-1K and enabling evaluation on dense prediction tasks, thus broadening the assessment of Transformer-based models.
Contribution
It presents a simple yet effective self-supervised learning approach with Swin Transformers, extending evaluation to downstream dense prediction tasks.
Findings
Achieves 72.8% and 75.0% top-1 accuracy on ImageNet-1K linear evaluation with DeiT-S and Swin-T, respectively, after 300 epochs of training.
Enables evaluation of learned representations on object detection and segmentation.
Performs slightly better than MoCo v3 and DINO, which use DeiT as the backbone, while relying on much lighter tricks.
Abstract
We are witnessing a modeling shift from CNN to Transformers in computer vision. In this work, we present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture. The approach basically has no new inventions, which is combined from MoCo v2 and BYOL and tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation: 72.8% and 75.0% top-1 accuracy using DeiT-S and Swin-T, respectively, by 300-epoch training. The performance is slightly better than recent works of MoCo v3 and DINO which adopt DeiT as the backbone, but with much lighter tricks. More importantly, the general-purpose Swin Transformer backbone enables us to also evaluate the learnt representations on downstream tasks such as object detection and semantic segmentation, in contrast to a few recent approaches built on ViT/DeiT which only report linear evaluation results…
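As a rough illustration of the two ingredients the abstract names (MoCo v2 and BYOL), the sketch below shows a momentum (EMA) update of a target encoder and a contrastive loss computed against a queue of negative keys. It is a NumPy toy under stated assumptions, not the authors' implementation: the function names, the dictionary-of-arrays weight format, the momentum value 0.99, and the temperature 0.2 are illustrative choices that do not come from this page.

```python
import numpy as np


def momentum_update(online, target, m=0.99):
    """Move target-encoder weights toward the online encoder by an EMA step.

    `online` and `target` are hypothetical dicts mapping names to arrays.
    """
    return {name: m * target[name] + (1.0 - m) * online[name] for name in target}


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def contrastive_loss(q, k, queue, tau=0.2):
    """InfoNCE-style loss: each query should match its own key, not the queue."""
    q, k, queue = l2_normalize(q), l2_normalize(k), l2_normalize(queue)
    pos = np.sum(q * k, axis=1, keepdims=True)   # (N, 1) positive similarities
    neg = q @ queue.T                            # (N, K) negative similarities
    logits = np.concatenate([pos, neg], axis=1) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())         # the positive sits in column 0


def symmetric_loss(q1, k2, q2, k1, queue, tau=0.2):
    """Sum the loss over both view pairings (an assumed symmetrization)."""
    return contrastive_loss(q1, k2, queue, tau) + contrastive_loss(q2, k1, queue, tau)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16))             # stand-in for encoder outputs
    queue = rng.normal(size=(32, 16))            # stand-in for queued keys
    print("loss:", symmetric_loss(feats, feats, feats, feats, queue))
```

In this toy the online and target features are plain random arrays; in a real setup they would come from the online encoder and the momentum-updated target encoder applied to two augmented views of the same image.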
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Vision Transformer · Linear Layer · MoCo v3 · AdamW · DropPath · MoBY · Absolute Position Encodings · Position-Wise Feed-Forward Layer
