Semi-supervised Vision Transformers at Scale
Zhaowei Cai, Avinash Ravichandran, Paolo Favaro, Manchen Wang, Davide Modolo, Rahul Bhotika, Zhuowen Tu, Stefano Soatto

TL;DR
This paper introduces Semi-ViT, a semi-supervised learning pipeline for vision transformers that combines self-supervised pre-training, EMA-Teacher fine-tuning, and a probabilistic pseudo mixup, reaching 80% top-1 accuracy on ImageNet with only 1% of the labels.
Contribution
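The paper proposes a semi-supervised learning framework for vision transformers. It replaces FixMatch with an EMA-Teacher approach, which the authors report to be more stable and more accurate for ViTs, and adds a probabilistic pseudo mixup that interpolates unlabeled samples and their pseudo labels for stronger regularization.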
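As a rough illustration of the EMA-Teacher idea, the sketch below shows the generic exponential-moving-average update, in which the teacher's weights slowly track the student's. It is a minimal sketch, not the authors' code: the function name `ema_update`, the flat list-of-floats weight representation, and the momentum value 0.999 are all assumptions made for this example.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return updated teacher weights as an exponential moving average.

    Each weight becomes momentum * teacher + (1 - momentum) * student.
    Weights are plain lists of floats here (an assumption for illustration;
    a real model would hold tensors per parameter).
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Usage: the teacher moves a small step toward the student after each update.
teacher_weights = [1.0, 0.0]
student_weights = [0.0, 1.0]
teacher_weights = ema_update(teacher_weights, student_weights, momentum=0.9)
```

In a setup like this, only the student is trained by gradient descent; the teacher is refreshed with the update above and supplies pseudo labels for the unlabeled data.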
Findings
Semi-ViT achieves 80% top-1 accuracy on ImageNet with only 1% of the labels.
Semi-ViT matches or outperforms CNN-based semi-supervised methods.
Semi-ViT scales to large vision transformer models.
Abstract
We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN…
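The pseudo mixup step described in the abstract can be pictured as ordinary mixup applied to unlabeled inputs and their pseudo-label distributions. The sketch below is illustrative only: the abstract does not spell out how the mixing ratio is chosen, so `sample_lam` simply draws it from a Beta distribution as in standard mixup, and the names `pseudo_mixup` and `sample_lam`, the list-based inputs, and `alpha=0.8` are assumptions rather than details from the paper.

```python
import random


def pseudo_mixup(x_a, p_a, x_b, p_b, lam):
    """Interpolate two unlabeled samples and their pseudo labels.

    x_a, x_b: flattened inputs (lists of floats of equal length).
    p_a, p_b: pseudo-label class distributions (lists of floats of equal length).
    lam: mixing ratio in [0, 1]; the result is lam * a + (1 - lam) * b.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    def mix(u, v):
        return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]

    return mix(x_a, x_b), mix(p_a, p_b)


def sample_lam(alpha=0.8, rng=random):
    """Draw a mixing ratio from Beta(alpha, alpha), as in standard mixup.

    This is a stand-in assumption; it is not the paper's probabilistic rule.
    """
    return rng.betavariate(alpha, alpha)


# Usage: mix two toy samples whose pseudo labels point at different classes.
mixed_x, mixed_p = pseudo_mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
```

Because both pseudo labels are probability distributions and the mix is a convex combination, the mixed label remains a valid distribution that the student can be trained against.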
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Infrared Target Detection Methodologies
Methods: 1x1 Convolution · Convolution · Dropout · Inception-A · Average Pooling · Max Pooling · Inception-C · Reduction-A · FixMatch · Inception-B
