Semi-supervised Vision Transformers at Scale

Zhaowei Cai; Avinash Ravichandran; Paolo Favaro; Manchen Wang; Davide Modolo; Rahul Bhotika; Zhuowen Tu; Stefano Soatto

arXiv:2208.05688·cs.CV·August 12, 2022·21 cites

TL;DR

This paper introduces Semi-ViT, a semi-supervised learning pipeline for vision transformers that combines self-supervised pre-training, EMA-Teacher fine-tuning, and a probabilistic pseudo mixup, achieving high accuracy with limited labels.

Contribution

The paper proposes a novel semi-supervised learning framework for vision transformers, including a probabilistic pseudo mixup and an EMA-Teacher approach, improving stability and scalability.
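As a rough illustration of the EMA-Teacher idea, the sketch below shows the standard exponential-moving-average weight update, in which the teacher's parameters trail the student's. The function name `ema_update`, the plain-dictionary parameter store, and the momentum value are assumptions made for this example; they are not taken from the paper or from the amazon-science/semi-vit repository.

```python
def ema_update(teacher, student, momentum=0.999):
    """Move each teacher parameter a small step toward the student's value.

    `teacher` and `student` are hypothetical dicts mapping parameter names
    to floats; a real model would hold tensors instead.
    """
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value
    return teacher


# Toy usage: with momentum 0.9 the teacher keeps 90% of its old value.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
ema_update(teacher, student, momentum=0.9)
```

In a setup of this kind, only the student receives gradient updates; the teacher is refreshed by the averaging step after each student update and is the model used to produce pseudo labels.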

Findings

01

Semi-ViT achieves 80% top-1 accuracy on ImageNet using only 1% of the labels.

02

Semi-ViT outperforms CNN-based semi-supervised methods.

03

Semi-ViT scales to large vision transformer models.

Abstract

We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN…
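The abstract describes interpolating unlabeled samples together with their pseudo labels. A minimal sketch of one way this could look is given below. The abstract does not specify how the mixing ratio is chosen, so deriving it from the teacher's confidences (`conf_a / (conf_a + conf_b)`) is an assumption of this example, as are all function names and the use of plain Python lists in place of image tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_logits):
    """Return (predicted class index, confidence) from teacher logits."""
    probs = softmax(teacher_logits)
    confidence = max(probs)
    return probs.index(confidence), confidence


def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def pseudo_mixup(x_a, x_b, logits_a, logits_b):
    """Mix two unlabeled samples and their pseudo labels.

    ASSUMPTION: the mixing ratio is set by the relative teacher
    confidence, so the more confident sample contributes more.
    """
    label_a, conf_a = pseudo_label(logits_a)
    label_b, conf_b = pseudo_label(logits_b)
    lam = conf_a / (conf_a + conf_b)
    num_classes = len(logits_a)
    y_a = one_hot(label_a, num_classes)
    y_b = one_hot(label_b, num_classes)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam
```

The mixed target `y_mix` is a soft label that still sums to one, so it can be used with a soft-target cross-entropy loss on the student's prediction for `x_mix`.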

Code & Models

Repositories

amazon-science/semi-vit
pytorch

Videos

Semi-supervised Vision Transformers at Scale · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Infrared Target Detection Methodologies

Methods: 1x1 Convolution · Convolution · Dropout · Inception-A · Average Pooling · Max Pooling · Inception-C · Reduction-A · FixMatch · Inception-B