Semi-Supervised Vision Transformers
Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang

TL;DR
This paper introduces Semiformer, a semi-supervised learning framework that pairs a transformer stream with a convolutional stream. It improves Vision Transformer performance when labeled data is limited and reaches 75.5% top-1 accuracy on ImageNet, which the authors report as state of the art.
Contribution
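The paper proposes Semiformer, a semi-supervised framework that joins a transformer stream and a convolutional stream through a fusion module for knowledge sharing. The convolutional stream is trained on the limited labeled data and then generates pseudo labels that supervise the transformer stream on unlabeled data.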
Findings
Semiformer achieves 75.5% top-1 accuracy on ImageNet.
It outperforms the prior state of the art for semi-supervised image classification on ImageNet.
The framework is compatible with various transformer and CNN architectures.
Abstract
We study the training of Vision Transformers for semi-supervised image classification. Transformers have recently demonstrated impressive performance on a multitude of supervised learning tasks. Surprisingly, we show Vision Transformers perform significantly worse than Convolutional Neural Networks when only a small set of labeled data is available. Inspired by this observation, we introduce a joint semi-supervised learning framework, Semiformer, which contains a transformer stream, a convolutional stream and a carefully designed fusion module for knowledge sharing between these streams. The convolutional stream is trained on limited labeled data and further used to generate pseudo labels to supervise the training of the transformer stream on unlabeled data. Extensive experiments on ImageNet demonstrate that Semiformer achieves 75.5% top-1 accuracy, outperforming the state-of-the-art by…
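The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the confidence threshold, the masking of low-confidence samples, and the function names (`pseudo_label`, `unlabeled_loss`) are assumptions made for illustration, and the fusion module is not modeled. The sketch only shows a convolutional stream's predictions being turned into hard labels that supervise a transformer stream's predictions on unlabeled inputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(conv_logits, threshold=0.7):
    """Return (class_index, confidence) from the convolutional stream,
    or None when the top probability is below `threshold`.
    The threshold is a hypothetical choice, not taken from the paper."""
    probs = softmax(conv_logits)
    conf = max(probs)
    if conf < threshold:
        return None
    return probs.index(conf), conf


def cross_entropy(logits, target):
    """Cross-entropy of one prediction against a hard label."""
    return -math.log(softmax(logits)[target])


def unlabeled_loss(conv_logits_batch, transformer_logits_batch, threshold=0.7):
    """Mean cross-entropy of the transformer stream against the
    convolutional stream's pseudo labels. Samples without a confident
    pseudo label contribute zero (an assumed masking scheme)."""
    total = 0.0
    for conv_logits, trans_logits in zip(conv_logits_batch, transformer_logits_batch):
        label = pseudo_label(conv_logits, threshold)
        if label is not None:
            total += cross_entropy(trans_logits, label[0])
    return total / max(len(conv_logits_batch), 1)


if __name__ == "__main__":
    # Two unlabeled samples: the first is confidently class 0, the second is ambiguous.
    conv = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.1]]
    trans = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
    print(unlabeled_loss(conv, trans))
```

In this sketch only the first sample receives a pseudo label, so the second sample has no effect on the transformer stream's loss. A real training loop would compute these quantities on tensors and backpropagate only through the transformer stream's predictions.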
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Residual Connection · Dense Connections
