Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training
Haofei Zhang, Jiarui Duan, Mengqi Xue, Jie Song, Li Sun, Mingli Song

TL;DR
This paper introduces a bootstrapping training method that incorporates CNNs' inductive biases into Vision Transformers, enabling ViTs to perform well with limited data and reducing reliance on extensive pre-training.
Contribution
A novel bootstrapping training algorithm that jointly optimizes ViTs and CNN-based agents with weight sharing, embedding inductive biases into ViTs during training.
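As a rough illustration of what jointly optimizing two models over shared weights can look like, here is a toy sketch. It is not the paper's algorithm. The linear "models", the masks that stand in for a global ViT view and a local CNN-agent view, the consistency term weighted by `lam`, and all function names are assumptions made purely for illustration.

```python
# Toy sketch only: linear stand-ins for a ViT and its CNN agent, NOT the paper's models.
import random


def predict(w, mask, x):
    """Linear prediction that only uses the weights allowed by `mask`."""
    return sum(wi * mi * xi for wi, mi, xi in zip(w, mask, x))


def joint_step(w, x, y, vit_mask, agent_mask, lr=0.05, lam=0.5):
    """One gradient step on a joint loss over the SHARED weights `w`.

    loss = (p_vit - y)^2 + (p_agent - y)^2 + lam * (p_vit - p_agent)^2
    The last (consistency) term is an assumed coupling, not taken from the paper.
    """
    p_vit = predict(w, vit_mask, x)
    p_agent = predict(w, agent_mask, x)
    loss = (p_vit - y) ** 2 + (p_agent - y) ** 2 + lam * (p_vit - p_agent) ** 2
    new_w = []
    for wi, xi, mv, ma in zip(w, x, vit_mask, agent_mask):
        g_vit = mv * xi      # d p_vit / d w_i
        g_agent = ma * xi    # d p_agent / d w_i
        grad = (2 * (p_vit - y) * g_vit
                + 2 * (p_agent - y) * g_agent
                + 2 * lam * (p_vit - p_agent) * (g_vit - g_agent))
        new_w.append(wi - lr * grad)
    return new_w, loss


def train(steps=500, seed=0):
    """Train both views on synthetic data whose target depends on a local window."""
    rng = random.Random(seed)
    dim = 6
    vit_mask = [1] * dim               # "global" view: sees every input position
    agent_mask = [1, 1, 1, 0, 0, 0]    # "local" view: sees only a small window
    true_w = [1.0, -2.0, 0.5, 0.0, 0.0, 0.0]
    w = [0.0] * dim                    # one weight vector shared by both views
    losses = []
    for _ in range(steps):
        x = [rng.uniform(-1, 1) for _ in range(dim)]
        y = sum(t * xi for t, xi in zip(true_w, x))
        w, loss = joint_step(w, x, y, vit_mask, agent_mask)
        losses.append(loss)
    return w, losses


if __name__ == "__main__":
    weights, history = train()
    print("final loss:", history[-1])
    print("shared weights:", [round(v, 3) for v in weights])
```

Because both views read from the same weight vector, every update driven by the locality-restricted view also moves the unrestricted one, which is the intuition behind embedding an inductive bias through weight sharing. How the paper actually builds its agent and couples the two objectives is not specified on this page.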
Findings
ViTs with inductive biases converge faster on small datasets.
The proposed method outperforms CNNs while using fewer parameters.
The method is effective on CIFAR-10/100 and ImageNet-1k with limited training data.
Abstract
Recently, vision Transformers (ViTs) have been developing rapidly and are starting to challenge the dominance of convolutional neural networks (CNNs) in computer vision (CV). With the general-purpose Transformer architecture replacing the hard-coded inductive biases of convolution, ViTs have surpassed CNNs, especially when data is plentiful. However, ViTs are prone to overfitting on small datasets and thus rely on large-scale pre-training, which consumes enormous time. In this paper, we strive to liberate ViTs from pre-training by introducing CNNs' inductive biases back into ViTs while preserving their network architectures for a higher upper bound, and by setting up more suitable optimization objectives. To begin with, an agent CNN with inductive biases is designed based on the given ViT. Then a bootstrapping training algorithm is proposed to jointly optimize the agent and ViT with…
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Brain Tumor Detection and Classification
Methods: Attention Is All You Need · Linear Layer · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization · Byte Pair Encoding · Label Smoothing · Multi-Head Attention · Absolute Position Encodings · Softmax
