Structured Initialization for Vision Transformers
Jianqiao Zheng, Xueqian Li, Hemanth Saratchandran, Simon Lucey

TL;DR
This paper introduces an initialization method for Vision Transformers (ViTs) that embeds CNN-like inductive biases without any architectural change, improving performance on small datasets while preserving scalability to larger ones.
Contribution
The paper proposes a structured initialization strategy for ViTs that improves small-data performance without architectural changes and outperforms existing initialization heuristics.
Findings
Significantly outperforms standard ViT initialization on small/medium benchmarks.
Maintains competitive performance on large-scale datasets like ImageNet-1K.
Easily integrates into various transformer architectures with consistent gains.
Abstract
Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method…
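The abstract's motivating observation concerns random impulse filters in a CNN. As a rough illustration only, the sketch below builds such filters in NumPy, each a k×k kernel with a single 1 at a random position, and applies them depthwise. The function names (`random_impulse_filters`, `depthwise_conv_same`), the depthwise layout, and the zero padding are assumptions made for this example; this is not the authors' implementation, and the way the idea is carried over to attention weights is not described in the text above.

```python
import numpy as np

def random_impulse_filters(channels, kernel_size, seed=0):
    """Hypothetical helper: one k x k filter per channel, all zeros
    except a single 1 placed at a uniformly random position."""
    rng = np.random.default_rng(seed)
    filters = np.zeros((channels, kernel_size, kernel_size), dtype=np.float32)
    positions = rng.integers(0, kernel_size * kernel_size, size=channels)
    rows, cols = np.divmod(positions, kernel_size)
    filters[np.arange(channels), rows, cols] = 1.0
    return filters

def depthwise_conv_same(x, filters):
    """Zero-padded depthwise cross-correlation (odd kernel size assumed).
    x: (C, H, W) float array; filters: (C, k, k)."""
    c, h, w = x.shape
    k = filters.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            # Each tap contributes a shifted copy of the padded input.
            out += filters[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

filters = random_impulse_filters(channels=8, kernel_size=3, seed=0)
features = np.random.default_rng(1).normal(size=(8, 16, 16)).astype(np.float32)
mixed = depthwise_conv_same(features, filters)
```

Because each filter has a single nonzero tap, the convolution reduces to shifting every channel by its own fixed spatial offset, so the spatial filters themselves contain no learned values.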
Taxonomy
Topics: Advanced Vision and Imaging · CCD and CMOS Imaging Sensors · Robotics and Sensor-Based Localization
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Stochastic Depth · Residual Connection · Dense Connections · Swin Transformer · Average Pooling
