Structured Initialization for Vision Transformers

Jianqiao Zheng; Xueqian Li; Hemanth Saratchandran; Simon Lucey

arXiv:2505.19985 · cs.CV · December 9, 2025

TL;DR

This paper introduces an initialization method for Vision Transformers (ViTs) that embeds CNN-like inductive biases without any architectural change, improving performance on small datasets while preserving the ability to scale to larger ones.

Contribution

The paper proposes a new initialization strategy for ViTs that improves small-data performance without architectural changes and outperforms existing initialization heuristics.

Findings

01. Significantly outperforms standard ViT initialization on small/medium benchmarks.

02. Maintains competitive performance on large-scale datasets like ImageNet-1K.

03. Easily integrates into various transformer architectures with consistent gains.

Abstract

Convolutional Neural Networks (CNNs) inherently encode strong inductive biases, enabling effective generalization on small-scale datasets. In this paper, we propose integrating this inductive bias into ViTs, not through an architectural intervention but solely through initialization. The motivation here is to have a ViT that can enjoy strong CNN-like performance when data assets are small, but can still scale to ViT-like performance as the data expands. Our approach is motivated by our empirical results that random impulse filters can achieve commensurate performance to learned filters within a CNN. We improve upon current ViT initialization strategies, which typically rely on empirical heuristics such as using attention weights from pretrained models or focusing on the distribution of attention weights without enforcing structures. Empirical results demonstrate that our method…
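
The "random impulse filters" mentioned in the abstract can be read as convolution kernels with a single non-zero tap at a random position, so that applying one simply shifts its input. The following is a minimal pure-Python sketch under that reading. The helper names `random_impulse_kernel` and `correlate2d_same` are hypothetical, and the code illustrates only the filter concept, not the paper's initialization procedure for attention weights.

```python
import random

def random_impulse_kernel(k, rng):
    """Return a k x k kernel of zeros with a single 1 at a random position.

    Assumes k is odd so the kernel has a well-defined center.
    """
    kernel = [[0.0] * k for _ in range(k)]
    r, c = rng.randrange(k), rng.randrange(k)
    kernel[r][c] = 1.0
    return kernel, (r, c)

def correlate2d_same(image, kernel):
    """Cross-correlate image with kernel, zero padded, output same size as input."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    y, x = i + u - pad, j + v - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[u][v] * image[y][x]
            out[i][j] = acc
    return out

# Illustrative 4 x 4 input; the values are arbitrary.
rng = random.Random(0)
image = [[float(4 * i + j) for j in range(4)] for i in range(4)]
kernel, (r, c) = random_impulse_kernel(3, rng)

# The output is the input shifted by (r - 1, c - 1), with zeros at the border.
shifted = correlate2d_same(image, kernel)
```

Because each such filter only relocates pixel values, a bank of them mixes spatial neighbors without any learned weights, which is one way to picture the fixed, locality-biased structure the abstract refers to.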

Videos

Structured Initialization for Vision Transformers (SlidesLive)

Taxonomy

Topics: Advanced Vision and Imaging · CCD and CMOS Imaging Sensors · Robotics and Sensor-Based Localization

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Stochastic Depth · Residual Connection · Dense Connections · Swin Transformer · Average Pooling