Structured Initialization for Attention in Vision Transformers
Jianqiao Zheng, Xueqian Li, Simon Lucey

TL;DR
This paper introduces a structured initialization method for Vision Transformers (ViTs) that recasts the architectural inductive bias of CNNs as an initialization bias, enabling effective training on small datasets and achieving state-of-the-art results for data-efficient ViT learning.
Contribution
It reinterprets CNN inductive biases as initialization strategies for ViTs, improving data efficiency on small datasets without sacrificing scalability.
Findings
Achieves state-of-the-art performance for data-efficient ViT learning on CIFAR-10, CIFAR-100, and SVHN.
Structured initialization enables ViTs to perform well on small datasets.
Random impulse filters can match learned filters in CNNs.
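To make the term concrete, the sketch below builds a random impulse filter, assumed here to mean a kernel that is zero everywhere except for a single 1 at a random position, and applies it with a naive zero-padded cross-correlation. The helper names (`random_impulse_filter`, `correlate2d_same`) are hypothetical and do not come from the paper. The sketch only illustrates that such a filter shifts its input; it does not reproduce the paper's experiments.

```python
import numpy as np

def random_impulse_filter(k, rng):
    """Return a k x k kernel of zeros with a single 1 at a random position."""
    kernel = np.zeros((k, k))
    idx = rng.integers(k * k)           # flat index of the impulse
    kernel[idx // k, idx % k] = 1.0
    return kernel

def correlate2d_same(image, kernel):
    """Naive zero-padded 'same' cross-correlation (what a CNN layer computes)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)         # zero padding on every side
    out = np.zeros_like(image, dtype=float)
    height, width = image.shape
    for i in range(height):
        for j in range(width):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

rng = np.random.default_rng(0)
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = random_impulse_filter(3, rng)

# Filtering with an impulse copies each pixel from one fixed neighbouring
# position, so the output is a shifted copy of the input (zeros at the border).
shifted = correlate2d_same(image, kernel)
```

Because the filter has no learned values, the only information it carries is the position of the impulse, that is, which neighbour each output pixel copies.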
Abstract
The training of vision transformer (ViT) networks on small-scale datasets poses a significant challenge. By contrast, convolutional neural networks (CNNs) have an architectural inductive bias enabling them to perform well on such problems. In this paper, we argue that the architectural bias inherent to CNNs can be reinterpreted as an initialization bias within ViT. This insight is significant as it empowers ViTs to perform equally well on small-scale problems while maintaining their flexibility for large-scale applications. Our inspiration for this "structured" initialization stems from our empirical observation that random impulse filters can achieve comparable performance to learned filters within CNNs. Our approach achieves state-of-the-art performance for data-efficient ViT learning across numerous benchmarks including CIFAR-10, CIFAR-100, and SVHN.
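One way to picture how a convolutional bias could appear inside attention is to write an impulse filter as an attention map over patch tokens. The sketch below is an illustration only: it builds the target pattern directly, whereas how the paper initializes the attention weights to reach such a pattern is not described in this summary. The function name `impulse_attention_map` and the border clamping are assumptions made for this example.

```python
import numpy as np

def impulse_attention_map(grid_h, grid_w, dy, dx):
    """Attention matrix in which every patch attends only to the patch at
    offset (dy, dx). Offsets that leave the grid are clamped to the border
    (an assumption of this sketch) so that every row still sums to 1."""
    n = grid_h * grid_w
    attn = np.zeros((n, n))
    for i in range(grid_h):
        for j in range(grid_w):
            ti = min(max(i + dy, 0), grid_h - 1)
            tj = min(max(j + dx, 0), grid_w - 1)
            attn[i * grid_w + j, ti * grid_w + tj] = 1.0
    return attn

rng = np.random.default_rng(0)
grid_h, grid_w = 4, 4                          # 4 x 4 grid of patch tokens
dy, dx = (int(v) for v in rng.integers(-1, 2, size=2))  # offset in a 3 x 3 window
attn = impulse_attention_map(grid_h, grid_w, dy, dx)

tokens = rng.normal(size=(grid_h * grid_w, 8)) # 16 tokens, 8 features each
mixed = attn @ tokens                          # each token copies one neighbour
```

Each row of `attn` is a one-hot distribution, so multiplying by it performs the same spatial shift as an impulse filter. Unlike a fixed convolution, an attention map that merely starts in this pattern remains free to change during training.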
Peer Reviews
Decision · Submitted to ICLR 2025
1. Theoretical Foundation: The structured initialization method is based on solid theoretical analysis rather than just empirical results, providing a strong rationale for its effectiveness.
2. Performance Improvements: The method consistently shows significant performance improvements over conventional ViT initialization methods in small-scale datasets, which is a notable achievement.
1. In terms of innovation, the Transformer architecture was initially designed to minimize inductive bias. The authors' attempt to incorporate structural biases from CNNs into the Transformer seems to go against the original intent of the Transformer design, which could be seen as a step backward for the evolution of Transformer models.
2. The variety of experimental backbones is somewhat limited. It would be beneficial to conduct experiments with DeiT or Swin Transformer to compare results. Fu
- The paper is easy to follow.
- While much prior work has tried to introduce convolutional designs into ViT models, this paper offers an interesting viewpoint: initializing the attention maps to resemble CNN filters can also introduce the inductive bias and subsequently improve the performance of trained ViTs on small-scale datasets.
- A theoretical explanation is provided to show the connection between the structured initialization in ViT and the inductive bias in CNNs.
- Some special designs like m
- The fundamental approach of forcing attention maps to mimic convolutional kernels seems to contradict the core advantage of attention mechanisms, as their advantage is to learn flexible, dynamic global relationships. It would be better to justify why structured initialization is preferred over simply incorporating convolutional blocks into the architecture, which would be a more straightforward solution.
- It would be better to provide more analysis of why this approach is better compared to
Structured architecture achieves good performance across both small and large datasets, which demonstrates its scalability and flexibility.
1. The core argument of the method is that the convolutional structure can be transferred to the attention mechanism in transformers by initializing the attention maps with random impulse filters. However, this analogy between convolutional layers in CNNs and the attention mechanism in ViTs may be overly simplistic. CNNs' convolutional filters are spatially local and fixed in structure, while attention in ViTs is meant to capture long-range dependencies and is more flexible. This difference is c
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Layer Normalization · Multi-Head Attention · Dense Connections · Vision Transformer
