Early Convolutions Help Transformers See Better
Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick

TL;DR
Replacing the patchify stem in Vision Transformers with a lightweight convolutional stem significantly improves training stability and accuracy across various models and datasets, addressing optimization challenges inherent in the original design.
Contribution
This work demonstrates that a simple convolutional stem enhances ViT optimization and performance, providing a robust architectural modification over the original patchify approach.
Findings
Convolutional stem improves ViT training stability.
Convolutional stem increases top-1 accuracy by roughly 1-2% on ImageNet-1k.
Performance gains are consistent across model sizes and datasets.
Abstract
Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p=16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions. While the vast…
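To make the two stem designs concrete, the sketch below compares the spatial size each one produces, using the standard convolution output-size formula. It is only an illustration: the 224-pixel input, the choice of four stacked layers, and padding of 1 are assumptions made here so that both stems downsample by 16×; the abstract says only "a small number" of stride-two 3×3 convolutions. The function names are hypothetical.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def patchify_stem_size(size: int, p: int = 16) -> int:
    # Patchify stem: a single stride-p, p x p convolution with no padding.
    return conv_out_size(size, kernel=p, stride=p, padding=0)


def conv_stem_size(size: int, num_layers: int = 4) -> int:
    # Convolutional stem: stacked stride-2 3x3 convolutions.
    # num_layers=4 and padding=1 are assumptions for this sketch.
    for _ in range(num_layers):
        size = conv_out_size(size, kernel=3, stride=2, padding=1)
    return size


image_size = 224  # assumed input resolution
print(patchify_stem_size(image_size))  # 14
print(conv_stem_size(image_size))      # 14 (224 -> 112 -> 56 -> 28 -> 14)
```

Under these assumptions both stems hand the transformer a 14×14 grid of tokens. The difference is that the convolutional stem reaches it through several small, overlapping steps rather than one large non-overlapping one.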
Taxonomy
Topics: Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI) · Cell Image Analysis Techniques
Methods: Convolution
