Early Convolutions Help Transformers See Better

Tete Xiao; Mannat Singh; Eric Mintun; Trevor Darrell; Piotr Dollár; Ross Girshick

arXiv:2106.14881 · cs.CV · October 27, 2021 · 352 citations

TL;DR

Replacing the patchify stem in Vision Transformers with a lightweight convolutional stem significantly improves training stability and accuracy across various models and datasets, addressing optimization challenges inherent in the original design.

Contribution

This work demonstrates that a simple convolutional stem enhances ViT optimization and performance, providing a robust architectural modification over the original patchify approach.

Findings

01 A convolutional stem improves ViT training stability.

02 A convolutional stem increases top-1 accuracy by about 1-2 percentage points.

03 The performance gains are consistent across model sizes and datasets.

Abstract

Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p=16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions. While the vast…
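The two stems described in the abstract can be compared with simple shape arithmetic. The sketch below is illustrative only: it uses the standard convolution output-size formula and assumes a 224×224 input, a padding of 1 for the 3×3 convolutions, and four stacked stride-two layers (chosen here so that the total downsampling matches p=16). The function names are hypothetical, and none of these specifics are taken from the abstract.

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def patchify_stem_size(image_size, p=16):
    """Patchify stem: a single stride-p, p x p convolution with no padding."""
    return conv_out_size(image_size, kernel=p, stride=p)


def conv_stem_size(image_size, num_layers=4, padding=1):
    """Convolutional stem: stacked stride-two 3x3 convolutions.

    num_layers=4 and padding=1 are assumptions made for this sketch so that
    the overall downsampling factor is 2**4 = 16, matching p=16.
    """
    size = image_size
    for _ in range(num_layers):
        size = conv_out_size(size, kernel=3, stride=2, padding=padding)
    return size


# Assumed input resolution; each output position becomes one transformer token.
image_size = 224
patch_tokens = patchify_stem_size(image_size) ** 2
conv_tokens = conv_stem_size(image_size) ** 2
print(patch_tokens, conv_tokens)
```

Under these assumptions both stems reduce the image to the same 14×14 grid, so the transformer receives the same number of tokens either way. The difference lies in how the downsampling is done: one large-kernel, large-stride step versus several small-kernel, stride-two steps.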

Code & Models

Repositories

Jack-Etheredge/early_convolutions_vit_pytorch
pytorch

Videos

Early Convolutions Help Transformers See Better · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI) · Cell Image Analysis Techniques

Methods: Convolution