Better plain ViT baselines for ImageNet-1k
Lucas Beyer, Xiaohua Zhai, Alexander Kolesnikov

TL;DR
A few simple changes to the vanilla Vision Transformer (ViT) training setup substantially improve ImageNet-1k accuracy. With only standard data augmentation and no sophisticated regularization, a plain ViT reaches accuracy similar to the classic ResNet50 baseline.
Contribution
It introduces minor but effective training modifications that improve plain ViT models without sophisticated regularization, setting new strong baselines.
Findings
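90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8
300 epochs of training reach 80% top-1 accuracy in less than one day
Plain ViT reaches accuracy similar to the classic ResNet50 baseline using only standard data augmentation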
Abstract
It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach 80% in less than one day.
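To put the training times quoted in the abstract in perspective, the following back-of-the-envelope sketch estimates the minimum throughput they imply. It assumes the standard ImageNet-1k training split of 1,281,167 images (a figure not stated in the abstract) and ignores evaluation and startup overhead; the function name `min_throughput` is hypothetical and used only for illustration.

```python
# Assumption: size of the standard ImageNet-1k (ILSVRC-2012) training split.
IMAGENET_1K_TRAIN_IMAGES = 1_281_167


def min_throughput(epochs: int, max_hours: float,
                   dataset_size: int = IMAGENET_1K_TRAIN_IMAGES) -> float:
    """Lower bound on images/second needed to finish `epochs` within `max_hours`."""
    total_images = epochs * dataset_size
    return total_images / (max_hours * 3600)


if __name__ == "__main__":
    # (epochs, time budget in hours) taken from the abstract:
    # 90 epochs in under 7 hours, 300 epochs in less than one day.
    for epochs, hours in [(90, 7), (300, 24)]:
        rate = min_throughput(epochs, hours)
        print(f"{epochs} epochs in {hours} h -> at least {rate:.0f} images/s")
```

Under these assumptions, both budgets work out to a bit under 4,600 images per second across the whole accelerator. Because the abstract gives only upper bounds on wall-clock time, these figures are lower bounds on throughput, not measured values.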
Taxonomy
Topics: Advanced Neural Network Applications · COVID-19 diagnosis using AI · Medical Image Segmentation Techniques
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Multi-Head Attention · Layer Normalization · Residual Connection · Softmax · Label Smoothing · Adam · Position-Wise Feed-Forward Layer
