Better plain ViT baselines for ImageNet-1k

Lucas Beyer; Xiaohua Zhai; Alexander Kolesnikov

arXiv:2205.01580 · cs.CV · May 4, 2022 · 49 cites

TL;DR

Simple modifications to the vanilla Vision Transformer (ViT) training setup substantially improve ImageNet-1k performance, bringing plain ViT on par with the classic ResNet50 baseline without sophisticated regularization.

Contribution

The paper introduces a few minor training modifications that improve plain ViT models without sophisticated regularization, providing stronger baselines for ImageNet-1k.

Findings

01. 90 epochs of training achieve over 76% accuracy
02. 300 epochs of training reach 80% accuracy
03. Plain ViT matches ResNet50 performance with simple training tweaks

Abstract

It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach 80% in less than one day.
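
As a rough sanity check on the timings quoted in the abstract, the sketch below computes the minimum sustained throughput they imply. It assumes every epoch covers the full ImageNet-1k training set of 1,281,167 images, which the abstract does not state; the name `min_throughput` is illustrative and does not come from the paper.

```python
# Assumption: one epoch = the full ImageNet-1k training set (1,281,167 images).
TRAIN_IMAGES = 1_281_167


def min_throughput(epochs: int, hours: float, train_images: int = TRAIN_IMAGES) -> float:
    """Lower bound on images/second needed to finish `epochs` within `hours`."""
    return epochs * train_images / (hours * 3600)


if __name__ == "__main__":
    # (epochs, wall-clock budget in hours) as quoted in the abstract
    for epochs, hours in [(90, 7), (300, 24)]:
        rate = min_throughput(epochs, hours)
        print(f"{epochs} epochs in under {hours} h: at least {rate:.0f} images/s")
```

Under this assumption both budgets work out to roughly 4,500 images per second on the TPUv3-8, so the 90-epoch and 300-epoch figures are consistent with each other.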

Taxonomy

Topics: Advanced Neural Network Applications · COVID-19 diagnosis using AI · Medical Image Segmentation Techniques

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Multi-Head Attention · Layer Normalization · Residual Connection · Softmax · Label Smoothing · Adam · Position-Wise Feed-Forward Layer