DeiT III: Revenge of the ViT

Hugo Touvron; Matthieu Cord; Hervé Jégou

arXiv:2204.07118·cs.CV·April 15, 2022

TL;DR

This paper revisits supervised training of Vision Transformers. It introduces a simplified training recipe, including a data-augmentation procedure with only three augmentations, that outperforms previous fully supervised ViT recipes by a large margin and gives a stronger baseline for self-supervised methods.

Contribution

It presents a new, simple supervised training procedure for ViT that outperforms previous supervised recipes and brings ViT performance on par with more recent architectures.

Findings

01  Outperforms previous supervised ViT training recipes.

02  Achieves performance comparable to recent architectures.

03  Provides better baselines for self-supervised ViT approaches.

Abstract

A Vision Transformer (ViT) is a simple neural architecture well suited to several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors about either the input data or specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new, simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. Our evaluations on image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning, and semantic segmentation show that our procedure outperforms, by a large margin, previous fully supervised training…
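The "only 3 augmentations" idea can be illustrated with a short sketch. In the paper, the three choices are grayscale, solarization, and Gaussian blur, and one of them is picked uniformly at random for each image. The code below is not the authors' implementation: the function names, the pure-Python image representation (rows of `(r, g, b)` tuples), the solarization threshold of 128, and the 3x3 binomial kernel standing in for a Gaussian blur are all assumptions made to keep the example self-contained. The color jitter and horizontal flip that usually accompany the recipe are omitted.

```python
import random

# Assumption: an image is a list of rows, each pixel an (r, g, b) tuple of ints in 0..255.

def grayscale(img):
    """Replace each pixel by its luma (BT.601 weights), kept as three equal channels."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            y = int(round(0.299 * r + 0.587 * g + 0.114 * b))
            new_row.append((y, y, y))
        out.append(new_row)
    return out

def solarize(img, threshold=128):
    """Invert every channel value at or above the (assumed) threshold."""
    return [
        [tuple(255 - c if c >= threshold else c for c in px) for px in row]
        for row in img
    ]

def blur(img):
    """3x3 binomial blur with edge clamping; a small stand-in for Gaussian blur."""
    h, w = len(img), len(img[0])
    weights = (1, 2, 1)  # outer product sums to 16
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = [0, 0, 0]
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp to image border
                    xx = min(max(x + dx, 0), w - 1)
                    k = weights[dy + 1] * weights[dx + 1]
                    for c in range(3):
                        acc[c] += k * img[yy][xx][c]
            row.append(tuple(a // 16 for a in acc))
        out.append(row)
    return out

def three_augment(img, rng=random):
    """Apply exactly one of the three transforms, chosen uniformly at random."""
    op = rng.choice((grayscale, solarize, blur))
    return op(img)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is convenient when comparing runs. In a real training pipeline, the same selection logic would typically wrap library transforms operating on tensors or PIL images rather than nested lists.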

Taxonomy

Methods: Attention Is All You Need · Linear Layer · FixRes · 3-Augment · LayerScale · Adam · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections