TL;DR
This paper revisits supervised training of Vision Transformers, introducing a simplified data-augmentation recipe that significantly improves performance and serves as a strong baseline for future self-supervised methods.
Contribution
It presents a new, simple supervised training procedure for ViT that outperforms previous supervised recipes and brings ViT performance on par with more recent architectures.
Findings
Outperforms previous supervised ViT training recipes.
Achieves performance comparable to recent architectures.
Provides better baselines for self-supervised ViT approaches.
Abstract
A Vision Transformer (ViT) is a simple neural architecture that can serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or about specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. Our evaluations on image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning and semantic segmentation show that our procedure outperforms by a large margin previous fully supervised training…
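The abstract does not name the three augmentations. As a rough illustration, the sketch below assumes they are grayscale, solarization, and Gaussian blur, with exactly one picked uniformly at random per image. To stay dependency-free it treats an image as a nested list of RGB tuples and substitutes a 3x3 mean filter for the Gaussian blur; all function names are hypothetical and this is not the authors' implementation.

```python
import random

# Assumption: an image is a list of rows, each row a list of (r, g, b)
# tuples with channel values in 0..255.

def grayscale(img):
    """Replace each pixel by its luma, replicated over three channels."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out

def solarize(img, threshold=128):
    """Invert every channel value at or above the threshold."""
    return [
        [tuple(255 - c if c >= threshold else c for c in px) for px in row]
        for row in img
    ]

def mean_blur(img):
    """3x3 mean filter with edge clamping (crude stand-in for Gaussian blur)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        new_row = []
        for x in range(w):
            neighbours = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            new_row.append(tuple(sum(p[c] for p in neighbours) // 9 for c in range(3)))
        out.append(new_row)
    return out

def three_augment(img, rng=random):
    """Apply exactly one of the three assumed augmentations, chosen uniformly."""
    op = rng.choice([grayscale, solarize, mean_blur])
    return op(img)
```

In a real training pipeline the same pick-one-of-three pattern would be expressed with an image library's own transforms; the point here is only that each image receives a single augmentation from a small set rather than a long chain.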
Code & Models
- 🤗 kadirnar/timm_model_list (model) · ♡ 1
- 🤗 timm/deit3_base_patch16_224.fb_in1k (model) · 2.8k dl
- 🤗 timm/deit3_base_patch16_224.fb_in22k_ft_in1k (model) · 549 dl · ♡ 1
- 🤗 timm/deit3_base_patch16_384.fb_in1k (model) · 699 dl
- 🤗 timm/deit3_base_patch16_384.fb_in22k_ft_in1k (model) · 938 dl
- 🤗 timm/deit3_huge_patch14_224.fb_in1k (model) · 230 dl
- 🤗 timm/deit3_huge_patch14_224.fb_in22k_ft_in1k (model) · 138 dl
- 🤗 timm/deit3_large_patch16_224.fb_in1k (model) · 559 dl
- 🤗 timm/deit3_large_patch16_224.fb_in22k_ft_in1k (model) · 526 dl · ♡ 1
- 🤗 timm/deit3_large_patch16_384.fb_in1k (model) · 89 dl
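The checkpoint names in this list follow a regular pattern: model family, size, patch size, input resolution, and a tag after the dot. The sketch below splits a name into those parts. The helper `parse_checkpoint` and its field names are hypothetical, and reading `in22k` in the tag as "pre-trained on the larger ImageNet set before fine-tuning on ImageNet-1k" is an assumption based on the tag text, not something stated on this page.

```python
import re

# Pattern assumed from the names listed above, e.g.
# "deit3_base_patch16_224.fb_in22k_ft_in1k".
_NAME = re.compile(
    r"^(?P<family>[a-z0-9]+)_(?P<size>[a-z]+)"
    r"_patch(?P<patch>\d+)_(?P<res>\d+)\.(?P<tag>\w+)$"
)

def parse_checkpoint(name):
    """Split a 'timm/<model>.<tag>' identifier into its components."""
    model_id = name.split("/", 1)[-1]  # drop the "timm/" owner prefix if present
    match = _NAME.match(model_id)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {name!r}")
    tag = match["tag"]
    return {
        "family": match["family"],
        "size": match["size"],
        "patch_size": int(match["patch"]),
        "resolution": int(match["res"]),
        "tag": tag,
        # Assumption: "in22k" in the tag marks large-scale pre-training.
        "in22k_pretrained": "in22k" in tag,
    }

info = parse_checkpoint("timm/deit3_base_patch16_224.fb_in22k_ft_in1k")
print(info["size"], info["patch_size"], info["resolution"], info["in22k_pretrained"])
```

With the `timm` package installed, the part after `timm/` can be passed to `timm.create_model(..., pretrained=True)` to download the corresponding weights.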
Taxonomy
Methods: Attention Is All You Need · Linear Layer · FixRes · 3-Augment · LayerScale · Adam · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections
