TL;DR
This paper asks which properties emerge in self-supervised Vision Transformers (ViTs) that do not emerge as clearly in supervised ViTs or convnets. It finds that their features carry explicit semantic segmentation information and work well as k-NN classifiers, and it introduces DINO, a simple self-supervised method interpreted as self-distillation with no labels.
Contribution
The paper shows that self-supervised ViT features encode explicit semantic segmentation information and introduces DINO, a self-distillation method that uses no labels and reaches 80.1% top-1 accuracy on ImageNet in linear evaluation with ViT-Base.
Findings
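Self-supervised ViT features contain explicit semantic segmentation information, which does not emerge as clearly with supervised ViTs or with convnets.
Self-supervised ViT features reach 78.3% top-1 accuracy on ImageNet as k-NN classifiers with a small ViT.
DINO with ViT-Base reaches 80.1% top-1 accuracy on ImageNet in linear evaluation.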
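The k-NN finding refers to classifying an image by looking up its nearest neighbours among frozen features, with no classifier trained on top. The following is a minimal sketch of that idea, assuming cosine similarity and a plain majority vote; the paper's exact voting scheme and value of k are not given on this page, and the function names and toy feature vectors below are hypothetical.

```python
import math
from collections import Counter


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def knn_predict(query, bank_features, bank_labels, k=3):
    """Label a query feature by majority vote over its k most similar bank features.

    Assumption: unweighted voting over cosine similarity, chosen for illustration.
    """
    q = l2_normalize(query)
    scored = []
    for feature, label in zip(bank_features, bank_labels):
        f = l2_normalize(feature)
        similarity = sum(a * b for a, b in zip(q, f))
        scored.append((similarity, label))
    # Most similar entries first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "features" standing in for frozen ViT embeddings (hypothetical data).
bank_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
bank_labels = ["cat", "cat", "dog", "dog"]

prediction = knn_predict([0.95, 0.05], bank_features, bank_labels, k=3)
```

In practice the feature bank would hold embeddings of the labelled training images and the query would be the embedding of a validation image, both produced by the same frozen backbone.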
Abstract
In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO…
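The abstract names two ingredients, a momentum encoder and self-distillation with no labels, without detailing them. A rough sketch of how such a setup is commonly written is shown below: a student is trained to match a teacher's output distribution, and the teacher's weights are an exponential moving average of the student's. The temperatures, the momentum value, the centering step, and all function names here are illustrative assumptions, not details taken from this page.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions.

    Assumption: the teacher output is centered and sharpened with a lower
    temperature; the teacher is treated as a fixed target (no gradient).
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Momentum-encoder update: move teacher weights slowly toward the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Toy usage with two output dimensions (hypothetical values).
center = [0.0, 0.0]
aligned = distillation_loss([2.0, 0.0], [2.0, 0.0], center)
opposed = distillation_loss([0.0, 2.0], [2.0, 0.0], center)
new_teacher = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.5)
```

The loss is small when the student agrees with the teacher and large when it disagrees, which is the signal that drives training when no labels are available.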
Code & Models
- 🤗 facebook/dino-vitb16 · model · 390k dl · ♡ 111
- 🤗 facebook/dino-vitb8 · model · 1.8k dl · ♡ 18
- 🤗 facebook/dino-vits16 · model · 86k dl · ♡ 16
- 🤗 facebook/dino-vits8 · model · 2.7k dl · ♡ 16
- 🤗 probing-vits/vit-dino-base16 · model · ♡ 3
- 🤗 Ramos-Ramos/dino-resnet-50 · model · 106 dl · ♡ 1
- 🤗 Ramos-Ramos/vicreg-resnet-50 · model · 448 dl
- 🤗 timm/vit_base_patch8_224.dino · model · 20k dl · ♡ 2
- 🤗 timm/vit_base_patch16_224.dino · model · 81k dl · ♡ 6
- 🤗 timm/vit_small_patch8_224.dino · model · 7.7k dl · ♡ 2
Videos
DINO: Emerging Properties in Self-Supervised Vision Transformers (Facebook AI Research Explained) · YouTube
The Dangerous Illusion of AI Coding? - Jeremy Howard · YouTube
Facebook AI's DINO | PyTorch Code Explained · YouTube
DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained! · YouTube
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · self-DIstillation with NO labels · Residual Connection · Softmax · Multi-Head Attention · Byte Pair Encoding · Layer Normalization
