Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron; Hugo Touvron; Ishan Misra; Hervé Jégou; Julien Mairal; Piotr Bojanowski; Armand Joulin

arXiv:2104.14294·cs.CV·May 25, 2021

TL;DR

This paper studies the properties that emerge when Vision Transformers (ViTs) are trained with self-supervision: their features carry explicit semantic segmentation information and perform well as k-NN classifiers. It also introduces DINO, a simple self-supervised method interpreted as self-distillation with no labels.

Contribution

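The paper shows that self-supervised ViT features encode explicit semantic segmentation information, which does not emerge as clearly with supervised ViTs or with convnets, and introduces DINO, a self-distillation approach with no labels that reaches 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.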

Findings

01

Self-supervised ViTs contain explicit semantic segmentation information.

02

Self-supervised ViT features reach 78.3% top-1 accuracy on ImageNet with a k-NN classifier and a small ViT.

03

DINO improves ViT-Base linear evaluation to 80.1% top-1 accuracy.
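
The k-NN evaluation in finding 02 classifies an image by comparing its frozen features against a bank of labeled features, with no additional training. The following is a minimal sketch of that idea, not the authors' evaluation code: the function names, the toy 2-D features, the cosine similarity measure, and the similarity-weighted vote are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def knn_predict(query, bank_features, bank_labels, k=3):
    """Predict a label for `query` from its k most similar bank features.

    Hypothetical helper: each of the k nearest neighbors votes for its
    label with a weight equal to its cosine similarity to the query.
    """
    q = l2_normalize(query)
    scored = []
    for feature, label in zip(bank_features, bank_labels):
        f = l2_normalize(feature)
        similarity = sum(a * b for a, b in zip(q, f))
        scored.append((similarity, label))
    # Keep the k neighbors with the highest similarity.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return max(votes, key=votes.get)


# Toy feature bank standing in for features extracted by a frozen ViT.
bank = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = ["cat", "cat", "dog", "dog"]
print(knn_predict([0.8, 0.3], bank, labels, k=3))
```

Because the backbone stays frozen and no classifier is fitted, accuracy under this protocol reflects the quality of the features themselves.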

Abstract

In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO…
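
The abstract describes DINO as self-distillation with no labels and highlights the momentum encoder. A minimal sketch of those two ingredients follows, under stated assumptions: a student is trained to match the output distribution of a teacher, and the teacher's weights are an exponential moving average of the student's. The function names, the temperature and momentum defaults, and the centering of teacher outputs are illustrative choices not given on this page, and parameters are represented as plain lists of floats rather than network weights.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's and the student's distributions.

    Assumed details: teacher logits are centered and sharpened with a
    lower temperature, and the teacher target is treated as a constant
    (no gradient would flow through it in a real training loop).
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    target = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(target, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Momentum encoder: move teacher weights slowly toward the student's."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Two views of the same image: the teacher sees one, the student the other.
teacher_out = [2.0, 0.0, 0.0]
center = [0.0, 0.0, 0.0]
print(distillation_loss([2.0, 0.0, 0.0], teacher_out, center))  # student agrees
print(distillation_loss([0.0, 2.0, 0.0], teacher_out, center))  # student disagrees
print(ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9))
```

The loss is small when the student agrees with the teacher and large when it does not. Since the teacher is never trained directly and only follows the student through the moving average, no labels are needed at any point.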

Code & Models

Videos

DINO: Emerging Properties in Self-Supervised Vision Transformers (Facebook AI Research Explained)· youtube

The Dangerous Illusion of AI Coding? - Jeremy Howard· youtube

Facebook AI's DINO | PyTorch Code Explained· youtube

DINO: Emerging Properties in Self-Supervised Vision Transformers | Paper Explained!· youtube

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · self-DIstillation with NO labels · Residual Connection · Softmax · Multi-Head Attention · Byte Pair Encoding · Layer Normalization