DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab; Timothée Darcet; Théo Moutakanni; Huy Vo; Marc Szafraniec; Vasil Khalidov; Pierre Fernandez; Daniel Haziza; Francisco Massa; Alaaeldin El-Nouby; Mahmoud Assran; Nicolas Ballas; Wojciech Galuba; Russell Howes; Po-Yao Huang; Shang-Wen Li; Ishan Misra; Michael Rabbat; Vasu Sharma; Gabriel Synnaeve; Hu Xu; Hervé Jegou; Julien Mairal; Patrick Labatut; Armand Joulin; Piotr Bojanowski

arXiv:2304.07193 · cs.CV · February 5, 2024 · 1.0k cites

Open Access · 5 Repos · 10 Models · 1 Dataset

TL;DR

This paper introduces DINOv2, a self-supervised vision model trained on a large, curated dataset, producing robust, all-purpose visual features that outperform existing models across various tasks without fine-tuning.

Contribution

It presents a scalable self-supervised training approach with a curated dataset and a large ViT model, achieving state-of-the-art all-purpose visual features.

Findings

01. Large-scale self-supervised training improves feature robustness.
02. DINOv2 surpasses previous models like OpenCLIP on multiple benchmarks.
03. Curated datasets enhance training stability and model performance.

Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as…

Code & Models

Datasets

milosvuk/GANcMRI (dataset · 95 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Adam · 1-bit Adam