DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra

TL;DR
This paper introduces DINOv2, a self-supervised vision model trained on a large, curated dataset, producing robust, all-purpose visual features that outperform existing models across various tasks without fine-tuning.
Contribution
It presents a scalable self-supervised training approach with a curated dataset and a large ViT model, achieving state-of-the-art all-purpose visual features.
Findings
- Large-scale self-supervised training improves feature robustness.
- DINOv2 surpasses previous models like OpenCLIP on multiple benchmarks.
- Curated datasets enhance training stability and model performance.
Abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as…
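The phrase "without finetuning" in the abstract usually means the backbone stays frozen and only a small head is trained on its outputs. The sketch below illustrates that idea with a linear probe. It is not the paper's evaluation code: `frozen_features` is a hypothetical stand-in (a fixed random projection) for a pretrained backbone, the data is synthetic, and the probe is a closed-form ridge regression chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_features(images, proj):
    """Hypothetical stand-in for a frozen backbone: fixed projection + tanh.
    Nothing in here is ever updated during probe training."""
    flat = images.reshape(len(images), -1)
    return np.tanh(flat @ proj)

def fit_linear_probe(feats, labels, num_classes, l2=1e-3):
    """Fit a linear head by ridge regression onto one-hot targets.
    These weights are the only learned parameters."""
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append bias column
    Y = np.eye(num_classes)[labels]                   # one-hot targets
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)

def predict(feats, W):
    """Predict the class with the largest linear score."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return (X @ W).argmax(axis=1)

def make_split(n):
    """Synthetic two-class 8x8 'images' whose pixel means differ by class."""
    labels = rng.integers(0, 2, size=n)
    means = np.where(labels == 1, 0.5, -0.5)[:, None, None]
    images = means + rng.normal(size=(n, 8, 8))
    return images, labels

# Fixed "backbone" weights (assumption: 64 input pixels -> 32 features).
proj = rng.normal(scale=1 / 8, size=(64, 32))

train_x, train_y = make_split(200)
test_x, test_y = make_split(200)

W = fit_linear_probe(frozen_features(train_x, proj), train_y, num_classes=2)
acc = (predict(frozen_features(test_x, proj), W) == test_y).mean()
print(f"linear-probe accuracy on frozen features: {acc:.2f}")
```

With a real pretrained model, only `frozen_features` would change; the probe itself would stay the same.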
Code & Models
- 🤗 facebook/dinov2-base · model · 1.1M dl · ♡ 174
- 🤗 facebook/dinov2-large · model · 1.5M dl · ♡ 103
- 🤗 facebook/dinov2-giant · model · 215k dl · ♡ 59
- 🤗 facebook/dinov2-small · model · 2.2M dl · ♡ 61
- 🤗 facebook/dinov2-with-registers-large · model · 113k dl · ♡ 12
- 🤗 heig-vd-geo/PTv3_GridNet-HD_baseline · model · ♡ 1
- 🤗 xtxx/Digepath · model · 14 dl · ♡ 6
- 🤗 timm/vit_base_patch14_dinov2.lvd142m · model · 1.5M dl · ♡ 9
- 🤗 timm/vit_giant_patch14_dinov2.lvd142m · model · 5.0k dl · ♡ 1
- 🤗 timm/vit_large_patch14_dinov2.lvd142m · model · 127k dl · ♡ 17
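One way to pull frozen features from a checkpoint listed above is through the Hugging Face `transformers` auto classes. The sketch below is illustrative rather than official usage. It assumes `torch` and `transformers` are installed, that the checkpoint can be downloaded, and that the checkpoint has no register tokens (for the `with-registers` variant, extra tokens follow the class token). The helper names `num_tokens` and `extract_features` are hypothetical, and the patch size of 14 is taken from the `patch14` part of the timm model names.

```python
def num_tokens(image_size: int, patch_size: int = 14, with_cls: bool = True) -> int:
    """Token count a ViT produces for a square input of side `image_size`."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side + (1 if with_cls else 0)


def extract_features(image, model_id: str = "facebook/dinov2-base"):
    """Return (class-token feature, patch-token features) for one PIL image.

    Heavy dependencies are imported lazily so that `num_tokens` can be
    used without them. Calling this function downloads the checkpoint.
    """
    import torch
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():  # frozen backbone: no gradients needed
        outputs = model(**inputs)

    hidden = outputs.last_hidden_state  # (1, 1 + num_patches, hidden_dim)
    cls_feature = hidden[:, 0]          # image-level descriptor
    patch_features = hidden[:, 1:]      # one descriptor per image patch
    return cls_feature, patch_features


# A 224x224 input with 14x14 patches gives a 16x16 grid plus one class token.
print(num_tokens(224))
```

A call such as `extract_features(Image.open("photo.jpg"))` (with `Image` from Pillow and a hypothetical file name) would return the two tensors; the class-token feature is the natural input for a linear classifier, while the patch features suit dense tasks such as segmentation.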
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Adam · 1-bit Adam
