DINOv3
Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni

TL;DR
DINOv3 is a versatile self-supervised vision model that scales effectively, introduces Gram anchoring to improve dense feature maps, and outperforms state-of-the-art models across many vision tasks without fine-tuning.
Contribution
The paper presents DINOv3, a new self-supervised learning method that leverages scaling, a novel Gram anchoring technique, and post-hoc strategies to enhance model flexibility and performance.
Findings
Outperforms specialized state-of-the-art models across various tasks.
Produces high-quality dense features for diverse vision applications.
Demonstrates effectiveness without fine-tuning across multiple datasets.
Abstract
Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images, using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our…
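The abstract names Gram anchoring but does not spell out the objective in this excerpt. As a rough illustration only, the sketch below assumes the loss is the squared Frobenius distance between the Gram matrices (pairwise cosine similarities) of L2-normalized patch features from the model being trained and from an earlier "anchor" checkpoint. The function names, the normalization, and the sum reduction are assumptions made for this example and are not taken from the text above.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row (one patch feature) to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def gram_matrix(patch_features):
    """Map (num_patches, dim) features to a (num_patches, num_patches)
    matrix of pairwise cosine similarities."""
    x = l2_normalize(patch_features)
    return x @ x.T


def gram_anchoring_loss(student_patches, anchor_patches):
    """Assumed form of the objective: squared Frobenius distance between
    the student's Gram matrix and the anchor's Gram matrix."""
    diff = gram_matrix(student_patches) - gram_matrix(anchor_patches)
    return float(np.sum(diff ** 2))


# Toy data: 16 patches with 8-dimensional features (hypothetical sizes).
rng = np.random.default_rng(0)
anchor = rng.normal(size=(16, 8))
drifted = anchor + 0.5 * rng.normal(size=(16, 8))

loss_same = gram_anchoring_loss(anchor, anchor)
loss_drifted = gram_anchoring_loss(drifted, anchor)
```

Under this assumed form, the loss constrains only the similarity structure between patches, not the features themselves: rotating every patch feature by the same orthogonal matrix leaves the Gram matrix, and therefore the loss, unchanged.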
Code & Models
- 🤗 facebook/dinov3-vitl16-pretrain-lvd1689m · model · 765k dl · ♡ 206
- 🤗 facebook/dinov3-vit7b16-pretrain-lvd1689m · model · 30k dl · ♡ 221
- 🤗 facebook/dinov3-vits16-pretrain-lvd1689m · model · 338k dl · ♡ 79
- 🤗 facebook/dinov3-vitb16-pretrain-lvd1689m · model · 1.2M dl · ♡ 113
- 🤗 facebook/dinov3-vith16plus-pretrain-lvd1689m · model · 158k dl · ♡ 50
- 🤗 facebook/dinov3-convnext-small-pretrain-lvd1689m · model · 18k dl · ♡ 24
- 🤗 facebook/dinov3-convnext-large-pretrain-lvd1689m · model · 5.6k dl · ♡ 19
- 🤗 facebook/dinov3-convnext-base-pretrain-lvd1689m · model · 42k dl · ♡ 14
- 🤗 facebook/dinov3-vits16plus-pretrain-lvd1689m · model · 31k dl · ♡ 14
- 🤗 facebook/dinov3-convnext-tiny-pretrain-lvd1689m · model · 94k dl · ♡ 34
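
One possible way to pull frozen features from the checkpoints listed above is through the Hugging Face `transformers` auto classes. This is a sketch under assumptions, not an official recipe: it assumes an installed `transformers` release recent enough to include DINOv3 support, a working `torch` install, and Hub access to the checkpoint (which may require accepting the model license first). The helper name `extract_features` and the default checkpoint are hypothetical choices for this example.

```python
# Checkpoint ids copied from the list above.
MODEL_IDS = [
    "facebook/dinov3-vitl16-pretrain-lvd1689m",
    "facebook/dinov3-vit7b16-pretrain-lvd1689m",
    "facebook/dinov3-vits16-pretrain-lvd1689m",
    "facebook/dinov3-vitb16-pretrain-lvd1689m",
    "facebook/dinov3-vith16plus-pretrain-lvd1689m",
    "facebook/dinov3-convnext-small-pretrain-lvd1689m",
    "facebook/dinov3-convnext-large-pretrain-lvd1689m",
    "facebook/dinov3-convnext-base-pretrain-lvd1689m",
    "facebook/dinov3-vits16plus-pretrain-lvd1689m",
    "facebook/dinov3-convnext-tiny-pretrain-lvd1689m",
]


def extract_features(image, model_id="facebook/dinov3-vits16-pretrain-lvd1689m"):
    """Run one PIL image through a frozen backbone and return its
    last hidden state (hypothetical helper; no fine-tuning involved)."""
    # Local imports: listing the checkpoint ids needs neither package.
    import torch
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    inputs = processor(images=image, return_tensors="pt")
    with torch.inference_mode():
        outputs = model(**inputs)
    # The layout of this tensor depends on the backbone family (ViT vs. ConvNeXt).
    return outputs.last_hidden_state
```

Calling `extract_features(Image.open("photo.jpg"))` with a Pillow image would download the weights on first use; the file name here is a placeholder.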
