Near, far: Patch-ordering enhances vision foundation models' scene understanding
Valentinos Pariza, Mohammadreza Salehi, Gertjan Burghouts, Francesco Locatello, Yuki M. Asano

TL;DR
NeCo introduces a self-supervised patch-level neighbor consistency loss that enhances vision foundation models' scene understanding by leveraging differentiable sorting on pretrained features, leading to state-of-the-art results.
Contribution
The paper proposes NeCo, a novel dense self-supervised training method using differentiable sorting to improve feature representations in vision models.
Findings
Achieved +5.5% and +6% in non-parametric in-context semantic segmentation.
Improved linear segmentation accuracy by +7.2% and +5.7%.
Enhanced 3D multi-view consistency by over 1.5%.
Abstract
We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training loss that enforces patch-level nearest neighbor consistency across a student and teacher model. Compared to contrastive approaches that only yield binary learning signals, i.e., 'attract' and 'repel', this approach benefits from the more fine-grained learning signal of sorting spatially dense features relative to reference patches. Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. This method generates high-quality dense feature encoders and establishes several new state-of-the-art results such as +5.5% and +6% for non-parametric in-context…
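The idea of comparing patch orderings between a student and a teacher can be illustrated with a small sketch. This is not the authors' implementation: it swaps in a simple pairwise-sigmoid soft rank as a stand-in for the differentiable sorting the abstract mentions, and it uses a squared-error penalty between the two rankings, which is an assumption. The names `cosine`, `soft_ranks`, `neighbor_consistency_loss`, and the temperature `tau` are hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_ranks(scores, tau=0.1):
    """Smooth ranks: rank 1 is the highest score.

    Assumption: a pairwise-sigmoid relaxation, used here in place of
    the differentiable sorting method referred to in the abstract.
    """
    ranks = []
    for i, s_i in enumerate(scores):
        rank = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                # Contributes ~1 when patch j scores higher than patch i.
                rank += _sigmoid((s_j - s_i) / tau)
        ranks.append(rank)
    return ranks


def neighbor_consistency_loss(student_feats, teacher_feats,
                              student_refs, teacher_refs, tau=0.1):
    """Mean squared gap between student and teacher soft ranks.

    For each reference patch, every patch is ranked by similarity to
    that reference, once in each network's feature space. The squared
    error is an illustrative choice, not the paper's stated loss.
    """
    total, count = 0.0, 0
    for s_ref, t_ref in zip(student_refs, teacher_refs):
        s_ranks = soft_ranks([cosine(p, s_ref) for p in student_feats], tau)
        t_ranks = soft_ranks([cosine(p, t_ref) for p in teacher_feats], tau)
        total += sum((a - b) ** 2 for a, b in zip(s_ranks, t_ranks))
        count += len(s_ranks)
    return total / count
```

Unlike a binary attract/repel signal, this penalty grows with how far each patch sits from its position in the teacher's ordering, so a student that swaps two distant neighbors is penalized more than one that swaps two adjacent ones.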
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Medical Image Segmentation Techniques
