Near, far: Patch-ordering enhances vision foundation models' scene understanding

Valentinos Pariza; Mohammadreza Salehi; Gertjan Burghouts; Francesco Locatello; Yuki M. Asano

arXiv:2408.11054·cs.CV·April 18, 2025

TL;DR

NeCo is a self-supervised, patch-level neighbor consistency loss that applies differentiable sorting on top of pretrained features to improve the scene understanding of vision foundation models, setting new state-of-the-art results in in-context and linear semantic segmentation.

Contribution

The paper proposes NeCo, a novel dense self-supervised training method using differentiable sorting to improve feature representations in vision models.

Findings

01. Improves non-parametric in-context semantic segmentation by +5.5% and +6%.

02. Improves linear segmentation accuracy by +7.2% and +5.7%.

03. Improves 3D multi-view consistency by more than 1.5%.

Abstract

We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training loss that enforces patch-level nearest neighbor consistency across a student and teacher model. Compared to contrastive approaches that only yield binary learning signals, i.e., 'attract' and 'repel', this approach benefits from the more fine-grained learning signal of sorting spatially dense features relative to reference patches. Our method leverages differentiable sorting applied on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. This method generates high-quality dense feature encoders and establishes several new state-of-the-art results such as +5.5% and +6% for non-parametric in-context…
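To make the idea of sorting patches relative to reference patches more concrete, here is a minimal sketch, not the paper's implementation. It assumes a simple pairwise-sigmoid "soft rank" as the differentiable relaxation and a squared-error comparison between student and teacher ranks; both are stand-ins chosen for brevity, and the names `soft_ranks`, `neighbor_consistency_loss`, `references`, and `tau` are hypothetical. It is written in plain Python so the arithmetic is visible; a real training loop would express the same operations in an autodiff framework.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cosine(u, v):
    # Cosine similarity between two feature vectors (assumes non-zero norms).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_ranks(scores, tau=0.1):
    # Relaxed descending rank: rank_i is roughly 1 + #{j : s_j > s_i}.
    # The hard comparison is replaced by a sigmoid, so the rank varies
    # smoothly with the scores; smaller tau approaches the hard rank.
    ranks = []
    for i, s_i in enumerate(scores):
        r = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                r += _sigmoid((s_j - s_i) / tau)
        ranks.append(r)
    return ranks


def neighbor_consistency_loss(student_patches, teacher_patches, references, tau=0.1):
    # For each reference patch, rank all patches by similarity under the
    # student and under the teacher, then penalize disagreement in the ranks.
    # (Assumption: mean squared rank difference stands in for the paper's loss.)
    total = 0.0
    for ref in references:
        s_rank = soft_ranks([cosine(p, ref) for p in student_patches], tau)
        t_rank = soft_ranks([cosine(p, ref) for p in teacher_patches], tau)
        total += sum((a - b) ** 2 for a, b in zip(s_rank, t_rank)) / len(s_rank)
    return total / len(references)
```

Unlike a binary attract/repel signal, the loss here depends on where every patch falls in the ordering, so a student that places a near neighbor third instead of first is penalized less than one that places it last.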

Code & Models

Models

FunAILab/NeCo (Hugging Face model)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Medical Image Segmentation Techniques