Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation

Ronan Docherty; Antonis Vamvakeros; Samuel J. Cooper

arXiv:2410.19836·cs.CV·August 7, 2025

TL;DR

This paper demonstrates how upsampled features from self-supervised ViT models like DINOv2 can enhance unsupervised object localization, segmentation, and materials characterization by capturing complex semantic and positional information.

Contribution

It applies upsampled ViT features in two workflows: a clustering-based approach for unsupervised object localization and segmentation, and standard classifiers for weakly supervised materials segmentation.

Findings

01 Both workflows show strong performance on object localization and segmentation benchmarks.

02 In weakly supervised segmentation, the ViT features capture complex relationships inaccessible to classical approaches.

03 Enhanced materials segmentation and property prediction capabilities.

Abstract

The features of self-supervised vision transformers (ViTs) contain strong semantic and positional information relevant to downstream tasks like object localization and segmentation. Recent works combine these features with traditional methods like clustering, graph partitioning, or region correlations to achieve impressive baselines without finetuning or training additional networks. We leverage upsampled features from ViT networks (e.g. DINOv2) in two workflows: in a clustering-based approach for object localization and segmentation, and paired with standard classifiers in weakly supervised materials segmentation. Both show strong performance on benchmarks, especially in weakly supervised segmentation, where the ViT features capture complex relationships inaccessible to classical approaches. We expect the flexibility and generalizability of these features will both speed up and strengthen…
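
The two workflows named in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: synthetic arrays stand in for DINOv2 patch features, nearest-neighbour repetition stands in for the paper's upsampling, a small k-means stands in for the clustering step, and a nearest-class-mean rule stands in for the "standard classifiers". All function names (`upsample_nearest`, `kmeans`, `fit_class_means`, `predict`) are hypothetical. The only detail taken from DINOv2 itself is its patch size of 14 pixels.

```python
import numpy as np

PATCH = 14  # DINOv2 patch size in pixels


def upsample_nearest(feats, factor):
    """(h, w, c) patch features -> (h*factor, w*factor, c) per-pixel features.

    Assumption: plain nearest-neighbour repetition, used only as a stand-in
    for a proper feature upsampler.
    """
    return feats.repeat(factor, axis=0).repeat(factor, axis=1)


def kmeans(x, k, n_iter=10):
    """Small k-means on rows of x, with deterministic farthest-point init."""
    centroids = [x[0]]
    for _ in range(k - 1):
        dist = np.min([((x - c) ** 2).sum(1) for c in centroids], axis=0)
        centroids.append(x[dist.argmax()])
    centroids = np.stack(centroids)
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centroids[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(0)
    return labels, centroids


def fit_class_means(feats, scribbles):
    """Mean feature per class from sparsely labelled pixels (-1 = unlabelled)."""
    classes = np.unique(scribbles[scribbles >= 0])
    means = np.stack([feats[scribbles == c].mean(0) for c in classes])
    return classes, means


def predict(feats, classes, means):
    """Assign every pixel to the class with the nearest mean feature."""
    dist = ((feats[..., None, :] - means) ** 2).sum(-1)
    return classes[dist.argmin(-1)]


# Synthetic stand-in for ViT patch features: a 4x4 patch grid with 8 channels,
# where the left and right halves of the image belong to different regions.
rng = np.random.default_rng(0)
patch_feats = rng.normal(0.0, 0.05, size=(4, 4, 8))
patch_feats[:, :2] += 1.0
patch_feats[:, 2:] -= 1.0

pixel_feats = upsample_nearest(patch_feats, PATCH)  # (56, 56, 8)
height, width, channels = pixel_feats.shape

# Workflow 1 (unsupervised): cluster per-pixel features into segments.
cluster_labels, _ = kmeans(pixel_feats.reshape(-1, channels), k=2)
cluster_map = cluster_labels.reshape(height, width)

# Workflow 2 (weakly supervised): a few labelled pixels train a classifier
# that is then applied to every pixel.
scribbles = -np.ones((height, width), dtype=int)
scribbles[5, 3:10] = 0     # short scribble in the left region
scribbles[40, 45:52] = 1   # short scribble in the right region
classes, means = fit_class_means(pixel_feats, scribbles)
pred = predict(pixel_feats, classes, means)
```

In this toy setting both `cluster_map` and `pred` split the image into its left and right halves. With real features, the synthetic array would be replaced by the output of a ViT backbone, and the stand-in steps by whichever upsampler, clustering method, and classifier the task calls for.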

Code & Models

Repositories

tldr-group/HR-Dv2 (PyTorch, official)

Taxonomy

Topics: Industrial Vision Systems and Defect Detection · Advanced X-ray and CT Imaging · Advanced Neural Network Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings