Deep ViT Features as Dense Visual Descriptors

Shir Amir; Yossi Gandelsman; Shai Bagon; Tali Dekel

arXiv:2112.05814 · cs.CV · October 18, 2022

TL;DR

This paper demonstrates that features from a pretrained Vision Transformer (ViT) can serve as effective dense visual descriptors, enabling various image segmentation and correspondence tasks without additional training.

Contribution

It reveals the semantic and spatial properties of ViT features and introduces simple zero-shot methods for segmentation and correspondence tasks that outperform prior unsupervised approaches.

Findings

01. ViT features encode localized semantic information.
02. Features are shared across related object categories.
03. Zero-shot methods achieve state-of-the-art results.

Abstract

We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information, at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different object categories, and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied…
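To make the idea of training-free correspondence on dense descriptors concrete, here is a minimal sketch in Python. It assumes each image has already been reduced to one descriptor vector per patch, and it matches patches across two images by mutual nearest neighbours under cosine similarity. The function names, the array shapes, and the random arrays standing in for real ViT features are all illustrative assumptions; this is not the paper's implementation and not the API of the linked repository.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so that dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) where a[i] and b[j] are each other's nearest neighbour.

    desc_a: (Na, D) array, one descriptor per patch of image A (hypothetical input).
    desc_b: (Nb, D) array, one descriptor per patch of image B (hypothetical input).
    """
    a = l2_normalize(desc_a)
    b = l2_normalize(desc_b)
    sim = a @ b.T                      # (Na, Nb) cosine similarity matrix
    nn_ab = sim.argmax(axis=1)         # best match in B for every patch of A
    nn_ba = sim.argmax(axis=0)         # best match in A for every patch of B
    idx_a = np.arange(a.shape[0])
    keep = nn_ba[nn_ab] == idx_a       # keep only matches that agree in both directions
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)


# Synthetic stand-in for real features: image B holds a shuffled, slightly
# noisy copy of image A's descriptors, so the true matching is known.
rng = np.random.default_rng(0)
desc_a = rng.normal(size=(16, 32))
perm = rng.permutation(16)
desc_b = desc_a[perm] + 0.01 * rng.normal(size=(16, 32))

pairs = mutual_nearest_neighbors(desc_a, desc_b)
```

With real data, `desc_a` and `desc_b` would be replaced by per-patch features extracted from a pretrained ViT. The mutual check is a design choice that trades recall for precision: a patch with no confident counterpart in the other image is simply left unmatched.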

Code & Models

Repositories

shiramir/dino-vit-features
pytorch

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Label Smoothing · Byte Pair Encoding · Softmax · Absolute Position Encodings · Adam · Vision Transformer