Deep ViT Features as Dense Visual Descriptors
Shir Amir, Yossi Gandelsman, Shai Bagon, Tali Dekel

TL;DR
This paper demonstrates that features from a pretrained Vision Transformer (ViT) can serve as effective dense visual descriptors, enabling various image segmentation and correspondence tasks without additional training.
Contribution
It reveals the semantic and spatial properties of ViT features and introduces simple zero-shot methods for segmentation and correspondence tasks that outperform prior unsupervised approaches.
Findings
ViT features encode localized semantic information.
Features are shared across related object categories.
Simple zero-shot methods built on these features outperform prior unsupervised approaches on segmentation and correspondence tasks.
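To make the correspondence use case concrete, here is a minimal sketch of matching descriptors across two images by mutual nearest neighbors under cosine similarity. This is an illustrative assumption about how dense descriptors could be matched without training, not a description of the authors' exact procedure. The function name `mutual_nearest_neighbors` is hypothetical, and the descriptors are assumed to have been extracted already as plain arrays with one row per image patch.

```python
import numpy as np


def mutual_nearest_neighbors(desc_a, desc_b):
    """Match two descriptor sets by mutual nearest neighbors.

    desc_a: (N, D) array, one descriptor per patch of image A.
    desc_b: (M, D) array, one descriptor per patch of image B.
    Returns a list of (i, j) pairs where patch i in A and patch j in B
    are each other's nearest neighbor under cosine similarity.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    a = desc_a / (np.linalg.norm(desc_a, axis=1, keepdims=True) + 1e-8)
    b = desc_b / (np.linalg.norm(desc_b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T  # (N, M) similarity matrix

    nn_ab = sim.argmax(axis=1)  # best match in B for each patch of A
    nn_ba = sim.argmax(axis=0)  # best match in A for each patch of B

    # Keep only pairs that agree in both directions.
    idx_a = np.arange(len(a))
    keep = nn_ba[nn_ab] == idx_a
    return [(int(i), int(nn_ab[i])) for i in idx_a[keep]]
```

Requiring agreement in both directions discards one-sided matches, which is a cheap way to filter ambiguous patches when no training signal is available.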
Abstract
We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information, at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different object categories, and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied…
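The abstract mentions clustering as one of the lightweight zero-shot tools. The sketch below illustrates the general idea under stated assumptions: per-patch descriptors from several images are pooled and clustered jointly, so that patches with similar features receive the same label across images. It assumes the descriptors are already available as (H, W, D) arrays; feature extraction from DINO-ViT is not shown. Plain k-means is used here as a stand-in for "clustering", and the helper names (`l2_normalize`, `kmeans`, `cluster_jointly`) are hypothetical, not taken from the paper or its code.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each descriptor (last axis) to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd's k-means on an (N, D) array. Returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points (fancy indexing copies).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1), centers


def cluster_jointly(descriptor_maps, k):
    """Cluster patches from several images in one shared feature space.

    descriptor_maps: list of (H, W, D) arrays, one per image
                     (H and W may differ between images, D must match).
    Returns a list of (H, W) integer label maps, one per image.
    """
    flat = [l2_normalize(m.reshape(-1, m.shape[-1])) for m in descriptor_maps]
    labels, _ = kmeans(np.concatenate(flat, axis=0), k)

    # Split the joint labeling back into one map per image.
    label_maps, start = [], 0
    for m in descriptor_maps:
        n = m.shape[0] * m.shape[1]
        label_maps.append(labels[start:start + n].reshape(m.shape[:2]))
        start += n
    return label_maps
```

Because all images share one set of cluster centers, a label means the same thing in every image, which is the property a co-segmentation method needs. Deciding which clusters count as foreground is a separate step that this sketch leaves out.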
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Label Smoothing · Byte Pair Encoding · Softmax · Absolute Position Encodings · Adam · Vision Transformer
