Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Samyak Rawlekar; Amitabh Swain; Yujun Cai; Yiwei Wang; Ming-Hsuan Yang; Narendra Ahuja

arXiv:2603.26127 · cs.CV · March 30, 2026

TL;DR

This paper reveals that object-centric information in self-supervised Vision Transformers is distributed across all layers and components, and introduces Object-DINO, a training-free method to extract this information for improved object discovery and grounding.

Contribution

The paper uncovers the distributed nature of object-centric features in ViTs and proposes Object-DINO, a novel method for extracting this information without additional training.

Findings

01

Object-centric properties are encoded in similarity maps from all three components ($q, k, v$).

02

Object-centric information is distributed across the network layers.

03

Object-DINO improves performance on unsupervised object discovery and visual grounding tasks.

Abstract

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network,…
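
The inter-patch similarity analysis mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the use of cosine similarity, the function names `patch_similarity` and `component_similarity_maps`, and the random arrays standing in for one layer's query, key, and value features are all assumptions made for illustration. In practice the features would be extracted from a self-supervised ViT, one set per layer.

```python
import numpy as np


def patch_similarity(features):
    """Cosine similarity between every pair of patch tokens.

    features: array of shape (num_patches, dim)
    returns:  array of shape (num_patches, num_patches)
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Guard against division by zero for all-zero feature vectors.
    unit = features / np.clip(norms, 1e-8, None)
    return unit @ unit.T


def component_similarity_maps(q, k, v):
    """Build one similarity map per attention component (hypothetical helper)."""
    components = (("q", q), ("k", k), ("v", v))
    return {name: patch_similarity(feat) for name, feat in components}


# Placeholder features for a single layer (assumption: 16 patches, 8 dims).
rng = np.random.default_rng(0)
num_patches, dim = 16, 8
q, k, v = (rng.standard_normal((num_patches, dim)) for _ in range(3))

maps = component_similarity_maps(q, k, v)
```

Row `i` of each map gives the similarity of patch `i` to every other patch, so reshaping a row to the patch grid yields a per-patch map that can be inspected for each component separately.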
