Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?
Yihao Li, Saeed Salehi, Lyle Ungar, Konrad P. Kording

TL;DR
This paper investigates whether large pretrained Vision Transformers naturally develop the ability to bind object features together, finding that this capability emerges through specific pretraining objectives and actively guides attention.
Contribution
It shows that object binding, a key aspect of human cognition, emerges naturally in Vision Transformers trained with certain objectives, without the explicit object-centric attention (e.g., Slot Attention) that prior work often imposes.
Findings
Object binding can be decoded with over 90% accuracy from ViT embeddings.
Emergence of object binding depends on the pretraining objective: it is stronger in DINO, CLIP, and supervised models.
Object binding signals are encoded in a low-dimensional subspace and influence attention mechanisms.
Abstract
Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We…
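To make the IsSameObject idea concrete, here is a minimal sketch of how a pairwise quadratic probe over patch embeddings might look. It is an illustration under stated assumptions, not the authors' implementation: the function names, the low-rank projection `proj`, and the zero threshold are all hypothetical, and in practice `proj` would be learned from segmentation labels rather than set by hand.

```python
import numpy as np

def pairwise_same_object_scores(patches, proj):
    """Score every patch pair with a low-rank quadratic form.

    patches: (num_patches, dim) patch embeddings from one ViT layer.
    proj:    (rank, dim) hypothetical probe projection; the score for
             patches i and j is (proj @ x_i) . (proj @ x_j).
    """
    z = patches @ proj.T   # project into the probe subspace: (num_patches, rank)
    return z @ z.T         # symmetric (num_patches, num_patches) score matrix

def same_object_targets(object_ids):
    """Binary targets: 1.0 where two patches share an object id, else 0.0."""
    ids = np.asarray(object_ids)
    return (ids[:, None] == ids[None, :]).astype(np.float64)

def probe_accuracy(scores, targets, threshold=0.0):
    """Fraction of patch pairs whose thresholded score matches the target."""
    preds = (scores > threshold).astype(np.float64)
    return float((preds == targets).mean())

# Toy data (assumption): object identity is carried by the sign of dimension 0.
patches = np.array([[1.0, 0.3], [1.0, -0.2], [-1.0, 0.5], [-1.0, 0.1]])
proj = np.array([[1.0, 0.0]])  # hand-picked rank-1 projection onto that dimension
scores = pairwise_same_object_scores(patches, proj)
targets = same_object_targets([0, 0, 1, 1])
accuracy = probe_accuracy(scores, targets)
```

The score matrix has one entry per patch pair, which mirrors the pairwise shape of self-attention; the low-rank projection is one simple way to express the claim that the signal lives in a low-dimensional subspace.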
