Human-like Object Grouping in Self-supervised Vision Transformers
Hossein Adeli, Seoyoung Ahn, Andrew Luo, Mengmi Zhang, Nikolaus Kriegeskorte, Gregory Zelinsky

TL;DR
This study evaluates how well self-supervised vision transformers align with human object perception. It finds that both the training objective (notably DINO) and the object-centric structure of a model's internal representations affect how closely the model matches human same/different object judgments.
Contribution
The paper introduces a behavioral benchmark for object perception, analyzes the impact of model architecture and training objectives, and proposes a new metric linking object-centric representations to human-like perception.
Findings
Transformer models trained with DINO show the strongest alignment with human perception.
Stronger object-centric structure in models predicts better segmentation behavior.
Matching Gram matrices across models enhances their perceptual similarity to humans.
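As a rough illustration of what "matching Gram matrices" could mean, the sketch below builds a patch-by-patch similarity (Gram) matrix from a model's patch features and measures how far two models' Gram matrices are from each other. The summary does not say how the paper computes or matches Gram matrices, so the cosine normalization, the mean-squared-difference score, and the names `gram_matrix` and `gram_mismatch` are all assumptions made for illustration.

```python
import numpy as np

def gram_matrix(patch_features):
    """Cosine-similarity Gram matrix for features of shape (num_patches, dim).

    Assumption: rows are L2-normalized first; the paper may define this differently.
    """
    norms = np.linalg.norm(patch_features, axis=1, keepdims=True)
    unit = patch_features / np.clip(norms, 1e-12, None)
    return unit @ unit.T  # shape: (num_patches, num_patches)

def gram_mismatch(features_a, features_b):
    """Hypothetical matching score: mean squared difference of two Gram matrices."""
    return float(np.mean((gram_matrix(features_a) - gram_matrix(features_b)) ** 2))

# Placeholder features with made-up sizes (16 patches); not data from the paper.
rng = np.random.default_rng(0)
student = rng.normal(size=(16, 32))
teacher = rng.normal(size=(16, 64))  # feature dims may differ; both Gram matrices are 16 x 16
mismatch = gram_mismatch(student, teacher)
```

Because the Gram matrix depends only on pairwise patch similarities, two models with different feature dimensions can still be compared, as long as they use the same patch grid.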
Abstract
Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer-based models trained with the DINO self-supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to…
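One way to picture the "simple readout" mentioned in the abstract is sketched below: look up the patch features under each of the two dots, take their cosine similarity, and correlate that similarity with reaction times across trials. The abstract does not specify the readout, so the patch lookup, the cosine similarity, the Pearson correlation, and every function name and parameter here (`dot_to_patch`, `dot_pair_similarity`, `readout_rt_correlation`, `image_size`, `grid`) are hypothetical.

```python
import numpy as np

def dot_to_patch(xy, image_size, grid):
    """Map a pixel coordinate (x, y) to a flat index in a grid x grid patch layout."""
    x, y = xy
    col = min(int(x * grid / image_size), grid - 1)
    row = min(int(y * grid / image_size), grid - 1)
    return row * grid + col

def dot_pair_similarity(patch_features, dot_a, dot_b, image_size, grid):
    """Cosine similarity between the patch features under two dots (assumed readout)."""
    fa = patch_features[dot_to_patch(dot_a, image_size, grid)]
    fb = patch_features[dot_to_patch(dot_b, image_size, grid)]
    return float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-12))

def readout_rt_correlation(similarities, reaction_times):
    """Pearson correlation between per-trial similarity and reaction time."""
    return float(np.corrcoef(similarities, reaction_times)[0, 1])

# Placeholder features for one image: 14 x 14 patches, 8-dim (made-up sizes).
rng = np.random.default_rng(0)
features = rng.normal(size=(14 * 14, 8))
sim = dot_pair_similarity(features, (30, 40), (180, 200), image_size=224, grid=14)
```

In use, `dot_pair_similarity` would be evaluated once per trial and the resulting vector passed to `readout_rt_correlation` together with the measured reaction times. The sign and size of that correlation are empirical questions that this sketch does not presuppose.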
Taxonomy
Topics: Face Recognition and Perception · Visual Attention and Saliency Detection · Advanced Neural Network Applications
