Human-like Object Grouping in Self-supervised Vision Transformers

Hossein Adeli; Seoyoung Ahn; Andrew Luo; Mengmi Zhang; Nikolaus Kriegeskorte; Gregory Zelinsky

arXiv:2603.13994 · cs.CV · March 17, 2026

TL;DR

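This study evaluates how well self-supervised vision transformers align with human object perception. It finds that both architecture and training objective affect alignment, with DINO-trained transformers matching human behavior most closely, and that stronger object-centric internal representations go with more human-like segmentation behavior.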

Contribution

The paper introduces a behavioral benchmark for object perception, analyzes the impact of model architecture and training objectives, and proposes a new metric linking object-centric representations to human-like perception.

Findings

01. Transformer models trained with DINO show the strongest alignment with human perception.

02. Stronger object-centric structure in models predicts better segmentation behavior.

03. Matching Gram matrices across models enhances their perceptual similarity to humans.
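To make the third finding more concrete, here is a minimal sketch of what comparing Gram matrices across models could look like. It assumes, hypothetically, that the Gram matrix is the patch-by-patch cosine similarity matrix of a model's patch features and that the mismatch is a mean squared difference; the summary does not specify either choice, and `gram_matrix` and `gram_distance` are illustrative names, not the authors' code.

```python
import numpy as np

def gram_matrix(patch_features):
    """Patch-by-patch cosine similarity matrix.

    patch_features: array of shape (num_patches, feature_dim).
    Assumption: the Gram matrix is taken over L2-normalised patch features.
    """
    norms = np.linalg.norm(patch_features, axis=1, keepdims=True)
    unit = patch_features / (norms + 1e-12)
    return unit @ unit.T

def gram_distance(gram_a, gram_b):
    """Mean squared difference between two Gram matrices (an assumed choice)."""
    return float(np.mean((gram_a - gram_b) ** 2))

# Synthetic stand-ins for two models with different feature widths,
# evaluated on the same 16 patches of one image.
rng = np.random.default_rng(0)
features_model_a = rng.normal(size=(16, 32))
features_model_b = rng.normal(size=(16, 64))

mismatch = gram_distance(gram_matrix(features_model_a),
                         gram_matrix(features_model_b))
print(mismatch)
```

Because the Gram matrix has shape (num_patches, num_patches) regardless of feature width, it can be compared across models whose embeddings have different dimensions, provided both use the same patch grid.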

Abstract

Vision foundation models trained with self-supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer-based models trained with the DINO self-supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to…
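The abstract does not say what the "simple readout" is. As a rough sketch only, assume it is the cosine similarity between the patch features lying under the two dots, and that this score is compared with reaction times by Pearson correlation. The function names, the square patch grid, and the synthetic data below are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def patch_index(xy, image_size, grid):
    """Map a pixel coordinate (x, y) to the flat index of the patch containing it."""
    x, y = xy
    col = min(int(x * grid / image_size), grid - 1)
    row = min(int(y * grid / image_size), grid - 1)
    return row * grid + col

def dot_pair_similarity(patch_features, dot_a, dot_b, image_size):
    """Cosine similarity between the patch features under two dots.

    patch_features: array of shape (grid * grid, feature_dim), row-major.
    Assumption: a square patch grid over a square image.
    """
    grid = int(round(np.sqrt(patch_features.shape[0])))
    fa = patch_features[patch_index(dot_a, image_size, grid)]
    fb = patch_features[patch_index(dot_b, image_size, grid)]
    denom = np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-12
    return float(fa @ fb / denom)

# Synthetic placeholder data: random "features" and random "reaction times".
rng = np.random.default_rng(0)
image_size, grid, dim, num_trials = 224, 14, 8, 50

scores, reaction_times = [], []
for _ in range(num_trials):
    features = rng.normal(size=(grid * grid, dim))
    dot_a = tuple(rng.integers(0, image_size, size=2))
    dot_b = tuple(rng.integers(0, image_size, size=2))
    scores.append(dot_pair_similarity(features, dot_a, dot_b, image_size))
    reaction_times.append(rng.normal(loc=0.8, scale=0.1))  # seconds, made up

# Pearson correlation between the readout and reaction times.
r = np.corrcoef(scores, reaction_times)[0, 1]
print(r)
```

With real data, `features` would come from a vision model's patch tokens and `reaction_times` from the behavioral benchmark. The correlation here is meaningless because both inputs are random.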

Taxonomy

Topics: Face Recognition and Perception · Visual Attention and Saliency Detection · Advanced Neural Network Applications