Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment

Soh Takahashi; Masaru Sasaki; Ken Takeda; Masafumi Oizumi

arXiv:2505.16419 · cs.CV · December 2, 2025

Open Access

TL;DR

This study uses an unsupervised alignment method to compare human and deep neural network object representations at fine-grained and coarse-grained levels. CLIP models match human representations closely at both levels, suggesting that linguistic information helps in acquiring detailed object representations, whereas self-supervised models mainly capture coarse category structure.

Contribution

Applies an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels.
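
As a rough illustration of what an unsupervised Gromov-Wasserstein alignment involves, the sketch below aligns two dissimilarity matrices without using item labels. It is a minimal NumPy version of entropic Gromov-Wasserstein with a squared loss, not the authors' implementation. The function names (`sinkhorn`, `entropic_gw`, `top1_matching_rate`), the regularization strength `eps`, the iteration counts, and the synthetic data are all assumptions made for this example. Scoring the result by a top-1 matching rate is likewise an illustrative choice for fine-grained (item-level) correspondence.

```python
import numpy as np


def sinkhorn(p, q, cost, eps, n_iter=200):
    """Entropic optimal transport between histograms p and q for a fixed cost."""
    # Shifting the cost by a constant does not change the solution but
    # keeps the exponentials in a safe numerical range.
    K = np.exp(-(cost - cost.min()) / eps)
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iter):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]


def entropic_gw(D1, D2, eps=0.05, n_outer=50):
    """Entropic Gromov-Wasserstein coupling between two dissimilarity matrices.

    Only the within-domain dissimilarities D1 and D2 are used; no
    correspondence between the items of the two domains is supplied.
    """
    D1 = D1 / D1.max()  # normalize scales so eps is comparable across inputs
    D2 = D2 / D2.max()
    n, m = D1.shape[0], D2.shape[0]
    p = np.full(n, 1.0 / n)  # uniform mass on items (an assumption)
    q = np.full(m, 1.0 / m)
    T = np.outer(p, q)  # uninformative initial coupling
    # Constant part of the squared-loss GW cost.
    const = np.outer((D1 ** 2) @ p, np.ones(m)) + np.outer(np.ones(n), (D2 ** 2) @ q)
    for _ in range(n_outer):
        cost = const - 2.0 * D1 @ T @ D2.T  # linearized GW cost at current T
        T = sinkhorn(p, q, cost, eps)
    return T


def top1_matching_rate(T):
    """Fraction of items whose largest transport mass goes to the same index.

    Assumes item i in the first domain truly corresponds to item i in the second.
    """
    return float(np.mean(np.argmax(T, axis=1) == np.arange(T.shape[0])))


# Synthetic stand-ins for "human" and "model" embeddings (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Y = X + 0.05 * rng.normal(size=X.shape)
D_human = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
D_model = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)

T = entropic_gw(D_human, D_model)
print("top-1 matching rate:", top1_matching_rate(T))
```

Because the Gromov-Wasserstein objective is non-convex, the result depends on the initial coupling and on `eps`, so in practice one would typically search over several values of both. A coarse-grained score could be obtained in the same spirit by counting a match whenever the most strongly coupled item falls in the same category, given category labels.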

Findings

01. CLIP models show strong fine- and coarse-grained alignment with human representations.

02. Self-supervised models primarily capture coarse category structures.

03. Linguistic information enhances the acquisition of detailed object representations.

Abstract

The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they have internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms - such as supervised, self-supervised, and CLIP - acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels. The unique feature of this method compared to conventional…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Face Recognition and Perception · Generative Adversarial Networks and Image Synthesis

Methods: Contrastive Language-Image Pre-training