Investigating Fine- and Coarse-grained Structural Correspondences Between Deep Neural Networks and Human Object Image Similarity Judgments Using Unsupervised Alignment
Soh Takahashi, Masaru Sasaki, Ken Takeda, Masafumi Oizumi

TL;DR
This study uses an unsupervised alignment method to compare human and deep neural network object representations at multiple levels, revealing that CLIP models closely match human representations, especially with linguistic information, while self-supervised models mainly capture coarse categories.
Contribution
Applies an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained (individual object) and coarse-grained (category) levels.
Findings
CLIP models show strong fine- and coarse-grained alignment with human representations.
Self-supervised models primarily capture coarse category structures.
Linguistic information enhances the acquisition of detailed object representations.
Abstract
The learning mechanisms by which humans acquire internal representations of objects are not fully understood. Deep neural networks (DNNs) have emerged as a useful tool for investigating this question, as they have internal representations similar to those of humans as a byproduct of optimizing their objective functions. While previous studies have shown that models trained with various learning paradigms - such as supervised, self-supervised, and CLIP - acquire human-like representations, it remains unclear whether their similarity to human representations is primarily at a coarse category level or extends to finer details. Here, we employ an unsupervised alignment method based on Gromov-Wasserstein Optimal Transport to compare human and model object representations at both fine-grained and coarse-grained levels. The unique feature of this method compared to conventional…
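The idea behind unsupervised alignment can be illustrated with a toy sketch. It is not the authors' implementation: Gromov-Wasserstein Optimal Transport optimizes a soft transport plan, whereas the sketch below assumes uniform weights and restricts the coupling to one-to-one mappings, found by brute force. The embeddings, category labels, and the function names `dissimilarity`, `gw_cost`, `align`, and `matching_rates` are all hypothetical and chosen for illustration. The key point it shows is that the mapping is found from the dissimilarity structures alone, without using object labels, and is only afterwards scored at the fine (same object) and coarse (same category) levels.

```python
from itertools import permutations

def dissimilarity(points):
    """Pairwise absolute-difference dissimilarity matrix for 1-D toy embeddings."""
    return [[abs(a - b) for b in points] for a in points]

def gw_cost(D1, D2, perm):
    """Squared-loss Gromov-Wasserstein distortion of the mapping i -> perm[i]."""
    n = len(perm)
    return sum((D1[i][j] - D2[perm[i]][perm[j]]) ** 2
               for i in range(n) for j in range(n))

def align(D1, D2):
    """Brute-force the one-to-one mapping with minimal distortion (toy sizes only)."""
    return min(permutations(range(len(D1))), key=lambda p: gw_cost(D1, D2, p))

def matching_rates(perm, categories):
    """Fraction of items mapped to the same item (fine) and same category (coarse)."""
    n = len(perm)
    fine = sum(perm[i] == i for i in range(n)) / n
    coarse = sum(categories[perm[i]] == categories[i] for i in range(n)) / n
    return fine, coarse

# Made-up data: four objects in two categories (not taken from the paper).
categories = [0, 0, 1, 1]
human = [0.0, 1.0, 5.0, 7.0]
model_fine = [0.1, 1.0, 5.2, 7.0]    # preserves the item-level structure
model_coarse = [1.0, 0.0, 7.0, 5.0]  # items swapped within each category

D_human = dissimilarity(human)
perm_fine = align(D_human, dissimilarity(model_fine))
perm_coarse = align(D_human, dissimilarity(model_coarse))

print(perm_fine, matching_rates(perm_fine, categories))
print(perm_coarse, matching_rates(perm_coarse, categories))
```

In this toy setup the first model is matched object-for-object, while the second is matched only within the correct category, which mirrors the fine versus coarse distinction drawn in the summary. Brute force scales factorially, so realistic numbers of objects require an optimal transport solver rather than enumeration.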
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Face Recognition and Perception · Generative Adversarial Networks and Image Synthesis
Methods: Contrastive Language-Image Pre-training
