Vision-language models learn the geometry of human perceptual space
Craig Sanders, Billy Dickson, Sahaj Singh Maini, Robert Nosofsky, Zoran Tiganj

TL;DR
This paper shows that vision-language models (VLMs) learn a low-dimensional perceptual space whose axes align with those of human perception. Used as input to an exemplar model, this geometry predicts human categorization better than a space built from direct human judgments, linking AI and cognitive science.
Contribution
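Introduces a method for probing the internal geometry of VLMs: the models generate pairwise similarity judgments for natural objects, and multidimensional scaling recovers the underlying space. The recovered spaces capture human-like perceptual structure relevant to categorization.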
Findings
VLMs recover low-dimensional perceptual spaces aligned with human perception.
As input to a classic exemplar model, AI-derived spaces predict human categorization more accurately than a space derived from human judgments.
Provides a scalable approach to study cognitive representations using AI models.
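To make the recovery of a low-dimensional space from pairwise similarity judgments concrete, here is a minimal sketch. It uses classical (Torgerson) multidimensional scaling, which is swapped in for simplicity; the summary does not say which MDS variant the authors used. The function name `classical_mds`, the 1-9 rating scale, and the toy ratings are all hypothetical and are not data from the paper.

```python
import numpy as np


def classical_mds(dissimilarity, n_dims=2):
    """Embed items in n_dims dimensions from a symmetric dissimilarity matrix."""
    d = np.asarray(dissimilarity, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to get a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the eigenvectors with the largest eigenvalues as coordinate axes.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Hypothetical 1-9 similarity ratings for four items (illustrative only).
ratings = np.array([
    [9.0, 7.0, 2.0, 1.0],
    [7.0, 9.0, 3.0, 2.0],
    [2.0, 3.0, 9.0, 8.0],
    [1.0, 2.0, 8.0, 9.0],
])
dissimilarity = ratings.max() - ratings  # higher rating -> smaller distance
coords = classical_mds(dissimilarity, n_dims=2)
print(coords.shape)
```

Each row of `coords` is one item's position in the recovered space. Rating data are ordinal, so a nonmetric MDS routine may be the more appropriate choice in practice; the classical variant is shown here only because it fits in a few lines.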
Abstract
In cognitive science and AI, a longstanding question is whether machines learn representations that align with those of the human mind. While current models show promise, it remains an open question whether this alignment is superficial or reflects a deeper correspondence in the underlying dimensions of representation. Here we introduce a methodology to probe the internal geometry of vision-language models (VLMs) by having them generate pairwise similarity judgments for a complex set of natural objects. Using multidimensional scaling, we recover low-dimensional psychological spaces and find that their axes show a strong correspondence with the principal axes of human perceptual space. Critically, when this AI-derived representational geometry is used as the input to a classic exemplar model of categorization, it predicts human classification behavior more accurately than a space…
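The abstract mentions feeding the recovered geometry into a classic exemplar model of categorization. As a rough illustration of that kind of model, the sketch below follows the general form of the Generalized Context Model: similarity decays exponentially with distance, and the probability of a category is its summed similarity divided by the total. Assuming the GCM is the model in question is a guess on our part; the function name `gcm_probabilities`, the parameter values, and the toy coordinates are hypothetical.

```python
import math


def gcm_probabilities(probe, exemplars, labels, c=1.0, r=2.0, weights=None):
    """Return category probabilities for a probe point (assumed GCM-style rule).

    c is a sensitivity parameter, r sets the distance metric (2 = Euclidean),
    and weights are per-dimension attention weights (equal by default).
    """
    dims = len(probe)
    if weights is None:
        weights = [1.0 / dims] * dims
    summed = {}
    for exemplar, label in zip(exemplars, labels):
        # Attention-weighted Minkowski distance between probe and exemplar.
        distance = sum(
            w * abs(p - e) ** r for w, p, e in zip(weights, probe, exemplar)
        ) ** (1.0 / r)
        # Similarity decays exponentially with distance.
        summed[label] = summed.get(label, 0.0) + math.exp(-c * distance)
    total = sum(summed.values())
    return {label: value / total for label, value in summed.items()}


# Hypothetical 2-D coordinates, e.g. rows of an MDS solution (illustrative only).
exemplars = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2)]
labels = ["A", "A", "B", "B"]
probs = gcm_probabilities((0.1, 0.1), exemplars, labels, c=2.0)
print(probs)
```

In a setup like this, the coordinates would come from either the VLM-derived or the human-derived space, and the two could be compared by how well the predicted probabilities match observed human classification responses.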
