# Comparing perceptual judgments in large multimodal models and humans

**Authors:** Billy Dickson, Sahaj Singh Maini, Craig Sanders, Robert Nosofsky, Zoran Tiganj

PMC · DOI: 10.3758/s13428-025-02728-w · 2025-06-19

## TL;DR

This paper compares perceptual judgments of rock images made by large multimodal models (LMMs) such as GPT-4o with those made by humans. GPT-4o aligns well with mean human ratings on elementary dimensions but less well on abstract, rock-specific ones.

## Contribution

The study provides a benchmark for evaluating LMMs against human perceptual judgment data, using a rock-image dataset from cognitive science with human ratings along 10 dimensions.

## Key findings

- Among the LMMs tested, GPT-4o showed the strongest positive correlation with human ratings, particularly for elementary dimensions such as lightness, chromaticity, shininess, and fine/coarse grain texture.
- Its correlations with human ratings were lower for more abstract, rock-specific emergent dimensions such as organization and pegmatitic structure.
- GPT-4o already appears to be approaching the level of consensus observed across human groups for the perceptual features examined, although there is room for further improvement.

## Abstract

Cognitive scientists commonly collect participants' judgments regarding perceptual characteristics of stimuli to develop and evaluate models of attention, memory, learning, and decision-making. For instance, to model human responses in tasks of category learning and item recognition, researchers often collect perceptual judgments of images in order to embed the images in multidimensional feature spaces. This process is time-consuming and costly. Recent advancements in large multimodal models (LMMs) provide a potential alternative because such models can respond to prompts that include both text and images and could potentially replace human participants. To test whether the available LMMs can indeed be useful for this purpose, we evaluated their judgments on a dataset consisting of rock images that has been widely used by cognitive scientists. The dataset includes human perceptual judgments along 10 dimensions considered important for classifying rock images. Among the LMMs that we investigated, GPT-4o exhibited the strongest positive correlation with human responses and demonstrated promising alignment with the mean ratings from human participants, particularly for elementary dimensions such as lightness, chromaticity, shininess, and fine/coarse grain texture. However, its correlations with human ratings were lower for more abstract and rock-specific emergent dimensions such as organization and pegmatitic structure. Although there is room for further improvement, the model already appears to be approaching the level of consensus observed across human groups for the perceptual features examined here. Our study provides a benchmark for evaluating future LMMs on human perceptual judgment data.
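The comparison described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: that "correlation" means Pearson's r, that model ratings are compared against per-image mean human ratings, and that human consensus is estimated by correlating the mean ratings of two participant groups. All function names (`pearson`, `mean_ratings`, `alignment_by_dimension`, `group_consensus`) and the toy ratings are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


def mean_ratings(ratings_by_participant):
    """Average ratings over participants, giving one mean rating per image."""
    n_participants = len(ratings_by_participant)
    n_images = len(ratings_by_participant[0])
    return [
        sum(p[i] for p in ratings_by_participant) / n_participants
        for i in range(n_images)
    ]


def alignment_by_dimension(human, model):
    """Correlate model ratings with mean human ratings, one value per dimension.

    human: {dimension: [[rating per image] per participant]}
    model: {dimension: [rating per image]}
    """
    return {
        dim: pearson(mean_ratings(human[dim]), model[dim])
        for dim in model
    }


def group_consensus(ratings_by_participant):
    """Assumed consensus baseline: correlate the mean ratings of two groups."""
    half = len(ratings_by_participant) // 2
    first = mean_ratings(ratings_by_participant[:half])
    second = mean_ratings(ratings_by_participant[half:])
    return pearson(first, second)


# Toy ratings (made up for illustration): 4 participants x 3 images.
toy_human = {"lightness": [[1, 5, 9], [3, 5, 7], [2, 4, 9], [2, 6, 7]]}
toy_model = {"lightness": [2, 4, 8]}

model_alignment = alignment_by_dimension(toy_human, toy_model)
human_baseline = group_consensus(toy_human["lightness"])
```

Under these assumptions, a per-dimension value in `model_alignment` that comes close to the corresponding `human_baseline` would correspond to the abstract's description of the model approaching the level of consensus observed across human groups.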

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12178973/full.md

---
Source: https://tomesphere.com/paper/PMC12178973