Grounded Concreteness: Human-Like Concreteness Sensitivity in Vision-Language Models

Aryan Roy; Zekun Wang; Christopher J. MacLellan

arXiv:2601.18065 · cs.CL · January 27, 2026

Open Access

TL;DR

This study asks whether vision-language models (VLMs) are more sensitive to linguistic concreteness, in a human-like way, than text-only models. It finds that VLMs show stronger concreteness effects, more structured representations, and closer alignment with human judgments.

Contribution

The paper provides a controlled comparison of text-only and vision-language models, showing that multimodal pretraining strengthens concreteness sensitivity and alignment with human judgments across multiple levels of evaluation.

Findings

01. VLMs achieve higher QA accuracy on concrete inputs.

02. Representations in VLMs are organized along a concreteness axis.

03. VLMs produce concreteness ratings that align more closely with human norms.

Abstract

Do vision-language models (VLMs) develop more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs) when both are evaluated with text-only prompts? We study this question with a controlled comparison between matched Llama text backbones and their Llama Vision counterparts across multiple model scales, treating multimodal pretraining as an ablation on perceptual grounding rather than access to images at inference. We measure concreteness effects at three complementary levels: (i) output behavior, by relating question-level concreteness to QA accuracy; (ii) embedding geometry, by testing whether representations organize along a concreteness axis; and (iii) attention dynamics, by quantifying context reliance via attention-entropy measures. In addition, we elicit token-level concreteness ratings from models and evaluate alignment to human norm…
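To make the attention-entropy idea in level (iii) concrete, here is a minimal sketch in Python. It assumes the measure is the Shannon entropy of each query token's attention distribution, averaged over tokens; the abstract does not state the exact formulation, so the function names `attention_entropy` and `mean_attention_entropy` and the toy weights are illustrative assumptions, not the authors' implementation.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    Assumption: `weights` are non-negative attention weights for a single
    query token over all context tokens. They are renormalized here so
    small rounding errors do not matter.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    entropy = 0.0
    for w in weights:
        p = w / total
        if p > 0:  # 0 * log(0) is treated as 0
            entropy -= p * math.log(p)
    return entropy


def mean_attention_entropy(rows):
    """Average entropy over query positions (one row per query token)."""
    return sum(attention_entropy(r) for r in rows) / len(rows)


# Hypothetical attention rows: one focused on a single token, one spread evenly.
focused = [0.97, 0.01, 0.01, 0.01]
diffuse = [0.25, 0.25, 0.25, 0.25]

print(attention_entropy(focused))
print(attention_entropy(diffuse))
print(mean_attention_entropy([focused, diffuse]))
```

Under this reading, low entropy means attention concentrates on a few context tokens and high entropy means it is spread broadly, which is one way to quantify how much a model relies on surrounding context.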

Taxonomy

Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Language, Metaphor, and Cognition