The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding
Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene

TL;DR
This study evaluates how well vision-language models understand high-level scene concepts. The models excel at general knowledge but lack an embodied understanding of affordances, which points to the limits of distributional learning.
Contribution
The paper demonstrates that current VLMs struggle with affordance understanding and introduces a Human-Calibrated Cosine Distance metric for evaluation.
Findings
VLMs perform well on general knowledge tasks but poorly on affordance tasks.
The affordance gap is structural: neither newer models nor explicit spatial information reduces it.
Image caption datasets lack sufficient agent-centered affordance language.
Abstract
What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In…
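The abstract describes HCD only in words: similarity of a VLM output to the distribution of human responses, scaled by within-human variability. The sketch below is one plausible reading of that description, not the authors' formula. It assumes each response has already been turned into an embedding vector (the embedding step is not shown), takes the mean cosine distance from the model's vector to every human vector, and divides by the mean pairwise cosine distance among the human vectors. The function names `cosine_distance` and `human_calibrated_distance` are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def human_calibrated_distance(model_vec, human_vecs):
    """Assumed form of a human-calibrated score (not the paper's exact metric).

    Numerator: mean cosine distance from the model response to each human response.
    Denominator: mean pairwise cosine distance among human responses.
    """
    if len(human_vecs) < 2:
        raise ValueError("need at least two human responses")

    # How far the model sits from the human responses, on average.
    model_to_human = sum(
        cosine_distance(model_vec, h) for h in human_vecs
    ) / len(human_vecs)

    # How far human responses sit from one another, on average.
    pairs = list(combinations(human_vecs, 2))
    within_human = sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

    if within_human == 0:
        raise ValueError("human responses are identical; the scale is undefined")

    return model_to_human / within_human


# Toy 2-D "embeddings": two humans who disagree, and a model that matches one of them.
humans = [[1.0, 0.0], [0.0, 1.0]]
score = human_calibrated_distance([1.0, 0.0], humans)
```

Under this reading, a score near 1 would mean the model is about as far from human responses as humans are from each other, and larger values would mean the model is less human-like than a typical human observer. Dividing by within-human variability keeps tasks with naturally diverse answers from penalizing the model more than tasks with near-consensus answers.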
