Caption This, Reason That: VLMs Caught in the Middle

Zihan Weng; Lucas Gomez; Taylor Whittington Webb; Pouya Bashivan

arXiv:2505.21538 · cs.CV · November 14, 2025

TL;DR

This paper analyzes the cognitive strengths and weaknesses of vision-language models (VLMs) using cognitive science methodologies, revealing gaps in spatial and attention tasks and proposing targeted fine-tuning as a potential improvement strategy.

Contribution

The paper introduces a cognitive framework for evaluating VLMs, identifies key limitations in their reasoning abilities, and demonstrates that fine-tuning smaller models enhances core cognitive skills.

Findings

01

Advanced models excel in category identification

02

Significant gaps remain in spatial and attention tasks

03

Fine-tuning improves core reasoning abilities

Abstract

Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models…

Videos

Caption This, Reason That: VLMs Caught in the Middle · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Data Visualization and Analytics