Caption This, Reason That: VLMs Caught in the Middle
Zihan Weng, Lucas Gomez, Taylor Whittington Webb, Pouya Bashivan

TL;DR
This paper uses methods from cognitive science to profile the strengths and weaknesses of vision-language models (VLMs). It finds that the models still fall short on tasks requiring spatial understanding or selective attention, and it proposes targeted fine-tuning as a potential way to improve them.
Contribution
The paper introduces a cognitive framework for evaluating VLMs along the axes of Perception, Attention, and Memory, identifies key limitations in their reasoning abilities, and demonstrates that fine-tuning smaller models enhances core cognitive skills.
Findings
Advanced models approach ceiling performance on some tasks, such as category identification.
A significant gap persists on tasks requiring spatial understanding or selective attention.
Fine-tuning smaller models improves core reasoning abilities.
Abstract
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Data Visualization and Analytics
