Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem
Declan Campbell, Sunayana Rane, Tyler Giallanza, Nicolò De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, Steven M. Frankland, Thomas L. Griffiths, Jonathan D. Cohen, Taylor W. Webb

TL;DR
This paper investigates the limitations of vision language models (VLMs) by linking their failures in multi-object reasoning to the binding problem from cognitive science, and it notes similarities to the constraints on rapid human visual processing.
Contribution
It introduces a theoretical framework connecting VLM failures to the binding problem, offering insights into their core limitations and parallels with human cognition.
Findings
VLM failures in multi-object tasks are explained by the binding problem.
Similar limitations are observed between VLMs and rapid human visual processing.
The binding problem accounts for many of the models' surprising errors.
Abstract
Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs), including both multimodal language models and text-to-image models. These models are able to describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks -- such as counting, localization, and simple forms of visual analogy -- that humans perform with near perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience, a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., to represent multiple objects in an image), necessitating the use of serial processing to avoid interference. We find that…
Taxonomy
Topics: Language, Metaphor, and Cognition · linguistics and terminology studies · Categorization, perception, and language
Methods: Sparse Evolutionary Training
