Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities
Shivam Chandhok, Wan-Cyuan Fan, Leonid Sigal

TL;DR
This paper investigates the fundamental limitations of state-of-the-art vision-language models in basic visual tasks, revealing surprising shortcomings and providing insights for future improvements beyond current benchmarks.
Contribution
The study introduces comprehensive tests probing core visual understanding skills and compares model components, uncovering unexpected weaknesses in VLMs not captured by standard benchmarks.
Findings
VLMs show limitations in object classification and spatial reasoning.
Intermediate features reveal more about model shortcomings than final outputs.
Probing individual model components helps localize which parts of the design are lacking.
Abstract
Vision-Language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable but, at the same time, to lack some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks: object classification, understanding spatial arrangement, and the ability to delineate individual object instances (through counting), by constructing a series of tests that probe which components of the design, specifically, may be lacking. Importantly, we go significantly beyond current benchmarks, which simply measure the final performance of a VLM, by also comparing and contrasting it with the performance of probes trained directly on features obtained from the visual encoder (image embeddings), as well as the intermediate vision-language projection used to…
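The abstract mentions probes trained directly on features from the visual encoder. As a rough illustration of what such a probe can look like, the sketch below fits a binary logistic-regression classifier on frozen feature vectors and reports held-out accuracy. It is not the authors' code: the features are synthetic stand-ins for image embeddings, and the names `make_synthetic_features`, `train_linear_probe`, and `probe_accuracy`, along with all hyperparameters, are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_features(n, dim, seed):
    # Hypothetical stand-in for frozen image embeddings: each class is
    # shifted along every dimension and blurred with Gaussian noise.
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        features.append([(2 * y - 1) + rng.gauss(0.0, 0.5) for _ in range(dim)])
        labels.append(y)
    return features, labels


def train_linear_probe(features, labels, epochs=200, lr=0.1):
    # Full-batch gradient descent on the logistic loss. Only the probe
    # weights are learned; the features themselves are never updated.
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    # Fraction of examples whose predicted class matches the label.
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


# Train on 150 synthetic examples and evaluate on 50 held-out ones.
features, labels = make_synthetic_features(n=200, dim=8, seed=0)
w, b = train_linear_probe(features[:150], labels[:150])
accuracy = probe_accuracy(w, b, features[150:], labels[150:])
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a setting like the one the abstract describes, the same probe would presumably be fit separately on each feature source (visual encoder output, vision-language projection) and its accuracy compared with the accuracy of the VLM's final answers on the same task.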
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Sparse Evolutionary Training · BLIP: Bootstrapping Language-Image Pre-training
