Hidden in plain sight: VLMs overlook their visual representations
Stephanie Fu, Tyler Bonnen, Devin Guillory, Trevor Darrell

TL;DR
This paper investigates why vision language models (VLMs) underperform on vision-centric tasks. It finds that VLMs rely heavily on language priors and do not effectively use the visual representations available to them.
Contribution
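The study analyzes VLMs' limitations on vision-centric tasks along three axes: degradation of vision representations, brittleness to the task prompt, and the language model's role in solving the task. It locates the bottleneck in the third: the language model does not make effective use of the visual information it receives.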
Findings
VLMs perform substantially worse than a direct readout of their own visual encoders on vision-centric tasks
Performance drops to near-chance levels on vision-centric benchmarks (e.g., depth estimation, correspondence)
VLMs are heavily influenced by language priors, limiting visual understanding
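To make "direct readout" and "near-chance" concrete, here is a minimal sketch of how a visual encoder could answer a correspondence-style multiple-choice question without any language model. It is an illustration under stated assumptions, not the paper's evaluation protocol: the feature vectors are hand-written stand-ins for encoder outputs, and the names `readout_choice`, `chance_accuracy`, and `accuracy` are hypothetical.

```python
import numpy as np


def cosine_similarity(query, candidates):
    """Cosine similarity between one query vector and each candidate row."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query


def readout_choice(query_feat, candidate_feats):
    """Hypothetical readout: pick the candidate most similar to the query."""
    return int(np.argmax(cosine_similarity(query_feat, candidate_feats)))


def chance_accuracy(num_choices):
    """Accuracy of uniform random guessing on a multiple-choice question."""
    return 1.0 / num_choices


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return float(np.mean(np.asarray(predictions) == np.asarray(labels)))


# Toy example: stand-in features (assumed, not real encoder outputs).
query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
choice = readout_choice(query, candidates)  # selects the second candidate (index 1)
```

Comparing a model's `accuracy` against `chance_accuracy` for the same number of answer choices is what "near-chance" refers to: a VLM that scores close to that baseline is extracting little usable signal for the task, even when a readout like the one above succeeds on the same features.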
Abstract
Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Data Visualization and Analytics
