Hidden in plain sight: VLMs overlook their visual representations

Stephanie Fu; Tyler Bonnen; Devin Guillory; Trevor Darrell

arXiv:2506.08008·cs.CV·June 10, 2025

Open Access

TL;DR

This paper investigates why vision language models (VLMs) underperform on vision-centric tasks. It finds that VLMs rely heavily on language priors and do not effectively use the visual information already present in their own visual representations.

Contribution

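The study analyzes VLMs' limitations on vision-centric tasks along three lines: degradation of the visual representations, brittleness to the task prompt, and the language model's role in solving the task. It locates the bottleneck in the third: the language model does not make effective use of the visual information it receives.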

Findings

01 VLMs perform substantially worse than a direct readout of their own visual encoders on vision-centric tasks

02 On benchmarks such as depth estimation and correspondence, VLM performance drops to near chance

03 VLMs are heavily influenced by language priors, which limits their use of visual information

Abstract

Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model's role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual…
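To make the abstract's comparison concrete, here is a minimal sketch of what a "direct readout" and "near-chance performance" could look like on a multiple-choice correspondence item. It is an illustration under assumptions, not the authors' evaluation code: the feature vectors are toy values standing in for visual-encoder features, and the helper names (`cosine`, `readout_choice`, `accuracy`, `chance_accuracy`) are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def readout_choice(query: Sequence[float], candidates: List[Sequence[float]]) -> int:
    """Hypothetical direct readout: pick the candidate most similar to the query."""
    sims = [cosine(query, c) for c in candidates]
    return max(range(len(sims)), key=sims.__getitem__)


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of items where the predicted choice matches the label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing over num_choices options."""
    return 1.0 / num_choices


# Toy item (assumed values): one reference-point feature and four candidate features.
query = [1.0, 0.0, 0.5]
candidates = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.4],   # closest in direction to the query
    [-1.0, 0.0, 0.0],
    [0.2, 0.2, 0.9],
]

picked = readout_choice(query, candidates)
print("readout picks candidate", picked)
print("chance level for 4 choices:", chance_accuracy(len(candidates)))
```

In this framing, a model whose accuracy on four-way items sits close to `chance_accuracy(4)` is doing little better than guessing, while a readout that scores well above it shows the needed information is present in the features.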

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Data Visualization and Analytics