VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong

TL;DR
Vision-language models excel in semantic reasoning but struggle with fine-grained visual details unless entities are nameable, revealing a reliance on language over visual perception.
Contribution
This work uncovers that VLMs depend on semantic labels for visual details and demonstrates that teaching arbitrary names or finetuning enhances visual perception capabilities.
Findings
VLMs perform better with nameable entities than unnameable ones.
Explicit semantic labels are recovered by VLMs for known entities.
Task-specific finetuning improves visual perception without language priors.
Abstract
Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they often fail on tasks that require fine-grained visual perception, even when the required information is still present in their internal representations. Prior work has attributed this ``hidden-in-plain-sight'' gap to the language model, but the cause remains unexplained. In this work, we demonstrate that this gap arises from the language model's lack of semantic labels for fine-grained visual details: when visual entities can be mapped to known concepts, VLMs bypass visual comparison and reason through language; when they cannot, VLMs resort to brittle and hallucinated descriptions. We verify this across semantic correspondence, synthetic shape matching, and face matching, and find that VLMs perform much better when the relevant entities are nameable than when they…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
