Vision language models are blind: Failing to translate detailed visual features into words
Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen

TL;DR
Large vision-language models excel at high-level tasks but struggle with low-level, spatially precise visual tasks, revealing a gap in translating detailed visual features into words.
Contribution
This paper demonstrates that current VLMs fail on simple spatial tasks because they cannot reliably decode fine visual details into language, which highlights a critical limitation.
Findings
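Four state-of-the-art VLMs average only 58.07% accuracy across the BlindTest tasks.
The best model, Claude 3.5 Sonnet, reaches 77.84%, still far from the expected human accuracy of 100%.
VLMs struggle with tasks that need precise spatial information when shapes overlap or are close together, but succeed when shapes are separated.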
Abstract
While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, score high on many vision-understanding benchmarks, they are still struggling with low-level vision tasks that are easy to humans. Specifically, on BlindTest, our suite of 7 very simple tasks, including identifying (a) whether two circles overlap; (b) how many times two lines intersect; (c) which letter is being circled in a word; and (d) the number of circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.07% accurate on average. Claude 3.5 Sonnet performs the best at 77.84% accuracy, far from the human expected accuracy of 100%. Across different image resolutions and line widths, VLMs including slow-thinking models consistently struggle with those tasks that require precise spatial information when geometric primitives overlap or are close. Yet, VLMs perform at near-100%…
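To make tasks (a) and (b) concrete, the sketch below computes the ground-truth answers geometrically. It is an illustration only, not the authors' code: the function names, the representation of each line as a list of (x, y) points, and the choice to count only proper crossings (ignoring touching endpoints and collinear overlaps) are all assumptions made here.

```python
import math

def circles_overlap(c1, r1, c2, r2):
    """True if the two circles share interior area (center distance < r1 + r2)."""
    return math.dist(c1, c2) < r1 + r2

def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_cross(a, b, c, d):
    """True if segments ab and cd properly cross (touching/collinear cases excluded)."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)

def count_intersections(line1, line2):
    """Count proper crossings between two polylines given as lists of (x, y) points."""
    segs1 = list(zip(line1, line1[1:]))
    segs2 = list(zip(line2, line2[1:]))
    return sum(segments_cross(a, b, c, d) for a, b in segs1 for c, d in segs2)

# Hypothetical example inputs, not taken from the benchmark.
print(circles_overlap((0, 0), 1, (1.5, 0), 1))
print(count_intersections([(0, 0), (1, 2), (2, 0)], [(0, 1), (2, 1)]))
```

Each check is a few lines of arithmetic, which is part of why the tasks are described as very simple: the answer is fully determined by the coordinates, so any error comes from perceiving the image rather than from ambiguity in the question.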
Taxonomy
Topics: Multimodal Machine Learning Applications
