Vision language models are blind: Failing to translate detailed visual features into words

Pooyan Rahmanzadehgervi; Logan Bolton; Mohammad Reza Taesiri; Anh Totti Nguyen

arXiv:2407.06581·cs.AI·March 28, 2025

TL;DR

Large vision-language models excel at high-level tasks but struggle with low-level, spatially precise visual tasks, revealing a gap in translating detailed visual features into words.

Contribution

This paper demonstrates that current VLMs fail on simple, spatially precise tasks because they cannot translate fine visual details into language, which highlights a critical limitation of these models.

Findings

01

Four state-of-the-art VLMs average only 58.07% accuracy on the 7 BlindTest tasks.

02

Claude 3.5 Sonnet is the best-performing model at 77.84% accuracy, still far below the expected human accuracy of 100%.

03

VLMs struggle with tasks that require precise spatial information when geometric primitives overlap or are close together, but succeed when the shapes are separated.

Abstract

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, score high on many vision-understanding benchmarks, they are still struggling with low-level vision tasks that are easy to humans. Specifically, on BlindTest, our suite of 7 very simple tasks, including identifying (a) whether two circles overlap; (b) how many times two lines intersect; (c) which letter is being circled in a word; and (d) the number of circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.07% accurate on average. Claude 3.5 Sonnet performs the best at 77.84% accuracy, far from the human expected accuracy of 100%. Across different image resolutions and line widths, VLMs including slow-thinking models consistently struggle with those tasks that require precise spatial information when geometric primitives overlap or are close. Yet, VLMs perform at near-100%…
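
To make tasks (a) and (b) from the abstract concrete, the following is a minimal sketch of how ground-truth answers for such images could be computed from the shape parameters. It is not the authors' code: the function names, the point representation, and the choice to count tangent circles as overlapping are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def circles_overlap(c1: Point, r1: float, c2: Point, r2: float) -> bool:
    """Return True if two circles overlap or touch.

    Assumption: tangent circles (center distance == r1 + r2) count as overlapping.
    """
    return math.dist(c1, c2) <= r1 + r2


def _orient(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc; the sign gives the turn direction."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """Return True if segments p1p2 and q1q2 cross at an interior point.

    Endpoint touches and collinear overlaps are deliberately not counted.
    """
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def count_crossings(line_a: List[Point], line_b: List[Point]) -> int:
    """Count crossings between two piecewise-linear lines given as point lists."""
    return sum(
        segments_cross(a1, a2, b1, b2)
        for a1, a2 in zip(line_a, line_a[1:])
        for b1, b2 in zip(line_b, line_b[1:])
    )


if __name__ == "__main__":
    print(circles_overlap((0, 0), 1.0, (1.5, 0), 1.0))
    print(count_crossings([(0, 0), (1, 2), (2, 0)], [(0, 1), (2, 1)]))
```

Both checks are a few lines of exact geometry, which is the point of the comparison: the answer is unambiguous given the coordinates, so any error comes from reading the rendered image rather than from the difficulty of the question.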

Code & Models

Repositories

anguyen8/vision-llms-are-blind (official)

Datasets

XAI/vlmsareblind
dataset · 1.1k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications