Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families

Yuval Levental

arXiv:2602.15950 · cs.CV · February 24, 2026

TL;DR

This paper shows that three frontier vision-language models transcribe 15x15 binary grids far more accurately when the cells are rendered as text symbols than when they are rendered as filled squares, exposing a limitation in the spatial localization of non-textual elements.

Contribution

The study shows that VLMs localize filled cells far less accurately when those cells lack textual identity. Because both image types pass through the same visual encoder, the gap is not solely due to visual encoding.

Findings

01  VLMs transcribe text-symbol grids well (approximately 84-91% cell accuracy, 63-84% F1)

02  Performance drops sharply on filled squares (60-73% cell accuracy, 29-39% F1)

03  Models exhibit distinct failure modes in spatial localization

Abstract

We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types -- text symbols (. and #) and filled squares without gridlines -- then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder -- the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54…
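The setup in the abstract can be illustrated with a minimal Python sketch. It is not the author's code: the function names (`make_grid`, `to_text`, `score`), the random sampling scheme, and the choice to compute F1 with filled cells as the positive class are all assumptions made here for illustration. The sketch builds a 15x15 binary grid, writes it in the `.`/`#` text form, and scores a transcription by cell accuracy and F1.

```python
import random

SIZE = 15  # grid side length, as stated in the abstract


def make_grid(n_filled, size=SIZE, seed=0):
    """Return a size x size grid of 0/1 with exactly n_filled cells set to 1.

    Uniform random placement is an assumption; the abstract does not say
    how the filled cells were chosen.
    """
    rng = random.Random(seed)
    filled = set(rng.sample(range(size * size), n_filled))
    return [[1 if r * size + c in filled else 0 for c in range(size)]
            for r in range(size)]


def to_text(grid):
    """Write the grid with '.' for empty and '#' for filled cells."""
    return "\n".join("".join("#" if v else "." for v in row) for row in grid)


def score(truth, pred):
    """Return (cell accuracy, F1), treating filled cells as positives (assumed)."""
    tp = fp = fn = tn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# 24 of 225 cells is about 10.7%, the lowest density in the abstract.
grid = make_grid(n_filled=24)
all_empty = [[0] * SIZE for _ in range(SIZE)]
accuracy, f1 = score(grid, all_empty)
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

In this sketch, a transcription that predicts every cell as empty still reaches about 89% cell accuracy on the sparsest grid while its F1 is 0. This is one reason the accuracy and F1 figures in the abstract are worth reading together: on sparse grids, accuracy alone rewards guessing "empty".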

Taxonomy

Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Language, Metaphor, and Cognition