TL;DR
Grid2Matrix (G2M) is a benchmark showing that vision-language models often fail to faithfully read out all the visual details of a color grid, even at small grid sizes, exposing a gap the paper calls Digital Agnosia.
Contribution
The paper introduces G2M, a controlled benchmark for analyzing how well VLMs retain visual detail, and identifies a systematic failure mode it calls Digital Agnosia.
Findings
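In zero-shot end-to-end evaluation, VLMs collapse sharply on surprisingly small grids instead of degrading gradually.
Probing shows that the visual encoders preserve substantially more grid information than the end-to-end outputs reflect.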
Failures depend on grid cell overlap with visual patches.
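To make the patch-overlap finding concrete, here is a minimal sketch of one way to check whether a grid cell is cut by a patch boundary. This is not the paper's measure: the function names are hypothetical, and the image size (224 px) and patch size (14 px) in the usage note are illustrative assumptions only.

```python
def split_cells_1d(image_px, grid_n, patch_px):
    """Return one bool per grid cell along a single axis.

    True means a patch boundary falls strictly inside that cell.
    All names and the overlap criterion are illustrative assumptions,
    not the paper's definition.
    """
    if image_px <= 0 or grid_n <= 0 or patch_px <= 0:
        raise ValueError("all sizes must be positive integers")
    # Scale every coordinate by grid_n so cell edges stay integers:
    # cell k spans [k * image_px, (k + 1) * image_px) in scaled units,
    # and patch boundaries sit at multiples of patch_px * grid_n.
    scaled_patch = patch_px * grid_n
    flags = []
    for k in range(grid_n):
        start = k * image_px
        end = (k + 1) * image_px
        # First patch boundary strictly greater than the cell's start.
        next_boundary = (start // scaled_patch + 1) * scaled_patch
        flags.append(next_boundary < end)
    return flags


def intact_fraction_2d(image_px, grid_n, patch_px):
    """Fraction of cells in a square grid_n x grid_n grid that are not
    cut by a patch boundary on either axis."""
    intact_1d = split_cells_1d(image_px, grid_n, patch_px).count(False)
    return (intact_1d / grid_n) ** 2
```

Under these assumed sizes, `intact_fraction_2d(224, 16, 14)` is 1.0 because every cell coincides with exactly one patch, while `intact_fraction_2d(224, 8, 14)` is 0.0 because every 28 px cell is cut by a boundary at its midpoint.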
Abstract
Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end…
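The task setup described in the abstract can be sketched as follows. This is a hedged illustration, not the benchmark's code: the palette, the function names, the text format parsed by `parse_matrix`, and the two scores (exact match and per-cell accuracy) are all assumptions, and rendering the grid as an image is left out.

```python
import random

# Hypothetical palette; the benchmark's actual colors are not given here.
PALETTE = ["red", "green", "blue", "yellow", "purple", "orange", "cyan", "gray"]


def make_instance(rows, cols, num_colors, seed=0):
    """Build one G2M-style instance: a grid of color names, a
    color-to-number mapping, and the target matrix."""
    if not 1 <= num_colors <= len(PALETTE):
        raise ValueError("num_colors must be between 1 and len(PALETTE)")
    rng = random.Random(seed)
    colors = PALETTE[:num_colors]
    mapping = {color: index for index, color in enumerate(colors)}
    grid = [[rng.choice(colors) for _ in range(cols)] for _ in range(rows)]
    target = [[mapping[color] for color in row] for row in grid]
    return grid, mapping, target


def parse_matrix(text):
    """Parse a model answer such as '0 1 2\\n1 0 2' (spaces or commas
    between numbers, one row per line) into a list of integer rows."""
    rows = []
    for line in text.strip().splitlines():
        tokens = line.replace(",", " ").split()
        if tokens:
            rows.append([int(token) for token in tokens])
    return rows


def score(pred, target):
    """Return (exact_match, cell_accuracy). Cells missing from the
    prediction count as wrong."""
    total = sum(len(row) for row in target)
    correct = 0
    for i, row in enumerate(target):
        for j, value in enumerate(row):
            if i < len(pred) and j < len(pred[i]) and pred[i][j] == value:
                correct += 1
    return pred == target, correct / total
```

Raising `rows`, `cols`, or `num_colors` corresponds to the complexity knobs the abstract mentions; reporting both scores would show whether a model degrades gradually per cell or collapses on the whole matrix.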
