Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs

Amirmohammad Izadi; Mohammad Ali Banayeeanzade; Fatemeh Askari; Ali Rahimiakbar; Mohammad Mahdi Vahedi; Hosein Hasani; Mahdieh Soleymani Baghshah

arXiv:2506.22146·cs.CV·November 11, 2025

TL;DR

This paper introduces VISER, a method that enhances visual reasoning in Large Vision-Language Models (LVLMs) by adding spatial structures to the visual input, yielding substantial performance improvements on core visual reasoning tasks.

Contribution

VISER augments visual inputs with low-level spatial structures and pairs them with a textual prompt that encourages sequential parsing, improving LVLM reasoning beyond what purely textual methods achieve.

Findings

01. Improves GPT-4o performance on visual search, counting, and spatial tasks by over 25%.

02. Reduces scene description error by 0.32 in edit distance.

03. Visual input structuring is crucial; textual strategies alone are insufficient.

Abstract

Despite progress in Large Vision-Language Models (LVLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current LVLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces Visual Input Structure for Enhanced Reasoning (VISER), a simple, effective method that augments visual inputs with low-level spatial structures and pairs them with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks, using only a single-query…
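The abstract describes two ingredients: a spatial structure overlaid on the image and a prompt that asks for sequential, region-by-region parsing. The sketch below illustrates that idea only; it is not the authors' implementation. The choice of evenly spaced horizontal lines, the prompt wording, and the names `add_horizontal_lines` and `build_prompt` are all assumptions made for illustration, and the image is modeled as a plain list of grayscale rows to keep the example dependency-free.

```python
from typing import List

# Grayscale image as rows of pixel intensities (0-255). Illustrative type only.
Image = List[List[int]]


def add_horizontal_lines(image: Image, n_regions: int, line_value: int = 0) -> Image:
    """Return a copy of `image` split into `n_regions` bands by horizontal lines.

    Assumption: evenly spaced horizontal lines stand in for the "low-level
    spatial structures" mentioned in the abstract.
    """
    if n_regions < 1:
        raise ValueError("n_regions must be at least 1")
    height = len(image)
    out = [row[:] for row in image]  # copy so the input is not modified
    for k in range(1, n_regions):
        y = k * height // n_regions  # row index of the k-th divider
        if 0 < y < height:
            out[y] = [line_value] * len(out[y])
    return out


def build_prompt(task: str, n_regions: int) -> str:
    """Build a hypothetical prompt that asks for band-by-band scanning."""
    return (
        f"The image is divided into {n_regions} horizontal bands by lines. "
        "Scan the bands one at a time from top to bottom, "
        "note what each band contains, "
        f"then answer: {task}"
    )


if __name__ == "__main__":
    blank = [[255] * 8 for _ in range(9)]
    structured = add_horizontal_lines(blank, n_regions=3)
    for row in structured:
        print("".join("#" if px == 0 else "." for px in row))
    print(build_prompt("How many circles are in the image?", n_regions=3))
```

In a real pipeline the structured image and the prompt would be sent together in a single query to the model; the line count, line color, and prompt text would need tuning per task.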

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques