Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, David Acuna

TL;DR
This paper introduces a benchmark for quantitative spatial reasoning and a prompting technique that steers large vision-language models to reason with reference objects, improving their performance without additional training.
Contribution
The work presents Q-Spatial Bench, a manually annotated benchmark of 271 questions across five categories for evaluating quantitative spatial reasoning, and proposes SpatialPrompt, a zero-shot prompting method that improves VLMs' accuracy by having them reason with reference objects.
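To make the idea of reasoning with reference objects concrete, here is a minimal sketch of a prompt wrapper. The function name `build_reference_object_prompt` and the instruction wording are hypothetical illustrations, not the paper's actual SpatialPrompt text; the sketch only shows the general shape such a zero-shot prompt might take.

```python
def build_reference_object_prompt(question: str) -> str:
    """Wrap a quantitative spatial question with reference-object instructions.

    The wording below is a hypothetical illustration and is NOT the
    SpatialPrompt text from the paper.
    """
    instructions = (
        "Answer the question about the image step by step.\n"
        "1. Pick a reference object in the image whose real-world size is well known.\n"
        "2. State the approximate size of that reference object.\n"
        "3. Use it as a scale to estimate the requested size or distance.\n"
        "4. Give the final answer as a single number with a unit."
    )
    # The question is appended unchanged so the model sees it verbatim.
    return f"{instructions}\n\nQuestion: {question}"


prompt = build_reference_object_prompt(
    "What is the distance between the chair and the table?"
)
```

The resulting string would be sent to a VLM together with the image; no model call is shown here because the summary does not specify an API.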
Findings
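State-of-the-art VLMs find reasoning about distances between objects particularly challenging.
The top-performing VLM's success rate rises by 19 points when a reasoning path that uses a reference object emerges naturally in its response.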
SpatialPrompt improves success rates by over 40 points for top models.
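The findings above are stated as success rates measured in points. The sketch below shows one way such a rate could be computed, assuming an estimate counts as a success when it is within a multiplicative factor `max_ratio` of the ground truth. That criterion, the default threshold of 2.0, and the function names are assumptions for illustration; the summary does not state the paper's exact metric, and the numbers are toy values, not results from the paper.

```python
from typing import Sequence


def is_success(prediction: float, ground_truth: float, max_ratio: float = 2.0) -> bool:
    """Return True if the estimate is within a factor of max_ratio of the truth.

    The criterion and the default threshold are assumptions for this sketch.
    """
    if prediction <= 0 or ground_truth <= 0:
        return False
    ratio = max(prediction / ground_truth, ground_truth / prediction)
    return ratio <= max_ratio


def success_rate(
    predictions: Sequence[float],
    ground_truths: Sequence[float],
    max_ratio: float = 2.0,
) -> float:
    """Percentage of questions answered within the tolerance."""
    pairs = list(zip(predictions, ground_truths))
    if not pairs:
        return 0.0
    hits = sum(is_success(p, g, max_ratio) for p, g in pairs)
    return 100.0 * hits / len(pairs)


# Toy distances in meters (invented for illustration only).
truth = [1.0, 1.0, 2.0, 1.0]
baseline = success_rate([0.5, 3.0, 10.0, 1.0], truth)
prompted = success_rate([0.8, 1.5, 2.5, 1.0], truth)
gain_in_points = prompted - baseline
```

A gain reported in points is the difference between two such percentages, not a relative improvement.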
Abstract
Despite recent advances demonstrating vision-language models' (VLMs) abilities to describe complex relationships in images using natural language, their capability to quantitatively reason about object sizes and distances remains underexplored. In this work, we introduce a manually annotated benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning and systematically investigate the performance of state-of-the-art VLMs on this task. Our analysis reveals that reasoning about distances between objects is particularly challenging for SoTA VLMs; however, some VLMs significantly outperform others, with an over 40-point gap between the two best performing models. We also make the surprising observation that the success rate of the top-performing VLM increases by 19 points when a reasoning path using a reference object emerges naturally…
Taxonomy
Topics: Constraint Satisfaction and Optimization · Multimodal Machine Learning Applications
