Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, Jiang Bian

TL;DR
This paper introduces an object-centric blueprint approach to improve spatial reasoning in vision-language models: the model first records the positions, sizes, and attributes of relevant objects in a structured blueprint, then reasons over that blueprint to answer spatial questions.
Contribution
The paper proposes a blueprint-based structured representation for spatial reasoning in VLMs, supported by supervised fine-tuning on blueprint-embedded reasoning traces, reinforcement learning rewards, and data augmentation.
Findings
Outperforms existing VLMs and spatial reasoning models.
Enhances global spatial awareness in vision-language tasks.
Demonstrates improved reasoning accuracy on spatial questions.
Abstract
Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image patches, improving fine-grained perception but weakening global spatial awareness, or mark isolated coordinates, which capture object locations but overlook their overall organization. In this work, we integrate the cognitive concept of an object-centric blueprint into VLMs to enhance spatial reasoning. Given an image and a question, the model first constructs a JSON-style blueprint that records the positions, sizes, and attributes of relevant objects, and then reasons over this structured representation to produce the final answer. To achieve this, we introduce three key techniques: (1) blueprint-embedded reasoning traces for supervised fine-tuning to…
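To make the "build a blueprint, then reason over it" idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the field names (`objects`, `name`, `bbox`, `attributes`), the pixel-coordinate box format, and the helper functions are all hypothetical choices made for illustration, and the rule-based comparison stands in for reasoning that the paper's model performs in natural language.

```python
import json

# Hypothetical blueprint for an image of a desk. The schema below is an
# illustrative assumption, not the format defined in the paper.
# bbox is assumed to be [x_min, y_min, x_max, y_max] in pixels.
blueprint_text = """
{
  "objects": [
    {"name": "mug", "bbox": [40, 120, 100, 190], "attributes": ["red"]},
    {"name": "laptop", "bbox": [150, 80, 420, 260], "attributes": ["open"]}
  ]
}
"""


def center(bbox):
    """Return the (x, y) center of a bounding box."""
    x_min, y_min, x_max, y_max = bbox
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def size(bbox):
    """Return the (width, height) of a bounding box."""
    x_min, y_min, x_max, y_max = bbox
    return (x_max - x_min, y_max - y_min)


def horizontal_relation(blueprint, first, second):
    """Compare two named objects by the x-coordinate of their centers."""
    objects = {obj["name"]: obj for obj in blueprint["objects"]}
    first_x, _ = center(objects[first]["bbox"])
    second_x, _ = center(objects[second]["bbox"])
    if first_x < second_x:
        return "left of"
    if first_x > second_x:
        return "right of"
    return "aligned with"


blueprint = json.loads(blueprint_text)
relation = horizontal_relation(blueprint, "mug", "laptop")
print(f"The mug is {relation} the laptop.")
```

Because all relevant objects sit in one structure, a question about their relative layout can be answered from the blueprint alone, without revisiting local image patches one at a time.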
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Language, Metaphor, and Cognition
