Imagine in Space: Exploring the Frontier of Spatial Intelligence and Reasoning Efficiency in Vision Language Models
Xiaoxing Lian, Aidong Yang, Jun Zhu, Peng Wang, Yue Zhang

TL;DR
This paper investigates the spatial reasoning capabilities of vision language models, reveals their reliance on linguistic cues and their inefficiency, and proposes a new framework to enhance their internal spatial understanding.
Contribution
The paper introduces SpatiaLite, a synthetic benchmark for spatial reasoning, and proposes the Imagery Driven Framework (IDF) to improve VLMs' internal spatial modeling.
Findings
VLMs rely mainly on linguistic representations for reasoning.
VLMs become markedly less efficient as transformation complexity increases.
The proposed IDF enhances internal spatial reasoning in VLMs.
Abstract
Large language models (LLMs) and vision language models (VLMs), such as DeepSeek R1, OpenAI o3, and Gemini 2.5 Pro, have demonstrated remarkable reasoning capabilities across logical inference, problem solving, and decision making. However, spatial reasoning, a fundamental component of human cognition that includes mental rotation, navigation, and spatial relationship comprehension, remains a significant challenge for current advanced VLMs. We hypothesize that imagination, the internal simulation of spatial states, is the dominant reasoning mechanism within a spatial world model. To test this hypothesis and systematically probe current VLM spatial reasoning mechanisms, we introduce SpatiaLite, a fully synthetic benchmark that jointly measures spatial reasoning accuracy and reasoning efficiency. Comprehensive experiments reveal three key findings. First, advanced VLMs predominantly rely on…
Taxonomy
Topics: Spatial Cognition and Navigation · Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization
