Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs
Fangrui Zhu, Hanhui Wang, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang

TL;DR
Struct2D introduces a perception-guided prompting framework that enables multimodal large language models to perform spatial reasoning using only structured 2D representations, eliminating the need for explicit 3D inputs.
Contribution
We propose Struct2D, a novel framework that leverages structured 2D perception inputs for spatial reasoning in MLLMs and create a large-scale dataset for instruction tuning.
Findings
MLLMs show strong spatial reasoning with structured 2D inputs
Fine-tuning on Struct2D-Set improves performance on spatial tasks
Structured 2D inputs bridge perception and language reasoning effectively
Abstract
Unlocking spatial reasoning in Multimodal Large Language Models (MLLMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can MLLMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source MLLMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
