Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps
Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan, Eduardo Pérez-Pellitero, Youngkyoon Jang

TL;DR
Map2Thought introduces an interpretable 3D spatial reasoning framework for vision-language models, combining metric-based spatial representations with explicit geometric reasoning to improve explainability and performance with less supervision.
Contribution
The paper presents a novel framework integrating Metric Cognitive Maps and Chain-of-Thought reasoning for explicit 3D spatial understanding in vision-language models.
Findings
Achieves 59.9% accuracy using only half the supervision, close to the 60.9% full-supervision baseline.
Outperforms state-of-the-art methods by over 4% on VSI-Bench.
Enables explainable 3D reasoning with improved accuracy across training data subsets.
Abstract
We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the…
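To make the kind of deterministic operations mentioned in the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Box` type, the 0.5 m grid resolution, and the function names are all hypothetical, and objects are assumed to be axis-aligned boxes given in metres. It only illustrates how a discrete grid cell, a centre-to-centre vector, and a bounding-box distance might be computed from one metric map entry.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical map entry: an axis-aligned box in metres."""
    name: str
    center: tuple  # (x, y, z)
    size: tuple    # extent along x, y, z


def to_grid_cell(box, cell_size=0.5):
    """Discretise the box centre onto a top-down grid (assumed resolution)."""
    return (math.floor(box.center[0] / cell_size),
            math.floor(box.center[1] / cell_size))


def centre_offset(a, b):
    """Vector from the centre of a to the centre of b."""
    return tuple(b.center[i] - a.center[i] for i in range(3))


def box_distance(a, b):
    """Minimum distance between two axis-aligned boxes; 0.0 if they overlap."""
    gaps = []
    for i in range(3):
        # Per-axis gap between the two extents, clamped at zero on overlap.
        gap = abs(a.center[i] - b.center[i]) - (a.size[i] + b.size[i]) / 2
        gaps.append(max(gap, 0.0))
    return math.sqrt(sum(g * g for g in gaps))


# Made-up example objects, purely for illustration.
sofa = Box("sofa", (1.0, 1.0, 0.5), (2.0, 1.0, 1.0))
lamp = Box("lamp", (4.0, 1.0, 0.75), (0.5, 0.5, 1.5))

print(to_grid_cell(sofa), to_grid_cell(lamp))
print(centre_offset(sofa, lamp))
print(box_distance(sofa, lamp))
```

Because every step is a closed-form computation on stored coordinates, each intermediate value can be written out as part of a reasoning trace, which is the sense in which such operations are interpretable.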
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Constraint Satisfaction and Optimization
