SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
Bo Gu, Zhikang Zhang, Zizhuang Wei, Zhenyuan Chen, Lingyun Li, Zhuoyi Song

TL;DR
SpaceMind++ introduces an allocentric 3D cognitive map for video multimodal models, enabling spatially consistent reasoning and improved generalization in 3D environments.
Contribution
It proposes a novel voxelized cognitive map architecture and a coordinate-guided fusion mechanism to integrate 3D spatial knowledge into pretrained video MLLMs.
Findings
Achieves state-of-the-art results on VSI-Bench.
Demonstrates superior out-of-distribution generalization on multiple benchmarks.
Preserves object permanence and spatial topology across viewpoints.
Abstract
Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new…
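To make the egocentric-to-allocentric step concrete, here is a minimal sketch of how per-frame observations could be pooled into one world-centered voxel grid. It is not the authors' implementation: the abstract does not say how 3D points or camera poses are obtained, so the sketch assumes both are already available for every frame. The function name `voxelize_egocentric` and all of its parameters are hypothetical, and the grid stores simple hit counts rather than learned features.

```python
import numpy as np

def voxelize_egocentric(points_cam, cam_to_world, origin, voxel_size, grid_shape):
    """Accumulate camera-frame points from many frames into one world-frame grid.

    points_cam:   list of (N_i, 3) arrays, points in each frame's camera coordinates
    cam_to_world: list of (4, 4) homogeneous camera-to-world poses (assumed known)
    origin:       (3,) world coordinate of the grid's minimum corner
    voxel_size:   voxel edge length, in the same metric units as the points
    grid_shape:   (X, Y, Z) number of voxels along each axis
    """
    grid = np.zeros(grid_shape, dtype=np.int32)
    shape = np.asarray(grid_shape)
    origin = np.asarray(origin, dtype=float)
    for pts, pose in zip(points_cam, cam_to_world):
        # Egocentric -> allocentric: rotate, then translate into the world frame.
        world = pts @ pose[:3, :3].T + pose[:3, 3]
        idx = np.floor((world - origin) / voxel_size).astype(int)
        # Discard points that fall outside the grid bounds.
        inside = np.all((idx >= 0) & (idx < shape), axis=1)
        idx = idx[inside]
        # np.add.at accumulates correctly when several points hit the same voxel.
        np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return grid

# Two views of the same physical point: one from the world origin, one from a
# camera placed at (0.5, 0, 4) and rotated 180 degrees about the vertical axis.
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[:3, :3] = np.diag([-1.0, 1.0, -1.0])
pose_b[:3, 3] = [0.5, 0.0, 4.0]

points = [np.array([[0.5, 0.5, 2.0]]), np.array([[0.0, 0.5, 2.0]])]
grid = voxelize_egocentric(points, [pose_a, pose_b],
                           origin=(0.0, 0.0, 0.0), voxel_size=1.0,
                           grid_shape=(4, 4, 4))
print(grid[0, 0, 2])  # 2: both observations land in the same voxel
```

The two observations have different camera-frame coordinates, yet they index the same voxel once expressed in the shared frame. This is the property that lets a world-centered map keep an object at a fixed location as the viewpoint changes.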
