SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

Bo Gu; Zhikang Zhang; Zizhuang Wei; Zhenyuan Chen; Lingyun Li; Zhuoyi Song

arXiv:2605.09449 · cs.CV · May 12, 2026

TL;DR

SpaceMind++ introduces an allocentric 3D cognitive map for video multimodal models, enabling spatially consistent reasoning and improved generalization in 3D environments.

Contribution

It proposes a novel voxelized cognitive map architecture and a coordinate-guided fusion mechanism to integrate 3D spatial knowledge into pretrained video MLLMs.

Findings

01 Achieves state-of-the-art results on VSI-Bench.

02 Demonstrates superior out-of-distribution generalization on multiple benchmarks.

03 Preserves object permanence and spatial topology across viewpoints.

Abstract

Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new…
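To make the idea of reorganizing egocentric observations into a shared 3D metric map concrete, here is a minimal NumPy sketch. It is not the SpaceMind++ implementation: it assumes per-frame 3D points and camera-to-world poses are already available (the abstract does not say how they are obtained), and the function names, mean-pooling rule, grid origin, and voxel size are all hypothetical choices made for illustration.

```python
import numpy as np

def egocentric_to_world(points_cam, cam_to_world):
    """Map (N, 3) camera-frame points to the world frame with a 4x4 pose (assumed given)."""
    ones = np.ones((points_cam.shape[0], 1))
    homo = np.concatenate([points_cam, ones], axis=1)   # (N, 4) homogeneous coordinates
    return (homo @ cam_to_world.T)[:, :3]

def voxelize(points_world, feats, origin, voxel_size, grid_shape):
    """Mean-pool per-point features into a dense voxel grid (hypothetical pooling rule)."""
    idx = np.floor((points_world - origin) / voxel_size).astype(np.int64)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, feats = idx[inside], feats[inside]              # drop points outside the grid
    flat = np.ravel_multi_index(tuple(idx.T), grid_shape)
    n_vox = int(np.prod(grid_shape))
    sums = np.zeros((n_vox, feats.shape[1]))
    counts = np.zeros(n_vox)
    np.add.at(sums, flat, feats)                         # unbuffered scatter-add
    np.add.at(counts, flat, 1)
    occupied = counts > 0
    sums[occupied] /= counts[occupied, None]             # mean over points per voxel
    return sums.reshape(*grid_shape, -1), counts.reshape(grid_shape)

# Two frames observe the same world point from different camera positions.
pose_a = np.eye(4)                                       # camera at the world origin
pose_b = np.eye(4)
pose_b[0, 3] = 1.0                                       # camera shifted 1 m along x
pts_a = egocentric_to_world(np.array([[1.0, 0.0, 2.0]]), pose_a)
pts_b = egocentric_to_world(np.array([[0.0, 0.0, 2.0]]), pose_b)

points = np.concatenate([pts_a, pts_b], axis=0)
features = np.array([[1.0, 0.0], [0.0, 1.0]])            # toy per-point features
grid, counts = voxelize(points, features,
                        origin=np.zeros(3), voxel_size=0.5, grid_shape=(4, 4, 8))
```

In this toy case both observations land in the same voxel, so the map holds one entry for the point regardless of which viewpoint saw it. That is the property an allocentric representation is meant to provide; how the actual model builds and fuses its map is described in the paper, not here.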
