SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, Zizhuang Wei

TL;DR
SpaceMind introduces a camera-guided modality fusion approach in vision-language models, significantly improving 3D spatial reasoning capabilities using only RGB inputs, and achieves state-of-the-art results on multiple benchmarks.
Contribution
The paper presents a novel camera-guided fusion module that enhances spatial reasoning in VLMs without relying on auxiliary 3D data, advancing multimodal understanding.
Findings
Achieves new state-of-the-art on VSI-Bench, SQA3D, and SPBench.
Outperforms existing methods on spatial reasoning tasks.
Demonstrates the effectiveness of camera-guided fusion for spatial grounding.
Abstract
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
