SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models

Ruosen Zhao; Zhikang Zhang; Jialei Xu; Jiahao Chang; Dong Chen; Lingyun Li; Weijian Sun; Zizhuang Wei

arXiv:2511.23075 · cs.CV · December 5, 2025

Open Access

TL;DR

SpaceMind adds a camera-guided modality fusion module to a vision-language model, improving 3D spatial reasoning from RGB inputs alone and reporting state-of-the-art results on VSI-Bench, SQA3D, and SPBench.

Contribution

The paper presents a lightweight camera-guided fusion module, placed before the language model, that improves spatial reasoning in VLMs without relying on auxiliary 3D data.

Findings

01. Achieves new state-of-the-art on VSI-Bench, SQA3D, and SPBench.

02. Outperforms existing methods on spatial reasoning tasks.

03. Demonstrates the effectiveness of camera-guided fusion for spatial grounding.

Abstract

Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial…
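
The abstract is cut off before it finishes describing the fusion module, so the following is only a minimal sketch of what "camera-conditioned biasing" of spatial tokens could look like, not the authors' implementation. Every name here (`camera_conditioned_bias`, `fuse`, `proj`) is hypothetical, as are the additive form of the bias and the element-wise sum used to combine spatial tokens with 2D visual tokens.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(weight: Matrix, vec: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def camera_conditioned_bias(
    spatial_tokens: Matrix, camera_embedding: Vector, proj: Matrix
) -> Matrix:
    """Add one camera-derived bias vector to every spatial token.

    ASSUMPTION: the bias is a linear projection of the camera embedding
    and is applied additively. The source does not specify either choice.
    """
    bias = matvec(proj, camera_embedding)
    if any(len(token) != len(bias) for token in spatial_tokens):
        raise ValueError("token width must match the projected bias width")
    return [[t + b for t, b in zip(token, bias)] for token in spatial_tokens]


def fuse(visual_tokens: Matrix, biased_spatial_tokens: Matrix) -> Matrix:
    """Combine aligned 2D visual tokens and biased spatial tokens.

    ASSUMPTION: an element-wise sum stands in for the real fusion step.
    """
    if len(visual_tokens) != len(biased_spatial_tokens):
        raise ValueError("token sequences must have the same length")
    return [
        [v + s for v, s in zip(v_tok, s_tok)]
        for v_tok, s_tok in zip(visual_tokens, biased_spatial_tokens)
    ]


# Toy inputs: two tokens of width 2 and a camera embedding of width 3.
spatial = [[1.0, 2.0], [3.0, 4.0]]      # stand-in for spatial encoder output
visual = [[1.0, 1.0], [1.0, 1.0]]       # stand-in for 2D visual encoder output
camera = [1.0, 0.0, 2.0]                # stand-in for the camera representation
proj = [[0.5, 0.0, 0.25], [0.0, 1.0, -1.0]]  # hypothetical projection weights

biased = camera_conditioned_bias(spatial, camera, proj)
fused = fuse(visual, biased)
```

In this toy setup the camera embedding shifts every spatial token by the same vector before fusion, which is one simple way a camera representation could act as a guiding signal rather than passive metadata. In a real model the projection would be learned and the fused tokens would be passed on to the language model.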

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization