Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

TL;DR
Spatial-MLLM is a framework that improves visual-based spatial reasoning from purely 2D inputs in multimodal large language models (MLLMs) by combining a dual-encoder architecture with a space-aware frame sampling strategy.
Contribution
The paper presents a new dual-encoder architecture and a space-aware sampling method, enabling improved spatial reasoning from 2D inputs without relying on additional 3D data.
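This summary does not say how the space-aware sampling works. As a purely illustrative sketch, one plausible reading is a greedy maximum-coverage selection: given, for each frame, the set of scene cells it observes, repeatedly keep the frame that adds the most unseen cells. The function name `greedy_coverage_sampling`, the `frame_voxels` input, and the greedy rule itself are assumptions made for this example, not details confirmed by the text above.

```python
from typing import Dict, List, Set


def greedy_coverage_sampling(frame_voxels: Dict[int, Set[int]], k: int) -> List[int]:
    """Pick up to k frame indices that together cover the most cells.

    Hypothetical helper: frame_voxels maps a frame index to the set of
    scene-cell ids visible in that frame (how those ids are obtained is
    outside this sketch).
    """
    covered: Set[int] = set()
    chosen: List[int] = []
    remaining = dict(frame_voxels)
    while remaining and len(chosen) < k:
        # Frame adding the most not-yet-covered cells; ties go to the lower index.
        best = max(sorted(remaining), key=lambda f: len(remaining[f] - covered))
        if not remaining[best] - covered:
            break  # no remaining frame adds new coverage
        covered |= remaining.pop(best)
        chosen.append(best)
    return sorted(chosen)


frames = {0: {1, 2, 3}, 1: {2, 3}, 2: {4, 5}, 3: {1, 4}}
selected = greedy_coverage_sampling(frames, k=2)
```

With the toy input above, frame 0 is chosen first because it sees three cells, and frame 2 second because it adds two cells no earlier pick has seen. A uniform sampler would spend its budget without regard to how much of the scene each frame reveals.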
Findings
Achieves state-of-the-art performance on spatial understanding tasks.
Effectively integrates semantic and 3D structure features.
Demonstrates robustness across multiple real-world datasets.
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder, initialized from the backbone of the…
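The abstract is cut off before it explains how the two encoders' outputs are combined. A minimal sketch of one common pattern, assuming per-token concatenation followed by a linear projection, is shown below. The function `fuse_tokens`, the `weight` matrix, and the concatenate-then-project design are all assumptions made for illustration, not the method described in the paper.

```python
from typing import List

Vector = List[float]


def fuse_tokens(semantic: List[Vector], spatial: List[Vector],
                weight: List[Vector]) -> List[Vector]:
    """Concatenate per-token features from both encoders, then project.

    Hypothetical fusion step: semantic[i] and spatial[i] describe the same
    visual token; weight has one row per output dimension and one column
    per concatenated input dimension.
    """
    if len(semantic) != len(spatial):
        raise ValueError("both encoders must produce the same number of tokens")
    fused: List[Vector] = []
    for sem, spa in zip(semantic, spatial):
        joint = sem + spa  # feature-wise concatenation of the two streams
        fused.append([sum(w * x for w, x in zip(row, joint)) for row in weight])
    return fused


semantic_feats = [[1.0, 2.0]]   # one token, 2-dim semantic feature (toy values)
spatial_feats = [[3.0]]         # same token, 1-dim structure feature (toy values)
projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
fused_tokens = fuse_tokens(semantic_feats, spatial_feats, projection)
```

The point of the sketch is only that each visual token handed to the language model can carry both semantic and 3D structure information, which matches the "integrates semantic and 3D structure features" finding listed earlier.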
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation
