Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu; Fangfu Liu; Yi-Hsin Hung; Yueqi Duan

arXiv:2505.23747 · cs.CV · May 20, 2026

TL;DR

Spatial-MLLM is a framework that improves the spatial reasoning of multimodal large language models from purely 2D visual inputs, using a dual-encoder architecture and a space-aware frame sampling strategy.

Contribution

The paper presents a dual-encoder architecture and a space-aware frame sampling method that improve spatial reasoning from 2D inputs such as images or videos, without relying on additional 3D or 2.5D data.

Findings

01 Achieves state-of-the-art performance on spatial understanding tasks.

02 Effectively integrates semantic and 3D structure features.

03 Demonstrates robustness across multiple real-world datasets.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder, initialized from the backbone of the…
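
As a rough illustration of the dual-encoder idea, the sketch below combines per-token features from two branches into one token sequence. It is a toy under stated assumptions, not the paper's implementation: the helper names `linear` and `fuse_tokens`, the feature sizes, the weight matrix, and the choice of concatenation followed by a linear projection are all hypothetical, since the excerpt above does not specify how the two feature streams are merged.

```python
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Apply a bias-free linear projection: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def fuse_tokens(
    semantic: List[Vector], spatial: List[Vector], weight: List[Vector]
) -> List[Vector]:
    """Concatenate the two feature vectors of each token, then project.

    Assumption: both branches yield the same number of tokens.
    Concatenation plus a linear map is an illustrative choice only.
    """
    if len(semantic) != len(spatial):
        raise ValueError("branches must produce the same number of tokens")
    return [linear(s + g, weight) for s, g in zip(semantic, spatial)]


# Toy inputs: 2 tokens, 2-d "semantic" features, 1-d "spatial" features.
semantic_feats = [[1.0, 2.0], [3.0, 4.0]]
spatial_feats = [[0.5], [1.5]]

# Hypothetical projection from 3 concatenated dims down to 2 output dims.
W = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, -1.0],
]

fused = fuse_tokens(semantic_feats, spatial_feats, W)
print(fused)  # [[1.5, 1.5], [4.5, 2.5]]
```

In a real model the two feature lists would come from the pretrained 2D visual encoder and the spatial encoder, and the projection weights would be learned; here they are hard-coded so the arithmetic can be checked by hand.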

Code & Models

Datasets

Diankun/Spatial-MLLM-Data (dataset, 106 downloads)

Videos

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation