Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Runsen Xu; Weiyao Wang; Hao Tang; Xingyu Chen; Xiaodong Wang; Fu-Jen Chu; Dahua Lin; Matt Feiszli; Kevin J. Liang

arXiv:2505.17015 · cs.CV · May 23, 2025

Open Access

TL;DR

This paper introduces Multi-SpatialMLLM, a multi-modal large language model with multi-frame spatial understanding, built around the new MultiSPA dataset (more than 27 million samples) and an accompanying benchmark, and aimed at robotics and other real-world applications that require reasoning across frames.

Contribution

The paper presents a framework that integrates depth perception, visual correspondence, and dynamic perception into MLLMs, along with the MultiSPA dataset and a benchmark that tests a wide range of spatial tasks under uniform metrics.

Findings

01. Significant performance improvements over baselines.

02. Demonstrated scalability and generalization in multi-frame reasoning.

03. Emergent capabilities in complex scenarios.

Abstract

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Central to our approach is the MultiSPA dataset, a novel, large-scale collection of more than 27 million samples spanning diverse 3D and 4D scenes. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning. We further observe…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Domain Adaptation and Few-Shot Learning