SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Wufei Ma, Luoxin Ye, Celso M de Melo, Jieneng Chen, Alan Yuille

TL;DR
SpatialLLM introduces a novel large multimodal model with enhanced 3D spatial reasoning by developing specialized datasets and integrating architectural innovations, surpassing GPT-4o in 3D reasoning performance.
Contribution
The paper curates what the authors describe as the first VQA data incorporating 3D orientation relationships, and systematically studies how 3D-informed data, architecture, and training setups affect spatial reasoning in large multimodal models.
Findings
SpatialLLM surpasses GPT-4o by 8.7% in 3D reasoning tasks.
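Two types of 3D-informed training datasets are developed: probing data on objects' 3D location and orientation, and conversation data on complex spatial relationships.
A systematic analysis of how 3D-informed data, architecture, and training setups affect 3D reasoning.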
Abstract
Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability for 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing SpatialLLM, a large multimodal model with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on objects' 3D location and orientation, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporate 3D orientation relationships on real…
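To make the idea of "3D-informed probing data" concrete, here is a minimal sketch of how question-answer pairs about 3D location and orientation might be derived from object annotations. It is not the authors' pipeline: the `Object3D` fields, the camera-frame convention, the 30-degree tolerance, and the question templates are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Object3D:
    """Hypothetical annotation: position in the camera frame plus a heading."""
    name: str
    x: float        # metres to the right of the camera (assumed convention)
    y: float        # metres below the camera (assumed convention)
    z: float        # metres in front of the camera (assumed convention)
    yaw_deg: float  # heading about the vertical axis, in degrees


def distance_from_camera(obj: Object3D) -> float:
    """Euclidean distance from the camera origin to the object centre."""
    return math.sqrt(obj.x ** 2 + obj.y ** 2 + obj.z ** 2)


def heading_difference(a: Object3D, b: Object3D) -> float:
    """Smallest absolute angle between the two headings, in [0, 180]."""
    diff = abs(a.yaw_deg - b.yaw_deg) % 360.0
    return min(diff, 360.0 - diff)


def orientation_relation(a: Object3D, b: Object3D, tol_deg: float = 30.0) -> str:
    """Bucket the heading difference into a coarse relation (tolerance is assumed)."""
    angle = heading_difference(a, b)
    if angle <= tol_deg:
        return "the same direction"
    if angle >= 180.0 - tol_deg:
        return "opposite directions"
    return "different directions"


def make_probing_qa(a: Object3D, b: Object3D) -> list[dict]:
    """Build one location question and one orientation question for a pair."""
    closer = a if distance_from_camera(a) < distance_from_camera(b) else b
    return [
        {
            "question": f"Which is closer to the camera, the {a.name} or the {b.name}?",
            "answer": f"The {closer.name}.",
        },
        {
            "question": f"Are the {a.name} and the {b.name} facing the same direction?",
            "answer": f"They are facing {orientation_relation(a, b)}.",
        },
    ]


if __name__ == "__main__":
    car = Object3D("car", x=-2.0, y=0.0, z=10.0, yaw_deg=0.0)
    truck = Object3D("truck", x=3.0, y=0.0, z=25.0, yaw_deg=175.0)
    for qa in make_probing_qa(car, truck):
        print(qa["question"], "->", qa["answer"])
```

The point of the sketch is that both answers depend on 3D quantities (depth and heading) rather than on 2D image positions, which is the kind of supervision the abstract says current 2D-biased training data lacks.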
Taxonomy
Topics: Modular Robots and Swarm Intelligence
