SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia

TL;DR
SpatialVLM trains vision-language models on large-scale 3D spatial reasoning data, improving their ability to understand and reason about spatial relationships in real-world scenes.
Contribution
This work is the first to develop and utilize an internet-scale 3D spatial reasoning dataset for training VLMs, enhancing their spatial reasoning capabilities in VQA and robotics.
Findings
Enhanced spatial reasoning in VQA tasks.
Ability to perform quantitative spatial estimations.
Improved downstream applications in robotics.
Abstract
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships of physical objects like distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in training data and aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality,…
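To make the idea of a quantitative spatial VQA example concrete, here is a minimal Python sketch of how a question-answer pair about metric distance could be derived from two objects with known 3D positions. It is only an illustration under stated assumptions, not the paper's actual data generation pipeline: the `Object3D` record, the `make_distance_qa` helper, the question and answer templates, and the use of object centroids in meters are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Object3D:
    """Hypothetical record: an object label plus its 3D centroid in meters."""
    name: str
    center: Tuple[float, float, float]  # (x, y, z) in a shared metric frame


def make_distance_qa(a: Object3D, b: Object3D) -> Dict[str, str]:
    """Build one templated question-answer pair about the distance between a and b.

    Assumption: both centroids are already expressed in the same metric
    coordinate frame, so plain Euclidean distance is meaningful.
    """
    distance_m = math.dist(a.center, b.center)
    return {
        "question": f"How far is the {a.name} from the {b.name}?",
        "answer": f"The {a.name} is roughly {distance_m:.2f} meters from the {b.name}.",
    }


if __name__ == "__main__":
    mug = Object3D("mug", (0.0, 0.0, 1.0))
    laptop = Object3D("laptop", (3.0, 4.0, 1.0))
    print(make_distance_qa(mug, laptop))
```

Applying templates like this to many object pairs per image is one plausible way a dataset could reach a very large number of examples from a smaller set of images, though the templates and 3D estimation steps used in the paper are not described in this summary.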
Code & Models
- 🤗 remyxai/SpaceLLaVA (model) · 247 dl · ♡ 26
- 🤗 remyxai/SpaceLLaVA-lite (model) · 21 dl · ♡ 3
- 🤗 remyxai/SpaceLlama3.1 (model) · 10 dl · ♡ 8
- 🤗 remyxai/SpaceLlama3.1-hf (model) · 6 dl · ♡ 2
- 🤗 remyxai/SpaceMantis (model) · 65 dl · ♡ 1
- 🤗 remyxai/SpaceFlorence-2 (model) · 4 dl · ♡ 2
- 🤗 remyxai/SpaceMinitron-4B (model) · 4 dl · ♡ 1
- 🤗 remyxai/SpaceQwen2.5-VL-3B-Instruct (model) · 823 dl · ♡ 18
- 🤗 remyxai/SpaceThinker-Qwen2.5VL-3B (model) · 638 dl · ♡ 27
- 🤗 remyxai/SpaceOm (model) · 464 dl · ♡ 12
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Geographic Information Systems Studies
