SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen; Zhuo Xu; Sean Kirmani; Brian Ichter; Danny Driess; Pete Florence; Dorsa Sadigh; Leonidas Guibas; Fei Xia

arXiv:2401.12168·cs.CV·January 23, 2024·5 cites

TL;DR

SpatialVLM trains vision-language models on large-scale 3D spatial reasoning data, improving their ability to understand and reason about spatial relationships in real-world scenes.

Contribution

This work is the first to build an Internet-scale 3D spatial reasoning dataset and use it to train VLMs, strengthening their spatial reasoning in VQA and robotics.

Findings

01 Enhanced spatial reasoning in VQA tasks.

02 Ability to perform quantitative spatial estimations.

03 Improved downstream applications in robotics.

Abstract

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLMs) have demonstrated remarkable performance on certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships of physical objects like distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in training data and aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality,…
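
To make the idea of automatically generated quantitative spatial VQA data concrete, the following is a minimal sketch and not the authors' pipeline. It assumes that each image has already been reduced to a set of object labels with 3D centers in meters; the `scene` dictionary, the `make_distance_qa` helper, and the question template are all hypothetical and chosen only for illustration. For scale, the figures in the abstract (2 billion examples on 10 million images) work out to roughly 200 question-answer pairs per image on average.

```python
import math
from itertools import combinations


def make_distance_qa(objects):
    """Build one templated distance question per unordered object pair.

    `objects` maps a label to an (x, y, z) center in meters. This input
    format is an assumption for this sketch, not something the paper states.
    """
    qa_pairs = []
    # Sort by label so the output order is deterministic.
    for (name_a, pos_a), (name_b, pos_b) in combinations(sorted(objects.items()), 2):
        distance = math.dist(pos_a, pos_b)  # Euclidean distance in meters
        question = f"How far apart are the {name_a} and the {name_b}?"
        answer = f"About {distance:.2f} meters."
        qa_pairs.append((question, answer))
    return qa_pairs


# Hypothetical scene with two objects in a shared metric coordinate frame.
scene = {"mug": (0.0, 0.0, 1.0), "laptop": (0.3, 0.0, 1.4)}
pairs = make_distance_qa(scene)
for question, answer in pairs:
    print(question, answer)
```

Because every pair of objects yields a question, the number of examples grows quadratically with the number of objects per image, which is one plausible way a templated generator could reach a large example count from a smaller image set.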

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Geographic Information Systems Studies