SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

An-Chieh Cheng; Hongxu Yin; Yang Fu; Qiushan Guo; Ruihan Yang; Jan Kautz; Xiaolong Wang; Sifei Liu

arXiv:2406.01584·cs.CV·October 16, 2024·1 cites

TL;DR

SpatialRGPT significantly improves vision language models' ability to understand and reason about 3D spatial arrangements by integrating regional representations and depth information, supported by a new benchmark for evaluation.

Contribution

Introduces SpatialRGPT with a data pipeline and plugin module for enhanced 3D spatial reasoning in VLMs, along with a new benchmark for evaluation.

Findings

01. Enhanced spatial reasoning performance in VLMs.
02. Strong generalization to complex spatial relations.
03. Effective as a dense reward annotator for robotics.

Abstract

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for…

Code & Models

Repositories

syscv/sam-hq
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Constraint Satisfaction and Optimization

Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Softmax · Layer Normalization · Weight Decay · Attention Dropout · Linear Layer · Linear Warmup With Cosine Annealing