SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu

TL;DR
SpatialRGPT significantly improves vision language models' ability to understand and reason about 3D spatial arrangements by integrating regional representations and depth information, supported by a new benchmark for evaluation.
Contribution
Introduces SpatialRGPT with a data pipeline and plugin module for enhanced 3D spatial reasoning in VLMs, along with a new benchmark for evaluation.
Findings
Enhanced spatial reasoning performance in VLMs.
Strong generalization to complex spatial relations.
Effective as a dense reward annotator for robotics.
Abstract
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGBT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for…
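To make the idea of "relative directions and distances" between regions concrete, here is a minimal sketch of how such quantities could be derived from two region centroids and their depths. It is not the authors' pipeline: the pinhole back-projection, the function names `backproject` and `relate`, the camera intrinsics, and the example regions are all assumptions chosen for illustration.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def relate(p_a, p_b):
    """Describe point A relative to point B in the camera frame (x right, y down, z forward)."""
    dx, dy, dz = (a - b for a, b in zip(p_a, p_b))
    return {
        "distance": math.sqrt(dx * dx + dy * dy + dz * dz),
        "horizontal": "right" if dx > 0 else "left",
        "depth_order": "behind" if dz > 0 else "in front",
    }

# Hypothetical intrinsics and region centroids (pixel coordinates, depth in metres).
fx = fy = 500.0
cx, cy = 320.0, 240.0
chair = backproject(220.0, 240.0, 2.0, fx, fy, cx, cy)
table = backproject(420.0, 240.0, 3.0, fx, fy, cx, cy)

# Relation of the chair with respect to the table.
relation = relate(chair, table)
print(relation)
```

In this toy setup the answer depends entirely on the assumed depth values, which is one way to see why depth information matters for spatial questions that a 2D image alone leaves ambiguous.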
Taxonomy
Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Constraint Satisfaction and Optimization
Methods: Attention Is All You Need · Cosine Annealing · Discriminative Fine-Tuning · Softmax · Layer Normalization · Weight Decay · Attention Dropout · Linear Layer · Linear Warmup With Cosine Annealing
