RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, Stan Birchfield

TL;DR
RoboSpatial introduces a large-scale dataset with rich spatial annotations from real indoor scenes to enhance the spatial reasoning capabilities of vision-language models in robotics, addressing current limitations in understanding reference frames.
Contribution
The paper presents RoboSpatial, a comprehensive dataset combining 2D images and 3D scans with detailed spatial annotations to improve robotic spatial understanding in vision-language models.
Findings
Models trained on RoboSpatial outperform baselines in spatial reasoning tasks.
The dataset enables better understanding of reference frames in robotic perception.
Enhanced spatial reasoning improves robot manipulation capabilities.
Abstract
Spatial understanding is a crucial capability that enables robots to perceive their surroundings, reason about their environment, and interact with it meaningfully. In modern robotics, these capabilities are increasingly provided by vision-language models. However, these models face significant challenges in spatial reasoning tasks, as their training data are based on general-purpose image datasets that often lack sophisticated spatial understanding. For example, datasets frequently do not capture reference frame comprehension, yet effective spatial reasoning requires understanding whether to reason from ego-, world-, or object-centric perspectives. To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial…
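The reference-frame point in the abstract can be made concrete with a small sketch. This is an illustration only, not the paper's annotation procedure: it assumes a simplified top-down 2D scene, and the object positions, headings, and the helper names `left_axis` and `is_left_of` are all hypothetical. It shows that the same question ("is the cup left of the chair?") can receive different answers depending on whether "left" is judged from the camera (ego), from an assumed fixed world direction, or from the chair's own facing direction (object-centric).

```python
import math

def left_axis(heading_deg):
    """Unit vector pointing to the left of a viewer facing `heading_deg`
    (angle measured counter-clockwise from the +x axis, top-down view)."""
    t = math.radians(heading_deg)
    return (-math.sin(t), math.cos(t))

def is_left_of(target, reference, heading_deg):
    """True if `target` lies to the left of `reference` when 'left' is
    defined by a frame facing `heading_deg`."""
    lx, ly = left_axis(heading_deg)
    dx, dy = target[0] - reference[0], target[1] - reference[1]
    # Project the offset onto the frame's left axis; positive means "left".
    return dx * lx + dy * ly > 0

# Hypothetical scene (all values are assumptions for illustration).
chair = (0.0, 0.0)
cup = (0.5, 1.0)

frame_headings = {
    "ego": 0.0,       # camera looks along +x
    "world": 90.0,    # assumed fixed world "forward" along +y
    "object": 180.0,  # the chair itself faces -x
}

answers = {name: is_left_of(cup, chair, heading)
           for name, heading in frame_headings.items()}
print(answers)  # {'ego': True, 'world': False, 'object': False}
```

The disagreement between the three answers is the ambiguity a model must resolve before it can answer a spatial question reliably.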
Taxonomy
Topics: Constraint Satisfaction and Optimization · Semantic Web and Ontologies · Robotics and Sensor-Based Localization
