SpatialPoint: Spatial-aware Point Prediction for Embodied Localization
Qiming Zhu, Zhirui Fang, Tianming Zhang, Chuanxiu Liu, Xiaoke Jiang, Lei Zhang

TL;DR
SpatialPoint is a novel spatial-aware vision-language framework that enhances embodied localization by integrating structured depth information, enabling robots to predict precise 3D points for interaction and navigation tasks.
Contribution
The paper formalizes embodied localization, with touchable points and air points as its two target types, and introduces SpatialPoint, a model that incorporates depth into vision-language systems for improved 3D spatial reasoning.
Findings
Incorporating depth significantly improves localization accuracy.
The authors constructed a 2.6M RGB-D dataset for training and evaluation.
The model was validated on real robots across grasping, placement, and navigation tasks.
Abstract
Embodied intelligence fundamentally requires a capability to determine where to act in 3D space. We formalize this requirement as embodied localization -- the problem of predicting executable 3D points conditioned on visual observations and language instructions. We instantiate embodied localization with two complementary target types: touchable points, surface-grounded 3D points enabling direct physical interaction, and air points, free-space 3D points specifying placement and navigation goals, directional constraints, or geometric relations. Embodied localization is inherently a problem of embodied 3D spatial reasoning -- yet most existing vision-language systems rely predominantly on RGB inputs, necessitating implicit geometric reconstruction that limits cross-scene generalization, despite the widespread adoption of RGB-D sensors in robotics. To address this gap, we propose…
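To make the two target types in the abstract concrete, the sketch below lifts a pixel and a depth value into a 3D camera-frame point with the standard pinhole model, then labels the point as touchable or air. It is an illustration only and not the SpatialPoint method, which the truncated abstract does not describe. The names `Intrinsics`, `Point3D`, `backproject`, and `label_point`, the tolerance `tol_m`, and the rule "a point at the sensed surface depth is touchable, otherwise air" are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Literal, Tuple


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical container)."""
    fx: float
    fy: float
    cx: float
    cy: float


@dataclass(frozen=True)
class Point3D:
    """A 3D target in the camera frame, tagged with its assumed type."""
    kind: Literal["touchable", "air"]
    xyz: Tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> Tuple[float, float, float]:
    """Standard pinhole back-projection of pixel (u, v) at depth_m metres."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def label_point(u: float, v: float, target_depth_m: float,
                sensor_depth_m: float, k: Intrinsics,
                tol_m: float = 0.02) -> Point3D:
    """Assumed rule for illustration: a target lying on the sensed surface
    (within tol_m) is 'touchable'; any other depth is treated as 'air'."""
    on_surface = abs(target_depth_m - sensor_depth_m) <= tol_m
    kind = "touchable" if on_surface else "air"
    return Point3D(kind, backproject(u, v, target_depth_m, k))


if __name__ == "__main__":
    k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    # Target on the sensed surface 2.0 m away.
    print(label_point(420, 240, 2.0, 2.0, k))
    # Target hovering 0.5 m in front of that surface.
    print(label_point(420, 240, 1.5, 2.0, k))
```

The point here is only that an RGB-D sensor supplies the depth needed to turn an image location into a metric 3D point directly, where an RGB-only system would have to infer that geometry implicitly.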
