SpatialPoint: Spatial-aware Point Prediction for Embodied Localization

Qiming Zhu; Zhirui Fang; Tianming Zhang; Chuanxiu Liu; Xiaoke Jiang; Lei Zhang

arXiv:2603.26690 · cs.RO · March 31, 2026

TL;DR

SpatialPoint is a novel spatial-aware vision-language framework that enhances embodied localization by integrating structured depth information, enabling robots to predict precise 3D points for interaction and navigation tasks.

Contribution

The paper introduces SpatialPoint, a new model that incorporates depth into vision-language systems for improved 3D spatial reasoning in embodied localization.

Findings

01  Incorporating structured depth input significantly improves localization accuracy.

02  The authors constructed a 2.6M RGB-D dataset for training and evaluation.

03  The approach was validated on real robots across grasping, placement, and navigation tasks.

Abstract

Embodied intelligence fundamentally requires a capability to determine where to act in 3D space. We formalize this requirement as embodied localization -- the problem of predicting executable 3D points conditioned on visual observations and language instructions. We instantiate embodied localization with two complementary target types: touchable points, surface-grounded 3D points enabling direct physical interaction, and air points, free-space 3D points specifying placement and navigation goals, directional constraints, or geometric relations. Embodied localization is inherently a problem of embodied 3D spatial reasoning -- yet most existing vision-language systems rely predominantly on RGB inputs, necessitating implicit geometric reconstruction that limits cross-scene generalization, despite the widespread adoption of RGB-D sensors in robotics. To address this gap, we propose…
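
To make the distinction between touchable points and air points concrete, the following is a minimal sketch, not the paper's method. It assumes a standard pinhole camera model and represents a predicted target as a pixel location plus a metric depth; the excerpt above does not state how SpatialPoint actually encodes its outputs. The names `Intrinsics`, `backproject`, and `classify_point`, the intrinsic values, and the 1 cm tolerance are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (illustrative values only)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> tuple:
    """Lift pixel (u, v) at metric depth depth_m to a camera-frame 3D point."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def classify_point(target_depth_m: float, surface_depth_m: float,
                   tol_m: float = 0.01) -> str:
    """Label a target by comparing its depth with the sensed surface depth.

    Assumption: a point whose depth matches the RGB-D reading at its pixel
    lies on a surface ("touchable"); any other point is in free space ("air").
    """
    if abs(target_depth_m - surface_depth_m) <= tol_m:
        return "touchable"
    return "air"


# Hypothetical camera and a single pixel whose sensed surface depth is 0.80 m.
k = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
surface_depth = 0.80

touch_xyz = backproject(380, 300, surface_depth, k)  # point on the surface
air_xyz = backproject(380, 300, 0.60, k)             # 20 cm in front of it

touch_label = classify_point(touch_xyz[2], surface_depth)
air_label = classify_point(air_xyz[2], surface_depth)
print(touch_label, touch_xyz)
print(air_label, air_xyz)
```

In this sketch, both targets share the same pixel and differ only in depth, which is why an RGB-only input cannot separate them without implicitly reconstructing geometry, whereas a depth channel makes the difference directly observable.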
