RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Enshen Zhou; Jingkun An; Cheng Chi; Yi Han; Shanyu Rong; Chi Zhang; Pengwei Wang; Zhongyuan Wang; Tiejun Huang; Lu Sheng; Shanghang Zhang

arXiv:2506.04308·cs.RO·January 6, 2026

Open Access · 10 Models · 3 Datasets

TL;DR

RoboRefer is a novel vision-language model designed for robotics that enhances spatial understanding and multi-step reasoning in 3D environments, enabling robots to interact more accurately with complex scenes.

Contribution

The paper introduces RoboRefer, integrating a depth encoder and reinforcement fine-tuning, along with a large-scale dataset and benchmark for improved spatial referring and reasoning in robotics.

Findings

01. Achieves 89.6% success rate in spatial understanding.

02. Outperforms baselines by 17.4% on RefSpatial-Bench.

03. Enables robots to perform long-horizon, dynamic tasks.

Abstract

Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and…
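The abstract mentions metric-sensitive reward functions for reinforcement fine-tuning but does not define them in the text available here. As a rough illustration only, the sketch below shows one way a distance-sensitive reward for a predicted 2D point could look. The function name `point_reward`, the normalization by the image diagonal, and the linear falloff are all assumptions made for this example, not the paper's actual formulation.

```python
import math


def point_reward(pred, target, image_size):
    """Hypothetical distance-sensitive reward in [0, 1].

    pred, target: (x, y) pixel coordinates.
    image_size: (width, height) in pixels.
    The reward is 1.0 for an exact hit and falls off linearly with the
    Euclidean distance, normalized by the image diagonal (an assumption).
    """
    width, height = image_size
    diagonal = math.hypot(width, height)
    distance = math.hypot(pred[0] - target[0], pred[1] - target[1])
    return max(0.0, 1.0 - distance / diagonal)


# A prediction half an image diagonal away from the target.
reward = point_reward((0, 0), (300, 400), (600, 800))
```

A graded signal like this gives partial credit to predictions that land close to the target, unlike a binary hit-or-miss score; whether RoboRefer's rewards take this form is not stated in the excerpt above.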

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning

Methods: Shrink and Fine-Tune