RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang

TL;DR
RoboRefer is a novel vision-language model designed for robotics that enhances spatial understanding and multi-step reasoning in 3D environments, enabling robots to interact more accurately with complex scenes.
Contribution
The paper introduces RoboRefer, which integrates a depth encoder and reinforcement fine-tuning, and contributes a large-scale dataset and a benchmark for spatial referring and reasoning in robotics.
Findings
Achieves 89.6% success rate in spatial understanding.
Outperforms baselines by 17.4% on RefSpatial-Bench.
Enables robots to perform long-horizon, dynamic tasks.
Abstract
Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with the powerful pretrained vision language models (VLMs), recent approaches are still not qualified to accurately understand the complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and…
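As a rough illustration of what a "metric-sensitive" reward could look like for a point-based referring task, the sketch below scores a predicted 2D point by its distance to a target point. This is not the reward used in the paper, which the abstract does not specify. The function name `point_reward`, the normalized image coordinates, and the `scale` parameter are all assumptions made for illustration.

```python
import math


def point_reward(pred, target, scale=0.1):
    """Hypothetical distance-based reward (an assumption, not the paper's definition).

    pred, target: (x, y) points in normalized image coordinates (assumed range [0, 1]).
    scale: assumed decay length; a smaller value penalizes misses more sharply.
    Returns 1.0 for an exact hit and decays smoothly toward 0 as the distance grows.
    """
    dist = math.hypot(pred[0] - target[0], pred[1] - target[1])
    return math.exp(-dist / scale)


# A closer prediction earns a higher reward than a farther one.
near = point_reward((0.52, 0.50), (0.50, 0.50))
far = point_reward((0.80, 0.50), (0.50, 0.50))
print(near > far)
```

Unlike a binary hit-or-miss signal, a graded reward of this kind gives the policy a learning signal even when a prediction misses, which is one plausible reading of "metric-sensitive".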
Code & Models
- 🤗 BAAI/RoboBrain2.0-7B · model · 380 dl · ♡ 121
- 🤗 BAAI/RoboBrain2.0-32B · model · 36 dl · ♡ 43
- 🤗 Mungert/RoboBrain2.0-7B-GGUF · model · 35 dl · ♡ 3
- 🤗 Zhoues/RoboRefer-2B-SFT · model · 90 dl · ♡ 8
- 🤗 Zhoues/RoboRefer-2B-Depth-Align · model · 6 dl · ♡ 2
- 🤗 Zhoues/NVILA-2B-Depth · model · 2 dl · ♡ 2
- 🤗 Zhoues/NVILA-8B-Depth · model · 6 dl · ♡ 1
- 🤗 BAAI/RoboBrain2.0-7B-FP8 · model · ♡ 1
- 🤗 BAAI/RoboBrain2.0-7B-W8A16 · model · 2 dl · ♡ 3
- 🤗 BAAI/RoboBrain2.0-3B · model · 157 dl · ♡ 11
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning
Methods: Shrink and Fine-Tune
