RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

Enshen Zhou; Cheng Chi; Yibo Li; Jingkun An; Jiayuan Zhang; Shanyu Rong; Yi Han; Yuheng Ji; Mengzhen Liu; Pengwei Wang; Zhongyuan Wang; Lu Sheng; Shanghang Zhang

arXiv:2512.13660 · cs.RO · January 7, 2026

TL;DR

RoboTracer is a 3D-aware vision-language model for robotic spatial reasoning. It performs multi-step, metric-grounded reasoning and spatial measurement, is trained on the large-scale TraceSpatial dataset, and reports state-of-the-art benchmark performance.

Contribution

The paper introduces RoboTracer, a 3D-aware VLM with a universal spatial encoder and reinforcement fine-tuning, along with the TraceSpatial dataset and benchmark for complex spatial reasoning tasks.

Findings

01. RoboTracer achieves a 79.1% success rate on spatial understanding tasks.
02. It surpasses previous models by 36% in accuracy on TraceSpatial-Bench.
03. It can be integrated with control policies for real-world robotic tasks.

Abstract

Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop…

Code & Models

Datasets

JingkunAn/TraceSpatial-Bench
dataset · 588 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robotics and Sensor-Based Localization