RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization

Songming Liu; Bangguo Li; Kai Ma; Lingxuan Wu; Hengkai Tan; Xiao Ouyang; Hang Su; Jun Zhu

arXiv:2602.03310·cs.RO·February 4, 2026


Open Access · 3 Models · 1 Dataset

TL;DR

RDT2 is a robotic foundation model built on a 7B-parameter VLM that targets zero-shot deployment on novel embodiments for open-vocabulary tasks. It is trained with a three-stage recipe on over 10,000 hours of open-source demonstrations collected with an enhanced Universal Manipulation Interface (UMI).

Contribution

The paper introduces RDT2, a robotic foundation model built on a 7B-parameter vision-language model (VLM) and trained on over 10,000 hours of UMI-collected demonstrations, enabling zero-shot cross-embodiment generalization.

Findings

01

RDT2 zero-shot generalizes to unseen objects, scenes, and instructions.

02

RDT2 outperforms state-of-the-art baselines on dexterous and dynamic tasks.

03

Training follows a novel three-stage recipe that combines Residual Vector Quantization (RVQ), flow-matching, and distillation.

Abstract

Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B-parameter VLM designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets (over 10,000 hours of demonstrations in diverse families) using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero-shot…
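To make the RVQ step named in the abstract more concrete, here is a minimal sketch of generic residual vector quantization in plain Python. It is not the authors' implementation: the toy codebooks, the 2-D "action" vector, and the function names `nearest`, `rvq_encode`, and `rvq_decode` are all assumptions chosen for illustration. The idea it shows is only the general one: each stage quantizes what the previous stages failed to capture, so a continuous vector becomes a short sequence of discrete tokens.

```python
def nearest(codebook, v):
    """Return the index of the code vector closest to v (squared Euclidean distance)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the residual left by the previous ones."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # Subtract the chosen code so the next stage only sees the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct a vector by summing the selected code from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical toy setup: three stages of four 2-D codes each, at shrinking scales.
directions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
codebooks = [[(s * dx, s * dy) for dx, dy in directions] for s in (1.0, 0.5, 0.25)]

action = (0.9, -0.3)                      # stand-in for a continuous control vector
tokens = rvq_encode(action, codebooks)    # one discrete token per stage
reconstruction = rvq_decode(tokens, codebooks)
print(tokens, reconstruction)
```

In a learned system the codebooks would be trained rather than hand-picked, and the vectors would be much higher-dimensional; the coarse-to-fine structure of the token sequence is the part this sketch is meant to convey.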


Code & Models


Datasets

robotics-diffusion-transformer/BimanualUR5eExample (dataset, 272 downloads)


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning