RoboTransfer: Controllable Geometry-Consistent Video Diffusion for Manipulation Policy Transfer
Liu Liu, Xiaofeng Wang, Guosheng Zhao, Keyu Li, Wenkang Qin, Jiagang Zhu, Jiaxiong Qiu, Zheng Zhu, Guan Huang, Zhizhong Su

TL;DR
RoboTransfer is a diffusion-based video generation framework that synthesizes multi-view, geometrically consistent robot manipulation videos with fine-grained control over scene elements. The synthesized data is intended for training manipulation policies and improving their generalization to diverse environments.
Contribution
RoboTransfer introduces a diffusion-based video generation model that combines globally consistent 3D geometry with cross-view feature interactions to synthesize controllable, multi-view-consistent robotic videos in support of policy transfer.
Findings
Generated videos show superior multi-view geometric consistency and visual fidelity.
Policies trained on RoboTransfer data generalize better to unseen scenarios.
The method enables fine-grained control over scene elements, such as background editing and object replacement.
Abstract
The goal of general-purpose robotics is to create agents that can seamlessly adapt to and operate in diverse, unstructured human environments. Imitation learning has become a key paradigm for robotic manipulation, yet collecting large-scale and diverse demonstrations is prohibitively expensive. Simulators provide a cost-effective alternative, but the sim-to-real gap remains a major obstacle to scalability. We present RoboTransfer, a diffusion-based video generation framework for synthesizing robotic data. By leveraging cross-view feature interactions and globally consistent 3D geometry, RoboTransfer ensures multi-view geometric consistency while enabling fine-grained control over scene elements, such as background editing and object replacement. Extensive experiments demonstrate that RoboTransfer produces videos with superior geometric consistency and visual fidelity. Furthermore,…
