TL;DR
CRAFT introduces a video diffusion framework that generates diverse, realistic bimanual robot demonstrations from limited real data, enhancing policy robustness and generalization in manipulation tasks.
Contribution
CRAFT conditions a pre-trained video diffusion model on edge-based structural cues extracted from simulator-generated trajectories, producing scalable, diverse, and physically plausible robot demonstration videos together with action labels.
Findings
CRAFT improves success rates over existing augmentation methods.
It enables large-scale, diverse demonstration generation from few real examples.
The approach enhances generalization in both simulated and real bimanual tasks.
Abstract
Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert…
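The edge-conditioning idea in the abstract can be illustrated with a small sketch that turns a rendered clip into per-frame edge maps, the kind of structural signal a video diffusion model could be conditioned on. This is not the authors' pipeline: the paper's title names Canny edges, whereas the code below uses a simpler thresholded Sobel gradient magnitude as a stand-in, and the names `sobel_edge_map`, `edge_video`, and the synthetic moving-square clip are hypothetical, chosen only for illustration.

```python
import numpy as np


def sobel_edge_map(frame: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Return a binary edge map for a 2-D grayscale frame.

    Simplified stand-in for Canny: thresholded Sobel gradient magnitude,
    without non-maximum suppression or hysteresis.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = frame.shape
    # Replicate border pixels so the output keeps the input size.
    padded = np.pad(frame.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Accumulate the 3x3 cross-correlation one kernel tap at a time.
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + h, dx:dx + w]
            gx += kx[dy, dx] * window
            gy += ky[dy, dx] * window
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    if peak == 0:
        # Flat frame: no edges anywhere.
        return np.zeros((h, w), dtype=np.uint8)
    return (magnitude / peak >= threshold).astype(np.uint8)


def edge_video(frames: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Apply sobel_edge_map to every frame of a (T, H, W) clip."""
    return np.stack([sobel_edge_map(f, threshold) for f in frames])


# Hypothetical "simulator" clip: a bright square sliding right by one pixel per frame.
clip = np.zeros((4, 16, 16))
for t in range(4):
    clip[t, 4:10, 2 + t:8 + t] = 1.0

edges = edge_video(clip)
print(edges.shape, edges.dtype)
```

The edge maps keep the object's outline and motion while discarding texture and colour, which is why this kind of cue is a plausible conditioning signal when appearance (lighting, background, embodiment) is meant to vary.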
