DRAW2ACT: Turning Depth-Encoded Trajectories into Robotic Demonstration Videos

Yang Bai; Liudi Yang; George Eskandar; Fengyi Shen; Mohammad Altillawi; Ziyuan Liu; Gitta Kutyniok

arXiv:2512.14217·cs.CV·December 17, 2025

TL;DR

DRAW2ACT is a depth-aware, trajectory-conditioned video generation framework that produces controllable, consistent robotic demonstration videos by integrating multiple modalities and depth information, improving manipulation success.

Contribution

The paper introduces a depth-aware, multimodal diffusion model for generating controllable robotic demonstration videos, together with a policy model conditioned on the generated videos to improve manipulation.

Findings

01. Achieves higher visual fidelity and consistency.

02. Yields higher manipulation success rates.

03. Outperforms existing baselines.

Abstract

Video diffusion models provide powerful real-world simulators for embodied AI but remain limited in controllability for robotic manipulation. Recent works on trajectory-conditioned video generation address this gap but often rely on 2D trajectories or single modality conditioning, which restricts their ability to produce controllable and consistent robotic demonstrations. We present DRAW2ACT, a depth-aware trajectory-conditioned video generation framework that extracts multiple orthogonal representations from the input trajectory, capturing depth, semantics, shape and motion, and injects them into the diffusion model. Moreover, we propose to jointly generate spatially aligned RGB and depth videos, leveraging cross-modality attention mechanisms and depth supervision to enhance the spatio-temporal consistency. Finally, we introduce a multimodal policy model conditioned on the generated…
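The abstract as shown here does not say how a trajectory carries depth, so the following is only an illustrative sketch of one way a depth-encoded trajectory could be rasterized into an image-like conditioning map. The function name `rasterize_depth_trajectory`, the `(column, row, depth)` point layout, and the brightness mapping are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical point layout: (column, row, depth in metres).
Point = Tuple[int, int, float]


def rasterize_depth_trajectory(
    points: List[Point], width: int, height: int
) -> List[List[float]]:
    """Draw a trajectory onto a height x width grid, one depth value per pixel.

    Background pixels stay 0.0. Trajectory pixels hold a value in [0.1, 1.0],
    with nearer points brighter (an assumed convention, chosen for illustration).
    """
    if not points:
        raise ValueError("trajectory must contain at least one point")

    depths = [d for _, _, d in points]
    d_min, d_max = min(depths), max(depths)
    span = d_max - d_min

    grid = [[0.0] * width for _ in range(height)]
    for x, y, d in points:
        # Skip points that fall outside the image bounds.
        if 0 <= x < width and 0 <= y < height:
            norm = (d - d_min) / span if span > 0 else 0.0
            # Nearest point -> 1.0, farthest -> 0.1, so the path never
            # blends into the 0.0 background.
            grid[y][x] = 1.0 - 0.9 * norm
    return grid


# Example: a diagonal stroke that moves away from the camera.
trajectory = [(0, 0, 1.0), (1, 1, 2.0), (2, 2, 3.0)]
depth_map = rasterize_depth_trajectory(trajectory, width=3, height=3)
```

A single-channel map like this keeps the 2D path and the depth cue spatially aligned with the video frame, which is the property a trajectory-conditioned generator would need from its input; the actual representations used by DRAW2ACT (depth, semantics, shape, motion) are richer than this sketch.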

Taxonomy

Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation