CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion
Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Ziyuan Liu, Abhinav Valada

TL;DR
CoVAR is a multi-modal diffusion framework that generates synchronized video-action pairs from a text instruction, an initial image observation, and the robot's joint states, so that large-scale video data lacking action labels can be used for robotic policy learning.
Contribution
CoVAR extends a pretrained video diffusion model with a parallel, dedicated action diffusion model, a Bridge Attention mechanism for cross-modal interaction, and an action refinement module, so that video and actions are generated jointly for robotic manipulation.
Findings
Outperforms existing methods in video quality and action accuracy
Effective cross-modal interaction improves generation fidelity
Scalable framework demonstrated on multiple benchmarks
Abstract
We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse…
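To make the idea of cross-modal interaction between a video stream and an action stream concrete, here is a minimal sketch in plain Python. It is not the paper's Bridge Attention: the abstract does not specify that mechanism, so this uses generic single-head scaled dot-product cross-attention in both directions with a residual connection. The names `cross_attend` and `bridge_step`, the token counts, the shared feature width of 8, and the identity projections are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: identity projections (no learned W_q, W_k, W_v), and the
    same token list serves as both keys and values.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return out


def bridge_step(video_tokens, action_tokens):
    """Hypothetical bidirectional exchange between the two streams.

    Each stream attends to the other and adds the result as a residual,
    so neither stream's own features are overwritten.
    """
    video_ctx = cross_attend(video_tokens, action_tokens)
    action_ctx = cross_attend(action_tokens, video_tokens)
    new_video = [
        [v + c for v, c in zip(tok, ctx)] for tok, ctx in zip(video_tokens, video_ctx)
    ]
    new_action = [
        [a + c for a, c in zip(tok, ctx)] for tok, ctx in zip(action_tokens, action_ctx)
    ]
    return new_video, new_action


# Toy data: 6 video tokens and 4 action tokens, each of width 8 (assumed sizes).
rng = random.Random(0)
video_tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]
action_tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

new_video, new_action = bridge_step(video_tokens, action_tokens)
print(len(new_video), len(new_video[0]), len(new_action), len(new_action[0]))
```

The residual form mirrors the abstract's stated goal of preserving pretrained video knowledge: the video stream keeps its own features and only receives an additive signal from the action stream. A real model would add learned projections and typically several attention heads.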
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
