CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion

Liudi Yang; Yang Bai; George Eskandar; Fengyi Shen; Mohammad Altillawi; Dong Chen; Ziyuan Liu; Abhinav Valada

arXiv:2512.16023 · cs.CV · December 19, 2025

TL;DR

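CoVAR is a multi-modal diffusion framework that generates synchronized video-action pairs following a text instruction, starting from an initial image observation and the robot's joint states. Because the generated videos come with action labels, video diffusion models can be used directly for robotic policy learning.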

Contribution

CoVAR extends a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves the pretrained knowledge, and adds a Bridge Attention mechanism for cross-modal interaction between the video and action branches.

Findings

01. Outperforms existing methods in video quality and action accuracy

02. Effective cross-modal interaction improves generation fidelity

03. Scalable framework demonstrated on multiple benchmarks

Abstract

We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse…
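The abstract names a Bridge Attention mechanism for cross-modal interaction but does not describe its internals here. As a rough illustration of what an exchange between two parallel token streams can look like, the following sketch applies generic single-head scaled dot-product cross-attention in both directions with residual updates. It is not the paper's design: the function names (`cross_attend`, `bidirectional_exchange`), the absence of learned projections, the shared token dimension, and the random toy tokens are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token attends over all context tokens.

    Generic scaled dot-product attention with no learned projections
    (an assumption for brevity; real models project to Q, K, V first).
    Both streams must share the same token dimension here.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        out.append(mixed)
    return out


def bidirectional_exchange(video_tokens, action_tokens):
    """Hypothetical exchange step: each stream receives a residual
    update computed by attending over the other stream."""
    video_update = cross_attend(video_tokens, action_tokens)
    action_update = cross_attend(action_tokens, video_tokens)
    new_video = [[x + u for x, u in zip(tok, upd)]
                 for tok, upd in zip(video_tokens, video_update)]
    new_action = [[x + u for x, u in zip(tok, upd)]
                  for tok, upd in zip(action_tokens, action_update)]
    return new_video, new_action


# Toy data: 6 video tokens and 3 action tokens, each of dimension 4.
random.seed(0)
dim = 4
video_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
action_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

new_video, new_action = bidirectional_exchange(video_tokens, action_tokens)
print(len(new_video), len(new_action))
```

Each stream keeps its own token count and shape after the exchange, which is the property that lets two separate diffusion branches share information without being merged into one model.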

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning