VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Yichao Shen; Fangyun Wei; Zhiying Du; Yaobo Liang; Yan Lu; Jiaolong Yang; Nanning Zheng; Baining Guo

arXiv:2512.06963·cs.RO·December 9, 2025

Open Access

TL;DR

VideoVLA adapts large video generation models so that a robot predicts both its actions and their visual outcomes, improving generalization to new tasks, objects, and environments in manipulation.

Contribution

This work introduces VideoVLA, a novel approach that transforms video generation models into robotic manipulators capable of joint action and visual outcome prediction for better generalization.

Findings

1. High-quality visual imagination correlates with successful manipulation.
2. VideoVLA demonstrates strong generalization to new objects and tasks.
3. The dual-prediction strategy enhances robot learning and adaptability.

Abstract

Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures…
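
The input/output contract described in the abstract (a language instruction and an image go in, an action sequence and imagined future frames come out of one joint model) can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: `toy_denoiser` is a hand-written stand-in for the multi-modal Diffusion Transformer, and the names `Prediction` and `predict`, the token dimensions, the horizon, and the step count are all hypothetical. The only point is that video tokens and action tokens are placed in a single sequence and denoised together, conditioned on the instruction and the observed image.

```python
from dataclasses import dataclass
from typing import List
import random


@dataclass
class Prediction:
    actions: List[List[float]]         # one action vector per future step
    future_latents: List[List[float]]  # one latent vector per imagined frame


def toy_denoiser(tokens, context, t):
    """Hypothetical stand-in for the Diffusion Transformer (not a learned model).

    Moves every token value toward the mean of the conditioning context,
    closing a fraction 1/t of the remaining gap at step t.
    """
    target = sum(context) / len(context)
    return [[x + (target - x) / t for x in tok] for tok in tokens]


def predict(instruction_emb, image_latent, horizon=4, frames=3,
            action_dim=7, latent_dim=8, steps=10, seed=0):
    """Jointly denoise video and action tokens (all sizes are assumed values)."""
    rng = random.Random(seed)
    # Conditioning: instruction embedding plus the latent of the observed image.
    context = list(instruction_emb) + list(image_latent)
    # Both modalities start from Gaussian noise.
    video = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(frames)]
    action = [[rng.gauss(0, 1) for _ in range(action_dim)] for _ in range(horizon)]
    for t in range(steps, 0, -1):
        # One sequence holds both modalities, so each step updates them together.
        joint = toy_denoiser(video + action, context, t)
        video, action = joint[:frames], joint[frames:]
    return Prediction(actions=action, future_latents=video)


if __name__ == "__main__":
    pred = predict([0.0, 1.0], [2.0, 3.0])
    print(len(pred.actions), "actions,", len(pred.future_latents), "imagined frames")
```

In a real system the denoiser would be a trained network and the image and frames would be encoded and decoded by a video autoencoder; the sketch only shows the shape of the joint prediction loop.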

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI