Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

Zhongru Zhang; Chenghan Yang; Qingzhou Lu; Yanjiang Guo; Jianke Zhang; Yucheng Hu; Jianyu Chen

arXiv:2604.04502·cs.RO·April 7, 2026

TL;DR

This paper explores how advanced video generation models like Veo-3 can support generalizable robot manipulation, combining visual prediction with inverse dynamics and hierarchical control to improve task performance.

Contribution

It introduces Veo-Act, a hierarchical framework that combines high-level video predictions with low-level policies for robot manipulation.

Findings

01 Veo-3+IDM can generate approximately correct task trajectories.

02 Low-level control accuracy of Veo-3+IDM is currently insufficient for most tasks.

03 Veo-Act significantly improves instruction-following performance.

Abstract

Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model (IDM) recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this "Veo-3+IDM" approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong…
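
The idea of pairing predicted frames with an inverse dynamics model can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the "observation" is a 2-D position standing in for an image, the dynamics are linear with a hypothetical `GAIN`, the IDM is a per-axis least-squares fit, and `predicted_frames` is a hand-written stand-in for video model output. It shows only the general pattern: fit an IDM on random-play transitions, then read actions off consecutive predicted frames.

```python
import random

# Hypothetical toy dynamics, unknown to the IDM: obs' = obs + GAIN * action.
GAIN = (0.5, 2.0)

def step(obs, action):
    """Toy environment step: each axis moves by its gain times the action."""
    return tuple(o + g * a for o, g, a in zip(obs, GAIN, action))

def collect_random_play(n, rng):
    """Roll out random actions and log (obs, next_obs, action) transitions."""
    data, obs = [], (0.0, 0.0)
    for _ in range(n):
        action = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        nxt = step(obs, action)
        data.append((obs, nxt, action))
        obs = nxt
    return data

def fit_idm(data):
    """Per-axis least squares for action ~= k * (next_obs - obs)."""
    ks = []
    for i in range(2):
        num = sum(a[i] * (n[i] - o[i]) for o, n, a in data)
        den = sum((n[i] - o[i]) ** 2 for o, n, a in data)
        ks.append(num / den)
    return tuple(ks)

def idm_actions(frames, ks):
    """Recover one action per consecutive pair of frames."""
    return [tuple(k * (b[i] - a[i]) for i, k in enumerate(ks))
            for a, b in zip(frames, frames[1:])]

# Train the IDM on random play only: no demonstrations, no labels from a human.
rng = random.Random(0)
ks = fit_idm(collect_random_play(200, rng))

# Stand-in for video model output: a straight-line plan of "frames".
predicted_frames = [(0.1 * t, 0.2 * t) for t in range(6)]
actions = idm_actions(predicted_frames, ks)

# Execute the recovered actions and end up near the last predicted frame.
obs = predicted_frames[0]
for a in actions:
    obs = step(obs, a)
```

In this toy the fit is exact because the dynamics are linear and noise-free. With real images and a dexterous hand, the IDM would be a learned network, and its errors would limit low-level control accuracy.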
