Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Zhongru Zhang, Chenghan Yang, Qingzhou Lu, Yanjiang Guo, Jianke Zhang, Yucheng Hu, Jianyu Chen

TL;DR
This paper explores how advanced video generation models like Veo-3 can support generalizable robot manipulation, combining visual prediction with inverse dynamics and hierarchical control to improve task performance.
Contribution
The paper introduces Veo-Act, a hierarchical framework that pairs high-level video predictions with low-level policies for robot manipulation.
Findings
Veo-3+IDM can generate approximately correct task trajectories.
Low-level control accuracy of Veo-3+IDM is currently insufficient for most tasks.
Veo-Act significantly improves instruction-following performance.
Abstract
Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model (IDM) recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that, if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this "Veo-3+IDM" approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong…
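The "predict frames, then recover actions" idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear dynamics matrix `B`, the low-dimensional vector observations, the least-squares IDM (swapped in for a learned neural IDM), and `fake_video_plan` (a linear interpolation standing in for Veo-3's predicted frames) are all hypothetical stand-ins chosen so the example runs with NumPy alone.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM = 4, 4

# Hypothetical toy dynamics (assumption): o_{t+1} = o_t + B @ a_t.
B = np.eye(OBS_DIM, ACT_DIM) + 0.1 * rng.normal(size=(OBS_DIM, ACT_DIM))

def step(obs, act):
    """Apply one action in the toy environment."""
    return obs + B @ act

def collect_random_play(n):
    """Random-play data: random observations and random actions, no expert."""
    obs = rng.normal(size=(n, OBS_DIM))
    acts = rng.uniform(-1.0, 1.0, size=(n, ACT_DIM))
    next_obs = obs + acts @ B.T
    return obs, acts, next_obs

def fit_idm(obs, next_obs, acts):
    """Fit a linear IDM mapping (o_t, o_{t+1}) -> a_t by least squares."""
    X = np.hstack([obs, next_obs])
    W, *_ = np.linalg.lstsq(X, acts, rcond=None)
    return W

def idm_predict(W, obs, next_obs):
    """Recover the action that moves obs toward next_obs."""
    return np.concatenate([obs, next_obs]) @ W

def fake_video_plan(start, goal, horizon):
    """Stand-in for a video model: interpolated 'future frames'."""
    return np.linspace(start, goal, horizon + 1)

# Train the IDM on random play only.
obs, acts, next_obs = collect_random_play(500)
W = fit_idm(obs, next_obs, acts)

# Follow the predicted trajectory frame by frame.
start, goal = np.zeros(OBS_DIM), np.ones(OBS_DIM)
plan = fake_video_plan(start, goal, horizon=10)
final_obs = start
for target in plan[1:]:
    final_obs = step(final_obs, idm_predict(W, final_obs, target))

print(np.round(final_obs, 3))
```

In this toy setting the IDM is exact because the dynamics are linear; with real images and a dexterous hand, the learned IDM and the generated frames would both introduce error.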
