Action Images: End-to-End Policy Learning via Multiview Video Generation

Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Pengsheng Guo, Tsun-Hsuan Wang, Yi-Ling Qiao, and Chuang Gan

arXiv:2604.06168 · cs.CV · April 16, 2026

TL;DR

This paper introduces Action Images, a unified world action model that casts robot policy learning as multiview video generation, using pixel-grounded action images to improve transfer across viewpoints and environments.

Contribution

The paper proposes translating 7-DoF robot actions into interpretable multiview action videos grounded in 2D pixels, so that the video backbone itself can act as a zero-shot policy without a separate control module.

Findings

01. Achieves the strongest zero-shot success rates on RLBench and real-world tasks.

02. Improves video-action joint generation quality over prior models.

03. Supports multiple functionalities like control, joint generation, and action labeling within a shared model.

Abstract

World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a…
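
To make the idea of a pixel-grounded action representation concrete, here is a minimal sketch of one way a 3D end-effector position could be turned into one image per camera view. It is an illustration under stated assumptions, not the paper's encoding: the pinhole camera model, the Gaussian-blob rendering, and every name (`project_point`, `render_action_image`, `action_to_multiview_images`, the camera parameters) are hypothetical, and only the translational part of a 7-DoF action is handled (rotation and gripper state are omitted).

```python
import numpy as np


def project_point(point_world, K, R, t):
    """Project a 3D world point to pixel coordinates (u, v) with a pinhole camera.

    K is the 3x3 intrinsic matrix; R (3x3) and t (3,) map world to camera frame.
    Returns None if the point lies behind the camera.
    """
    p_cam = R @ point_world + t
    if p_cam[2] <= 0:
        return None
    uv1 = K @ (p_cam / p_cam[2])  # normalize by depth, then apply intrinsics
    return uv1[:2]


def render_action_image(uv, height, width, sigma=3.0):
    """Render a Gaussian blob centered at pixel uv (u = column, v = row)."""
    if uv is None:
        return np.zeros((height, width), dtype=np.float32)
    rows, cols = np.mgrid[0:height, 0:width]
    sq_dist = (cols - uv[0]) ** 2 + (rows - uv[1]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma**2)).astype(np.float32)


def action_to_multiview_images(position, cameras, height=64, width=64):
    """Return one action image per camera for a single 3D position."""
    return [
        render_action_image(project_point(position, K, R, t), height, width)
        for (K, R, t) in cameras
    ]


# Hypothetical two-camera setup: same intrinsics, different distances.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
cameras = [
    (K, np.eye(3), np.array([0.0, 0.0, 1.0])),
    (K, np.eye(3), np.array([0.0, 0.0, 2.0])),
]

# Assumed end-effector position in world coordinates (meters).
position = np.array([0.1, -0.2, 0.0])
images = action_to_multiview_images(position, cameras)
```

Stacking such images over time would give one action video per view. The appeal of a representation like this is that the action lives in the same pixel space as the camera frames, which is the property the abstract credits for letting a video backbone model actions directly.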
