GigaWorld-Policy: An Efficient Action-Centered World-Action Model

Angen Ye; Boyuan Wang; Chaojun Ni; Guan Huang; Guosheng Zhao; Hao Li; Hengtao Li; Jie Li; Jindi Lv; Jingyu Liu; Min Cao; Peng Li; Qiuping Deng; Wenjun Mei; Xiaofeng Wang; Xinze Chen; Xinyu Zhou; Yang Wang; Yifan Chang; Yifan Li; Yukun Zhou; Yun Ye; Zhichao Liu; Zheng Zhu

arXiv:2603.17240·cs.CV·March 24, 2026

TL;DR

GigaWorld-Policy is an action-centered world-action model (WAM) for robot policy learning that predicts actions directly and generates video only optionally. It is reported to run 9x faster than the Motus baseline while improving task success rates by 7%.

Contribution

The paper proposes an action-centered world-action model (WAM) that decouples action prediction from video prediction, enabling faster inference and more accurate, physically plausible robot policies.

Findings

01. Runs 9x faster than the Motus baseline

02. Improves task success rates by 7%

03. Enhances performance by 95% on RoboTwin 2.0

Abstract

World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics