GigaWorld-0: World Models as Data Engine to Empower Embodied AI

GigaWorld Team; Angen Ye; Boyuan Wang; Chaojun Ni; Guan Huang; Guosheng Zhao; Haoyun Li; Jiagang Zhu; Kerui Li; Mengyuan Xu; Qiuping Deng; Siting Wang; Wenkang Qin; Xinze Chen; Xiaofeng Wang; Yankai Wang; Yu Cao; Yifan Chang; Yuan Xu; Yun Ye; Yang Wang; Yukun Zhou; Zhengyuan Zhang; Zhehao Dong; Zheng Zhu

arXiv:2511.19861 · cs.CV · December 2, 2025

TL;DR

GigaWorld-0 is a unified world model framework that generates diverse, realistic, and controllable embodied interaction data for training Vision-Language-Action (VLA) models. Models trained on this data improve on real-world robot tasks without real-world training data.

Contribution

The paper presents GigaWorld-0, a unified world model framework that pairs large-scale video generation (GigaWorld-0-Video) with 3D generative modeling and reconstruction (GigaWorld-0-3D) to produce training data for embodied AI.

Findings

01. Generated data is diverse, realistic, and controllable.

02. VLA models trained on GigaWorld-0 data outperform previous methods.

03. Significant improvements in real-world robot tasks without real-world data.

Abstract

World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis