World Simulation with Video Foundation Models for Physical AI

NVIDIA: Arslan Ali; Junjie Bai; Maciej Bala; Yogesh Balaji; Aaron Blakeman; Tiffany Cai; Jiaxin Cao; Tianshi Cao; Elizabeth Cha; Yu-Wei Chao; Prithvijit Chattopadhyay; Mike Chen; Yongxin Chen; Yu Chen; Shuai Cheng; Yin Cui; Jenna Diamond; Yifan Ding; Jiaojiao Fan; Linxi Fan; Liang Feng; Francesco Ferroni; Sanja Fidler; Xiao Fu; Ruiyuan Gao; Yunhao Ge; Jinwei Gu; Aryaman Gupta; Siddharth Gururani; Imad El Hanafi; Ali Hassani; Zekun Hao; Jacob Huffman; Joel Jang; Pooya Jannaty; Jan Kautz; Grace Lam; Xuan Li; Zhaoshuo Li; Maosheng Liao; Chen-Hsuan Lin; Tsung-Yi Lin; Yen-Chen Lin; Huan Ling; Ming-Yu Liu; Xian Liu; Yifan Lu; Alice Luo; Qianli Ma; Hanzi Mao; Kaichun Mo; Seungjun Nah; Yashraj Narang; Abhijeet Panaskar; Lindsey Pavao; Trung Pham; Morteza Ramezanali; Fitsum Reda; Scott Reed; Xuanchi Ren; Haonan Shao; Yue Shen; Stella Shi; Shuran Song; Bartosz Stefaniak; Shangkun Sun; Shitao Tang; Sameena Tasmeen; Lyne Tchapmi; Wei-Cheng Tseng; Jibin Varghese; Andrew Z. Wang; Hao Wang; Haoxiang Wang; Heng Wang; Ting-Chun Wang; Fangyin Wei; Jiashu Xu; Dinghao Yang; Xiaodong Yang; Haotian Ye; Seonghyeon Ye; Xiaohui Zeng; Jing Zhang; Qinsheng Zhang; Kaiwen Zheng; Andrew Zhu; Yuke Zhu

arXiv:2511.00062 · cs.CV · February 26, 2026

TL;DR

This paper introduces Cosmos-Predict2.5, a unified flow-based model for multi-modal world simulation that improves video quality and instruction alignment, enabling advanced applications in Physical AI and robotics.

Contribution

The paper presents Cosmos-Predict2.5, a unified flow-based model that handles Text2World, Image2World, and Video2World generation, uses Cosmos-Reason1 for richer text grounding, and is refined with reinforcement learning-based post-training for Physical AI.

Findings

01. Substantial improvements in video quality and instruction alignment over Cosmos-Predict1.

02. Successful deployment in synthetic data generation and policy evaluation.

03. Introduction of Cosmos-Transfer2.5 for high-fidelity, long-horizon video translation.

Abstract

We introduce Cosmos-Predict2.5, the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation in a single model and leverages Cosmos-Reason1, a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, Cosmos-Predict2.5 achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with Cosmos-Transfer2.5, a control-net style framework for Sim2Real and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics