Video Generation Models in Robotics -- Applications, Research Challenges, Future Directions

Zhiting Mei; Tenny Yin; Ola Shorinwa; Apurva Badithela; Zhonghe Zheng; Joseph Bruno; Madison Bland; Lihan Zha; Asher Hancock; Jaime Fernández Fisac; Philip Dames; Anirudha Majumdar

arXiv:2601.07823·eess.SY·January 13, 2026

Open Access

TL;DR

This survey reviews the use of video generation models in robotics, highlighting their capabilities in simulating physical interactions, their applications in various learning paradigms, and the challenges faced in deploying them safely and effectively.

Contribution

It provides a comprehensive overview of video models in robotics, discusses current applications, identifies key challenges, and suggests future research directions for safer and more effective integration.

Findings

1. Video models enable photorealistic, physically consistent simulations.
2. They serve as expressive world models for complex physical interactions.
3. Challenges include data costs, hallucinations, and safety concerns.

Abstract

Video generation models have emerged as high-fidelity models of the physical world, capable of synthesizing high-quality videos capturing fine-grained interactions between agents and their environments conditioned on multi-modal user inputs. Their impressive capabilities address many of the long-standing challenges faced by physics-based simulators, driving broad adoption in many problem domains, e.g., robotics. For example, video models enable photorealistic, physically consistent deformable-body simulation without making prohibitive simplifying assumptions, which is a major bottleneck in physics-based simulation. Moreover, video models can serve as foundation world models that capture the dynamics of the world in a fine-grained and expressive way. They thus overcome the limited expressiveness of language-only abstractions in describing intricate physical interactions. In this survey,…

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications