Rethinking Video Generation Model for the Embodied World

Yufan Deng; Zilin Pan; Hongyu Zhang; Xiaojie Li; Ruoqing Hu; Yufei Ding; Yiming Zou; Yan Zeng; Daquan Zhou

arXiv:2601.15282·cs.CV·January 22, 2026

TL;DR

This paper introduces RBench, a comprehensive benchmark for evaluating robot-oriented video generation, and RoVid-X, a large-scale dataset, to improve the realism and assessment of embodied AI models.

Contribution

It presents a new standardized benchmark and a large annotated dataset to advance the evaluation and training of high-quality robot video generation models.

Findings

01. Significant deficiencies in current models' physical realism.

02. High correlation (0.96) between RBench scores and human judgments.

03. The new dataset enables scalable training of embodied AI models.
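
To make the second finding concrete, here is a minimal sketch of how agreement between benchmark scores and human ratings could be measured. This page does not state which correlation coefficient is behind the 0.96 figure, so the use of Spearman rank correlation is an assumption, and the function names and per-model numbers are hypothetical toy values, not results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model values, invented purely for illustration.
benchmark_scores = [0.31, 0.45, 0.52, 0.60, 0.72]
human_ratings = [2.1, 2.9, 2.7, 3.8, 4.4]

rho = spearman(benchmark_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the automatic metric orders models almost the same way human raters do, which is the sense in which a benchmark can be said to agree with human judgment.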

Abstract

Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors.…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI · Multimodal Machine Learning Applications