Rethinking Video Generation Model for the Embodied World
Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou

TL;DR
This paper introduces RBench, a benchmark for evaluating robot-oriented video generation, and RoVid-X, a large-scale annotated dataset for training robot video generation models.
Contribution
It presents a new standardized benchmark and a large annotated dataset to advance the evaluation and training of high-quality robot video generation models.
Findings
Evaluation of 25 representative models reveals significant deficiencies in generating physically realistic robot behaviors.
High correlation (0.96) between RBench scores and human judgments.
The new dataset enables scalable training of embodied AI models.
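The 0.96 figure above measures agreement between benchmark scores and human judgments. This summary does not say which correlation coefficient was used, so the following is only a minimal sketch that assumes a Spearman rank correlation. The model scores and human ratings are invented for illustration and are not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: one benchmark score and one mean human rating per model.
benchmark_scores = [0.61, 0.48, 0.72, 0.35, 0.55]
human_ratings = [3.9, 3.1, 4.4, 2.2, 3.0]

rho = spearman(benchmark_scores, human_ratings)
print(f"Spearman correlation: {rho:.2f}")
```

A value near 1 means the benchmark orders models almost exactly as human raters do, which is the property the 0.96 finding claims for RBench.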
Abstract
Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors.…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI · Multimodal Machine Learning Applications
