WorldEval: World Model as Real-World Robot Policies Evaluator
Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, Yi Xu

TL;DR
This paper introduces WorldEval, a scalable framework that uses world models to evaluate robot manipulation policies efficiently and safely. Its results correlate well with real-world performance, and it outperforms existing real-to-sim evaluation methods.
Contribution
It presents Policy2Vec, which turns a video generation model into a world simulator that generates robot videos following latent actions, and develops WorldEval, an automated online evaluation pipeline for robot policies.
Findings
Strong correlation between WorldEval and real-world performance
WorldEval outperforms real-to-sim evaluation methods
Enables safe, scalable policy assessment in diverse environments
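The first finding is a claim about correlation between success rates measured in the world model and on the real robot. As a minimal sketch of how such a comparison could be computed, the snippet below calculates a Pearson correlation coefficient over per-policy success rates. The function name `pearson` and all numbers are illustrative placeholders, not values or metrics taken from the paper, and the paper may use a different correlation measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder success rates for four hypothetical policies (NOT from the paper).
world_model_success = [0.20, 0.45, 0.60, 0.80]
real_world_success = [0.25, 0.40, 0.65, 0.85]

r = pearson(world_model_success, real_world_success)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that policies ranked highly by the world-model evaluator also tend to perform well on the real robot, which is the property the finding describes.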
Abstract
The field of robotics has made significant strides toward developing generalist robot manipulation policies. However, evaluating these policies in real-world scenarios remains time-consuming and challenging, particularly as the number of tasks scales and environmental conditions change. In this work, we demonstrate that world models can serve as a scalable, reproducible, and reliable proxy for real-world robot policy evaluation. A key challenge is generating accurate policy videos from world models that faithfully reflect the robot actions. We observe that directly inputting robot actions or using high-dimensional encoding methods often fails to generate action-following videos. To address this, we propose Policy2Vec, a simple yet effective approach that turns a video generation model into a world simulator that follows latent actions to generate robot videos. We then introduce…
Taxonomy
Topics: Distributed systems and fault tolerance
