Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test

Chun-Kai Fan; Xiaowei Chi; Xiaozhu Ju; Hao Li; Yong Bao; Yu-Kai Wang; Lizhang Chen; Zhiyuan Jiang; Kuangzhi Ge; Ying Li; Weishi Mi; Qingpo Wuwu; Peidong Jia; Yulin Luo; Kevin Zhang; Zhiyuan Qin; Yong Dai; Sirui Han; Yike Guo; Shanghang Zhang; Jian Tang

arXiv:2601.04137 · cs.RO · January 8, 2026

Open Access

TL;DR

This paper introduces the WoW-World-Eval (Wow-wo-val) benchmark, which evaluates the perceptual fidelity, robustness, and generalization of video foundation models for embodied AI. The evaluation reveals significant gaps between generated videos and real-world performance.

Contribution

The paper establishes a standardized evaluation framework with 22 metrics for assessing embodied world models, including a human-aligned Turing Test and real-world execution benchmarks.

Findings

01. Models show limited spatiotemporal consistency and physical reasoning.

02. Model scores correlate strongly with human preferences (>0.93).

03. Most models fail in real-world execution, with success rates near 0%, while WoW achieves 40.74%.
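
As a rough illustration of the second finding, the sketch below computes a correlation between benchmark scores and human preference ratings. It assumes a Pearson correlation, which this summary does not specify, and every number in it is a hypothetical toy value, not data from the paper. The function name `pearson` and both lists are introduced here for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values (NOT from the paper): one entry per evaluated model.
benchmark_scores = [0.31, 0.45, 0.52, 0.60, 0.74]  # assumed overall metric scores
human_ratings = [2.1, 2.9, 3.0, 3.8, 4.4]          # assumed mean human preference

r = pearson(benchmark_scores, human_ratings)
print(f"correlation = {r:.3f}")
```

A value close to 1 would indicate that the automatic score ranks models much as human raters do, which is the property the finding describes.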

Abstract

As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks such as 3D prediction or interactive generation. However, before these downstream tasks are explored, two critical questions about video foundation models remain unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow,wo,val). Built upon 609 robot manipulation data samples, Wow-wo-val examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive…

Taxonomy

Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis