EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models
Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, Guanghui Ren

TL;DR
EWMBench is a new benchmark framework designed to evaluate embodied world models by assessing scene, motion, and semantic quality, addressing the need for physically grounded and action-consistent AI-generated scenes.
Contribution
The paper introduces EWMBench, a comprehensive evaluation framework with a curated dataset and tools to assess embodied world models beyond perceptual metrics.
Findings
Existing models often lack physical grounding.
EWMBench effectively identifies model limitations.
The benchmark is intended to guide future embodied AI development.
Abstract
Recent advances in creative AI have enabled the synthesis of high-fidelity images and videos conditioned on language instructions. Building on these developments, text-to-video diffusion models have evolved into embodied world models (EWMs) capable of generating physically plausible scenes from language commands, effectively bridging vision and action in embodied AI applications. This work addresses the critical challenge of evaluating EWMs beyond general perceptual metrics to ensure the generation of physically grounded and action-consistent behaviors. We propose the Embodied World Model Benchmark (EWMBench), a dedicated framework designed to evaluate EWMs based on three key aspects: visual scene consistency, motion correctness, and semantic alignment. Our approach leverages a meticulously curated dataset encompassing diverse scenes and motion patterns, alongside a comprehensive…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
