WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

Yu Shang; Yinzhou Tang; Yiding Ma; Zhuohang Li; Lei Jin; Weikang Su; Xin Jin; Zhaolu Wang; Ziyou Wang; Xin Zhang; Haisheng Su; Weizhen He; Wei Wu; Haoyi Duan; Gordon Wetzstein; Xihui Liu; Dhruv Shah; Zhaoxiang Zhang; Zhibo Chen; Jun Zhu; Yonghong Tian; Tat-Seng Chua; Wenwu Zhu; Chen Gao; Yong Li

arXiv:2605.17912 · cs.RO · May 19, 2026

TL;DR

WorldArena 2.0 is a comprehensive benchmark for embodied world models, evaluating multimodal perception, interactive utility, and cross-platform performance in both simulated and real-world robotic settings.

Contribution

WorldArena 2.0 extends existing embodied world model benchmarks with visuotactile modalities, evaluation of world models as interactive environments, and diverse robotic platforms, giving a more holistic assessment of embodied world models.

Findings

1. Extends evaluation from vision-only to multimodal perception.
2. Includes assessment of world models as interactive RL environments.
3. Enables cross-platform performance measurement.

Abstract

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned futures and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL…
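
To make the idea of using a world model as an interactive RL environment concrete, here is a minimal sketch. It is not WorldArena 2.0's interface, which this page does not describe. The class `WorldModelEnv`, the functions `toy_dynamics` and `rollout`, and the reward are all hypothetical, and a hand-written additive dynamics function stands in for a learned model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = List[float]
Action = List[float]


@dataclass
class WorldModelEnv:
    """Hypothetical wrapper that exposes a world model through reset/step."""

    predict_next: Callable[[State, Action], State]  # stand-in for a learned model
    reward_fn: Callable[[State], float]             # assumed task reward
    initial_state: State
    horizon: int = 10

    def reset(self) -> State:
        # Start a new imagined episode from the fixed initial state.
        self._state = list(self.initial_state)
        self._t = 0
        return list(self._state)

    def step(self, action: Action) -> Tuple[State, float, bool]:
        # The next state comes from the model's prediction, not a simulator.
        self._state = self.predict_next(self._state, action)
        self._t += 1
        reward = self.reward_fn(self._state)
        done = self._t >= self.horizon
        return list(self._state), reward, done


def toy_dynamics(state: State, action: Action) -> State:
    # Toy assumption: the action is added to the state elementwise.
    return [s + a for s, a in zip(state, action)]


def rollout(env: WorldModelEnv, policy: Callable[[State], Action]) -> Tuple[State, float]:
    # Run one episode inside the model and return the final state and return.
    state = env.reset()
    total = 0.0
    done = False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return state, total


if __name__ == "__main__":
    env = WorldModelEnv(
        predict_next=toy_dynamics,
        reward_fn=lambda s: -abs(s[0] - 1.0),  # reward peaks at s[0] == 1.0
        initial_state=[0.0],
        horizon=3,
    )
    final_state, episode_return = rollout(env, lambda s: [1.0 - s[0]])
    print(final_state, episode_return)
```

In this framing a policy is trained or evaluated entirely on predicted transitions, so errors in the model's predictions show up directly in the returns the policy collects.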

Code & Models

Repositories

https://world-arena.ai
