WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Xiaojie Xu; Zhengyuan Lin; Kang He; Yukang Feng; Xiaofeng Mao; Yuanyang Yin; Kaipeng Zhang; Yongtao Ge

arXiv:2604.21686 · cs.CV · April 24, 2026

TL;DR

WorldMark is a comprehensive benchmark suite for interactive video world models, providing standardized evaluation conditions, a unified control interface, and an online arena for model comparison.

Contribution

WorldMark introduces a unified action-mapping layer, a hierarchical test suite, and a modular evaluation toolkit for fair comparison of interactive video models.
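To make the idea of a modular evaluation toolkit concrete, here is a minimal Python sketch. It is not WorldMark's code. It assumes that a test case bundles a scene with a shared action sequence plus scene-type and viewpoint labels, and that each metric is a plain function scoring one generated output. The names EvalCase, EvaluationToolkit, register, evaluate, and evaluate_grouped are all hypothetical, and breaking scores down by a case attribute is only one possible reading of "hierarchical".

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical test case: one scene plus a shared action sequence."""
    case_id: str
    scene_type: str           # illustrative label, e.g. "indoor" or "outdoor"
    viewpoint: str            # illustrative label, e.g. "first_person"
    actions: Tuple[str, ...]  # shared WASD-style action tokens


# A metric scores one generated output against the case that produced it.
Metric = Callable[[EvalCase, Any], float]


class EvaluationToolkit:
    """Minimal sketch: named metrics applied uniformly to every model."""

    def __init__(self) -> None:
        self._metrics: Dict[str, Metric] = {}

    def register(self, name: str, fn: Metric) -> None:
        if name in self._metrics:
            raise ValueError(f"metric {name!r} is already registered")
        self._metrics[name] = fn

    def evaluate(self, cases: List[EvalCase],
                 generate: Callable[[EvalCase], Any]) -> Dict[str, float]:
        """Run one model (the `generate` callable) on all cases and average each metric."""
        if not cases:
            raise ValueError("need at least one test case")
        totals = {name: 0.0 for name in self._metrics}
        for case in cases:
            output = generate(case)
            for name, fn in self._metrics.items():
                totals[name] += fn(case, output)
        return {name: total / len(cases) for name, total in totals.items()}

    def evaluate_grouped(self, cases: List[EvalCase],
                         generate: Callable[[EvalCase], Any],
                         key: str) -> Dict[str, Dict[str, float]]:
        """Break scores down by a case attribute such as 'scene_type'."""
        groups: Dict[str, List[EvalCase]] = defaultdict(list)
        for case in cases:
            groups[getattr(case, key)].append(case)
        return {label: self.evaluate(group, generate) for label, group in groups.items()}
```

Because every registered metric runs over the same list of cases for every model, adding a metric never touches model-specific code. That separation is the design point the sketch is meant to illustrate.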

Findings

1. Enabled apples-to-apples comparison across six models on identical scenes.
2. Provided a diverse set of 500 evaluation cases covering various viewpoints and scene types.
3. Launched an online platform for live model comparison and benchmarking.

Abstract

Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions (identical scenes, identical action sequences, and a unified control interface) needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison…
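The action-mapping idea can be illustrated with a small Python sketch. It is an illustration built on assumptions, not the paper's implementation: the two native formats below (a per-step one-hot key vector and a per-step camera translation), the adapter classes, the model identifiers "model_a" and "model_b", and the step size are all hypothetical stand-ins for whatever control formats real models accept.

```python
from enum import Enum
from typing import Dict, List, Sequence, Tuple


class Action(Enum):
    """Shared WASD-style vocabulary used by every test case."""
    W = "W"  # move forward
    A = "A"  # move left
    S = "S"  # move backward
    D = "D"  # move right


def parse_actions(tokens: str) -> List[Action]:
    """Turn a string such as 'WWD' into Action members; unknown keys raise ValueError."""
    return [Action(t) for t in tokens.upper()]


class OneHotKeyAdapter:
    """Hypothetical native format 1: one-hot key vector per step, ordered W, A, S, D."""
    KEYS = (Action.W, Action.A, Action.S, Action.D)

    def to_native(self, actions: Sequence[Action]) -> List[List[int]]:
        return [[1 if a is key else 0 for key in self.KEYS] for a in actions]


class CameraDeltaAdapter:
    """Hypothetical native format 2: per-step camera translation (dx, dz), +z forward, +x right."""
    DIRECTIONS: Dict[Action, Tuple[float, float]] = {
        Action.W: (0.0, 1.0),
        Action.S: (0.0, -1.0),
        Action.A: (-1.0, 0.0),
        Action.D: (1.0, 0.0),
    }

    def __init__(self, step: float = 0.1) -> None:
        self.step = step  # assumed distance moved per action step

    def to_native(self, actions: Sequence[Action]) -> List[Tuple[float, float]]:
        return [(dx * self.step, dz * self.step)
                for dx, dz in (self.DIRECTIONS[a] for a in actions)]


# Hypothetical registry: model identifier -> adapter for that model's control format.
ADAPTERS = {
    "model_a": OneHotKeyAdapter(),
    "model_b": CameraDeltaAdapter(step=0.5),
}


def translate(tokens: str, model: str) -> list:
    """Map one shared action string into the named model's native control format."""
    return ADAPTERS[model].to_native(parse_actions(tokens))
```

For example, translate("WWD", "model_a") yields [[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]], while translate("WWD", "model_b") yields [(0.0, 0.5), (0.0, 0.5), (0.5, 0.0)]. Because test cases store only the shared tokens, supporting another model means writing one adapter. The sketch sidesteps the harder question of calibrating magnitudes (here the assumed `step`) so that one shared action produces comparable motion across models.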
