GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Mingyu Ouyang; Siyuan Hu; Kevin Qinghong Lin; Hwee Tou Ng; Mike Zheng Shou

arXiv:2604.07429·cs.CV·April 10, 2026

TL;DR

GameWorld is a benchmark for standardized, verifiable evaluation of multimodal game agents in browser environments. It targets two obstacles to systematic evaluation: heterogeneous action interfaces and heuristic verification.

Contribution

The paper introduces a benchmark of diverse games and tasks, with state-verifiable metrics, for systematically evaluating multimodal game agents.

Findings

01. Even the best agents lag behind human capabilities.

02. Benchmark reruns show robustness of evaluation results.

03. Studies reveal challenges in real-time interaction and action validity.

Abstract

Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld…
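
To make the second interface more concrete, here is a minimal sketch of what a deterministic mapping from semantic actions to keyboard and mouse events could look like. It is not the paper's implementation: the action names, the `KEYMAP` table, the `LowLevelEvent` type, and the `parse_semantic_action` function are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical semantic-action vocabulary (an assumption, not from the paper).
KEYMAP = {
    "move_left": ["ArrowLeft"],
    "move_right": ["ArrowRight"],
    "jump": ["Space"],
}


@dataclass(frozen=True)
class LowLevelEvent:
    """A keyboard or mouse event of the kind a computer-use agent would emit."""
    kind: str                 # "key" or "click"
    key: Optional[str] = None
    x: Optional[int] = None
    y: Optional[int] = None


# Matches strings such as "click(120, 45)".
_CLICK = re.compile(r"^click\((\d+),\s*(\d+)\)$")


def parse_semantic_action(text: str) -> List[LowLevelEvent]:
    """Map one semantic action string to low-level events.

    The mapping is a pure table lookup plus one regex, so the same input
    always yields the same events. Unrecognized actions return an empty
    list, which a harness could count as an invalid action.
    """
    action = text.strip().lower()
    if action in KEYMAP:
        return [LowLevelEvent("key", key=k) for k in KEYMAP[action]]
    match = _CLICK.match(action)
    if match:
        return [LowLevelEvent("click", x=int(match.group(1)), y=int(match.group(2)))]
    return []


if __name__ == "__main__":
    print(parse_semantic_action("jump"))
    print(parse_semantic_action("click(120, 45)"))
    print(parse_semantic_action("fly"))
```

Because the parser involves no model call or randomness, any difference in outcomes between two agents under this kind of scheme would come from the actions they choose rather than from how those actions are translated into controls.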
