WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

Wenyu Zhang; Guoliang You; Tianlun; Haotian Zhao; Tianshu Zhu; Haoran Wang; Xiaoxuan Tang; Mingyang Dai; Jingnan Gu; Daxiang Dong; and Jianmin Wu

arXiv:2605.17637 · cs.AI · May 19, 2026

TL;DR

WebGameBench is a benchmark that evaluates whether coding agents can generate browser-native games from specifications, judging the delivered application by its runtime behavior and usability.

Contribution

The paper introduces WebGameBench, the first requirement-to-application benchmark for browser-native games, with an evaluator that runs in a real browser and human-aligned usability labels.

Findings

01

The best system achieves a 76.9% usable rate but only a 20.2% excellent rate.

02

WebGameBench effectively distinguishes current coding systems.

03

Crossing the minimal playable threshold remains a challenge.

Abstract

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE.…
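To make the three-way labeling concrete, here is a minimal sketch of how per-game labels might be aggregated into the usable and excellent rates quoted in the findings. The `Label` enum, the `summarize` function, and the sample data are hypothetical and are not taken from the paper. In particular, the sketch assumes that an EXCELLENT game also counts toward the usable rate, which the abstract does not state.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    """The three labels named in the abstract."""
    EXCELLENT = "EXCELLENT"
    USABLE = "USABLE"
    UNUSABLE = "UNUSABLE"


def summarize(labels):
    """Return (usable_rate, excellent_rate) as fractions of all evaluated games."""
    if not labels:
        raise ValueError("no labels to summarize")
    counts = Counter(labels)
    total = len(labels)
    excellent = counts[Label.EXCELLENT]
    # Assumption (not confirmed by the source): EXCELLENT games are
    # also counted as usable, so the usable rate is "USABLE or better".
    usable_or_better = excellent + counts[Label.USABLE]
    return usable_or_better / total, excellent / total


# Made-up labels for four generated games, for illustration only.
labels = [Label.EXCELLENT, Label.USABLE, Label.USABLE, Label.UNUSABLE]
usable_rate, excellent_rate = summarize(labels)
print(f"usable: {usable_rate:.1%}, excellent: {excellent_rate:.1%}")
# prints "usable: 75.0%, excellent: 25.0%"
```

If the paper instead treats the three labels as mutually exclusive when reporting rates, the usable count would simply drop the `excellent` term.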
