GameDevBench: Evaluating Agentic Capabilities Through Game Development

Wayne Chi; Yixiong Fang; Arnav Yayavaram; Siddharth Yayavaram; Seth Karten; Qiuhong Anna Wei; Runkun Chen; Alexander Wang; Valerie Chen; Ameet Talwalkar; Chris Donahue

arXiv:2602.11103·cs.AI·February 12, 2026

TL;DR

GameDevBench is a new benchmark for evaluating multimodal agents on game development tasks. It highlights where current agents struggle and shows that simple feedback mechanisms improve agent performance.

Contribution

The paper introduces GameDevBench, the first comprehensive benchmark for multimodal game development tasks, and proposes simple feedback methods that enhance agent capabilities.

Findings

01. Agents solve only 54.5% of tasks

02. Task difficulty correlates with multimodal complexity

03. Feedback mechanisms improve agent performance

Abstract

Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed, as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times as many lines of code and file changes as solutions in prior software development benchmarks. Agents still struggle with game development, with the…

Taxonomy

Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Speech and dialogue systems