GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games

Yuchen Li; Cong Lin; Muhammad Umair Nasir; Philip Bontrager; Jialin Liu; Julian Togelius

arXiv:2508.08501·cs.AI·May 19, 2026

TL;DR

GVGAI-LLM is a new benchmark using diverse arcade-style games to evaluate large language models' reasoning, problem-solving, and spatial understanding, revealing current limitations and guiding future improvements.

Contribution

The paper introduces a scalable, game-based benchmark for assessing LLMs' reasoning and spatial skills, with interpretable metrics and the ability to generate infinite test scenarios.

Findings

01. LLMs show persistent spatial and logical errors.

02. Interventions like structured prompting lead to partial improvements.

03. Benchmark reveals significant gaps in LLM reasoning capabilities.

Abstract

We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model's ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a video game description language that enables the rapid creation of new games (including rules and levels), helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across 118 games with diverse challenges and skill depth, we reveal persistent limitations of…
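
To make the ASCII scene representation concrete, here is a minimal Python sketch of how a grid-based game state might be flattened into a compact block of characters for a language model. Everything in it is an assumption for illustration: the symbol table `SYMBOLS`, the helper `render_scene`, and the function `changed_step_ratio` are hypothetical names, not part of GVGAI-LLM. In particular, `changed_step_ratio` is only a stand-in that counts steps whose frame differs from the previous one; the abstract names a "meaningful step ratio" but does not define it, so this should not be read as the paper's metric.

```python
from typing import Dict, List, Tuple

# Hypothetical symbol table: the benchmark's real character mapping is not given here.
SYMBOLS: Dict[str, str] = {"wall": "w", "avatar": "A", "goal": "g", "floor": "."}


def render_scene(width: int, height: int, sprites: Dict[Tuple[int, int], str]) -> str:
    """Render a grid as ASCII rows; cells without a sprite are drawn as floor."""
    rows = []
    for y in range(height):
        row = "".join(SYMBOLS[sprites.get((x, y), "floor")] for x in range(width))
        rows.append(row)
    return "\n".join(rows)


def changed_step_ratio(frames: List[str]) -> float:
    """Assumed stand-in metric: share of steps whose frame differs from the previous one."""
    steps = len(frames) - 1
    if steps <= 0:
        return 0.0
    changed = sum(1 for prev, cur in zip(frames, frames[1:]) if prev != cur)
    return changed / steps


if __name__ == "__main__":
    scene = render_scene(3, 2, {(0, 0): "wall", (1, 0): "avatar", (2, 1): "goal"})
    print(scene)
```

A text rendering like this keeps each observation to a few dozen characters, which is the efficiency argument the abstract makes for ASCII scenes.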

Taxonomy

Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics