From Code to Play: Benchmarking Program Search for Games Using Large Language Models

Manuel Eberhardinger; James Goodman; Alexander Dockhorn; Diego Perez-Liebana; Raluca D. Gaina; Duygu Çakmak; Setareh Maghsudi; Simon Lucas

arXiv:2412.04057·cs.AI·July 16, 2025

Open Access

TL;DR

This paper evaluates how well large language models generate usable code for a range of game-related tasks in Python and Java, using an evolutionary hill-climbing approach, and finds that performance varies more by task than by model size.

Contribution

It introduces a framework combining LLMs with evolutionary algorithms to synthesize game code across multiple tasks and languages, highlighting the importance of model diversity over size.

Findings

01. Model performance varies more by task than by model size.

02. Larger models produce more executable code, but that code is not necessarily of higher quality.

03. Using multiple models and selecting the best results improves reliability.
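The third finding can be illustrated with a minimal sketch of best-of-several selection. Everything here is an assumption made for illustration: the function name `select_best`, the model names, and the scores are placeholders and are not taken from the paper. A score of `None` stands in for a model whose generated program failed to execute.

```python
from typing import Dict, Optional, Tuple


def select_best(results: Dict[str, Optional[float]]) -> Tuple[str, float]:
    """Return the (model, score) pair with the highest score.

    `results` maps a model name to the score of its best program on one task,
    or to None when the model produced no executable program (an assumed
    convention for this sketch).
    """
    # Drop models that produced nothing runnable for this task.
    valid = {model: score for model, score in results.items() if score is not None}
    if not valid:
        raise ValueError("no model produced an executable program")
    best_model = max(valid, key=lambda model: valid[model])
    return best_model, valid[best_model]


# Placeholder values only; these are not results reported in the paper.
example_results = {"model_a": 0.2, "model_b": None, "model_c": 0.7}
best_model, best_score = select_best(example_results)
```

Because a single failing model only removes one candidate, the selection still returns a usable program as long as at least one model succeeds on the task.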

Abstract

Large language models (LLMs) have shown impressive capabilities in generating program code, opening exciting opportunities for applying program synthesis to games. In this work, we explore the potential of LLMs to directly synthesize usable code for a wide range of gaming applications, focusing on two programming languages, Python and Java. We use an evolutionary hill-climbing algorithm, where the mutations and seeds of the initial programs are controlled by LLMs. For Python, the framework covers various game-related tasks, including five miniature versions of Atari games, ten levels of Baba is You, an environment inspired by Asteroids, and a maze generation task. For Java, the framework contains 12 games from the TAG tabletop games framework. Across 29 tasks, we evaluated 12 language models for Python and 8 for Java. Our findings suggest that the performance of LLMs depends more on the…
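The search loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `stub_seed` and `make_stub_mutate` are hypothetical stand-ins for the LLM calls that would produce and revise programs, and `score_program` is a toy fitness function replacing a real game evaluation.

```python
import random
from typing import Callable, Tuple


def hill_climb(
    seed_fn: Callable[[], str],
    mutate_fn: Callable[[str], str],
    score_fn: Callable[[str], float],
    iterations: int,
) -> Tuple[str, float]:
    """Keep a single best program and replace it only when a mutation scores higher."""
    best = seed_fn()
    best_score = score_fn(best)
    for _ in range(iterations):
        candidate = mutate_fn(best)
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


def stub_seed() -> str:
    # Hypothetical stand-in for an LLM call that writes an initial program.
    return "def policy(x):\n    return 0\n"


def make_stub_mutate(rng: random.Random) -> Callable[[str], str]:
    def stub_mutate(program: str) -> str:
        # Hypothetical stand-in for an LLM call that revises `program`.
        # This stub ignores its input and emits a random variant.
        constant = rng.randint(-10, 10)
        return f"def policy(x):\n    return x + {constant}\n"
    return stub_mutate


def score_program(program: str) -> float:
    """Toy fitness: negative total error against the target x + 3.

    Programs that fail to run or lack `policy` get the worst possible score.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)
        errors = [abs(namespace["policy"](x) - (x + 3)) for x in range(5)]
        return -float(sum(errors))
    except Exception:
        return float("-inf")


best_program, best_score = hill_climb(
    stub_seed, make_stub_mutate(random.Random(0)), score_program, iterations=200
)
```

Scoring non-executable programs as negative infinity means a broken mutation can never replace a working one, which is one simple way to handle generated code that does not run.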

Taxonomy

Topics: Educational Games and Gamification · Digital Games and Media · Artificial Intelligence in Games