Measuring General Intelligence with Generated Games

Vivek Verma; David Huang; William Chen; Dan Klein; Nicholas Tomlin

arXiv:2505.07215 · cs.AI · May 13, 2025

TL;DR

This paper introduces gg-bench, a benchmark of dynamically generated game environments for evaluating general reasoning in language models. An LLM writes a description of each game and implements it in code, and reinforcement learning agents trained via self-play on those games serve as the opponents.

Contribution

The paper presents gg-bench as a data-generating process rather than a static benchmark: an LLM can generate new game environments at will, so reasoning capabilities in language models can keep being evaluated on fresh instances.

Findings

01. State-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on gg-bench.

02. Reasoning models achieve winrates of 31-36%.

03. gg-bench is challenging and supports future research.

Abstract

We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data generating process where new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their winrate against these RL agents by prompting models with the game description, current board state, and a list of valid moves, after which models output the moves they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on…
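The evaluation loop described in the abstract (show the model the game description, the current state, and the valid moves, then score it by winrate against an opponent) can be sketched as follows. This is a minimal illustration under stated assumptions, not the gg-bench code: `TakeawayGame`, `build_prompt`, `model_policy`, and `make_random_opponent` are hypothetical names, the toy game is not one of the generated games, the prompt format is invented, a fixed heuristic stands in for the language model, and a random player stands in for the self-play-trained RL agent.

```python
import random
from typing import Callable, List


class TakeawayGame:
    """Hypothetical toy game, used only to make the loop runnable."""

    description = (
        "Two players alternate removing 1, 2, or 3 tokens from a pile. "
        "The player who removes the last token wins."
    )

    def __init__(self, pile: int = 10) -> None:
        self.start_pile = pile
        self.pile = pile
        self.current_player = 0

    def reset(self) -> int:
        self.pile = self.start_pile
        self.current_player = 0
        return self.pile

    def valid_moves(self) -> List[int]:
        return [n for n in (1, 2, 3) if n <= self.pile]

    def step(self, action: int) -> bool:
        """Apply a move; return True if the game just ended."""
        if action not in self.valid_moves():
            raise ValueError(f"invalid move: {action}")
        self.pile -= action
        if self.pile == 0:
            return True  # the player who just moved took the last token
        self.current_player = 1 - self.current_player
        return False


Policy = Callable[[TakeawayGame], int]


def build_prompt(env: TakeawayGame) -> str:
    """Assumed prompt layout: description, current state, valid moves."""
    return (
        f"Game description: {env.description}\n"
        f"Current state: {env.pile} tokens remaining\n"
        f"Valid moves: {env.valid_moves()}\n"
        "Reply with one valid move."
    )


def model_policy(env: TakeawayGame) -> int:
    """Placeholder for an LLM call. A real harness would send
    build_prompt(env) to a model and parse the move from its reply."""
    remainder = env.pile % 4
    moves = env.valid_moves()
    return remainder if remainder in moves else moves[0]


def make_random_opponent(seed: int) -> Policy:
    """Stand-in for the RL agent: picks a uniformly random valid move."""
    rng = random.Random(seed)
    return lambda env: rng.choice(env.valid_moves())


def play_game(env: TakeawayGame, model: Policy, opponent: Policy) -> int:
    """Play one game with the model as player 0; return the winner's index."""
    env.reset()
    policies = (model, opponent)
    while True:
        mover = env.current_player
        if env.step(policies[mover](env)):
            return mover


def winrate(model: Policy, opponent: Policy, n_games: int = 50) -> float:
    """Fraction of games won by the model (player 0)."""
    env = TakeawayGame()
    wins = sum(play_game(env, model, opponent) == 0 for _ in range(n_games))
    return wins / n_games
```

Calling `winrate(model_policy, make_random_opponent(0))` returns the fraction of games the stand-in model wins. Having the model always move first is a simplification of this sketch; how gg-bench assigns turn order is not stated in the abstract.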

Code & Models

Repositories

vivek3141/gg-bench (Official)

Taxonomy

Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics