PuzzleJAX: A Benchmark for Reasoning and Learning

Sam Earle; Graham Todd; Yuchen Li; Ahmed Khalifa; Muhammad Umair Nasir; Zehua Jiang; Andrzej Banburski-Fahey; Julian Togelius

arXiv:2508.16821·cs.AI·August 26, 2025

3 Reviews

TL;DR

PuzzleJAX is a GPU-accelerated engine and language that enables rapid benchmarking of reasoning and learning algorithms across a broad and expressive set of puzzle games, supporting diverse AI research.

Contribution

It introduces a flexible, domain-specific language and engine for dynamic game compilation, expanding benchmarking capabilities beyond fixed game sets.

Findings

1. Demonstrates PuzzleJAX's coverage of hundreds of diverse puzzle games.
2. Shows that tasks vary from simple to deeply challenging, requiring complex reasoning.
3. Validates the platform's utility for evaluating search, learning, and language models.

Abstract

We introduce PuzzleJAX, a GPU-accelerated puzzle game engine and description language designed to support rapid benchmarking of tree search, reinforcement learning, and LLM reasoning abilities. Unlike existing GPU-accelerated learning environments that provide hard-coded implementations of fixed sets of games, PuzzleJAX allows dynamic compilation of any game expressible in its domain-specific language (DSL). This DSL follows PuzzleScript, which is a popular and accessible online game engine for designing puzzle games. In this paper, we validate in PuzzleJAX several hundred of the thousands of games designed in PuzzleScript by both professional designers and casual creators since its release in 2013, thereby demonstrating PuzzleJAX's coverage of an expansive, expressive, and human-relevant space of tasks. By analyzing the performance of search, learning, and language models on these…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. Reimplementing PuzzleScript in JAX provides hardware-accelerated simulation, yielding 2×–16× speedups over JavaScript implementations.
2. The system supports automatic generation and compilation of new puzzle rulesets, allowing continuous benchmark expansion and procedural diversity.

Weaknesses

1. PuzzleJAX is a benchmark of over 500 diverse game environments, but the paper does not provide clear metrics or details on how to use it as a benchmark. For example, if LLM developers were to use PuzzleJAX, how should they evaluate their models? What pipeline and parameters should they use, what metrics should they collect, and how should they compare against other LLM-based solutions?
2. No comparison against other gaming benchmarks is shown.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

* The paper is clearly written and easy to follow.
* It brings a large family of human-designed, tile-based puzzles into a single GPU-friendly framework, avoiding overfitting to one game while keeping action and observation spaces uniform.
* Clear speedups over the baseline engine (2×–16×), especially at scale; Fig. 2 (p. 4) visualizes the throughput gains.

Weaknesses

* Less than half of the scraped games validate end-to-end; many levels fail with state/solution errors, which may limit stability in agentic RLVR.
* No curated set of PuzzleScript games is released (engine only), which complicates comparability across papers unless the community converges on a shared subset/split.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The creation of a large-scale and computationally efficient benchmark is an appreciated contribution. The benchmark is a rich, human-relevant space of tasks, avoiding the pitfalls of toy problems. The ability to automatically compile thousands of diverse environments addresses a critical need for testing generalization and avoiding benchmark overfitting.
- The reimplementation of the PuzzleScript engine in JAX is valuable and non-trivial. The insight to model the rewrite rules as convolutional
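
The idea of treating rewrite rules as convolutions, which the (truncated) strength above refers to, can be illustrated with a small sketch. This is not PuzzleJAX's implementation: the one-hot grid encoding, the object channels, and the `one_hot` and `match_rule` helpers are all assumptions made for illustration, and the code is plain Python rather than JAX so that it stays self-contained. A left-hand-side pattern such as `[ Player | Crate ]` becomes a 1×2 kernel over object channels, and a position matches when the cross-correlation response equals the number of objects the pattern requires.

```python
# Hypothetical object channels; names are illustrative only.
PLAYER, CRATE = 0, 1
NUM_CHANNELS = 2


def one_hot(level):
    """Encode a list of strings as grid[channel][row][col] of 0/1."""
    symbols = {"P": PLAYER, "C": CRATE}
    height, width = len(level), len(level[0])
    grid = [[[0] * width for _ in range(height)] for _ in range(NUM_CHANNELS)]
    for r, row in enumerate(level):
        for c, ch in enumerate(row):
            if ch in symbols:
                grid[symbols[ch]][r][c] = 1
    return grid


def match_rule(grid, kernel):
    """Cross-correlate the kernel with the grid (no padding).

    kernel[channel][dr][dc] is 1 where the pattern requires an object.
    A position matches when the response equals the number of required
    objects, i.e. every required object is present.
    """
    k_h, k_w = len(kernel[0]), len(kernel[0][0])
    height, width = len(grid[0]), len(grid[0][0])
    required = sum(v for ch in kernel for row in ch for v in row)
    matches = []
    for r in range(height - k_h + 1):
        for c in range(width - k_w + 1):
            response = sum(
                kernel[ch][dr][dc] * grid[ch][r + dr][c + dc]
                for ch in range(NUM_CHANNELS)
                for dr in range(k_h)
                for dc in range(k_w)
            )
            if response == required:
                matches.append((r, c))
    return matches


# Pattern "[ Player | Crate ]": player in the left cell, crate in the right.
PLAYER_NEXT_TO_CRATE = [
    [[1, 0]],  # PLAYER channel
    [[0, 1]],  # CRATE channel
]

level = [
    "..PC",
    "C.P.",
    ".PC.",
]
matches = match_rule(one_hot(level), PLAYER_NEXT_TO_CRATE)
print(matches)
```

On a GPU the nested loops would presumably be replaced by a single batched convolution over the whole board, which is what makes this formulation attractive for hardware acceleration.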

Weaknesses

- The RL baseline is PPO with a heuristic reward based on distance-to-win conditions. While a standard choice, it is well-known that such agents struggle with hard-exploration problems and sparse rewards, which many of these puzzles represent. The conclusion that "RL struggles" could be strengthened by including or at least discussing more sophisticated exploration methods (e.g., RND, ICM) or model-based RL approaches (e.g., MuZero-style planning) that integrate search. Without this, the paper p
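
To make the "heuristic reward based on distance-to-win conditions" concrete, here is a minimal sketch under stated assumptions. It assumes a Sokoban-style win condition ("every crate on a target") and uses the summed Manhattan distance from each crate to its nearest target as the heuristic; the function names `distance_to_win` and `shaped_reward` are hypothetical, and the exact heuristic used in the paper is not given in this excerpt.

```python
def distance_to_win(crates, targets):
    """Sum over crates of the Manhattan distance to the nearest target."""
    return sum(
        min(abs(cr - tr) + abs(cc - tc) for tr, tc in targets)
        for cr, cc in crates
    )


def shaped_reward(prev_crates, next_crates, targets):
    """Reward the agent for reducing the distance-to-win heuristic."""
    return distance_to_win(prev_crates, targets) - distance_to_win(next_crates, targets)


targets = [(0, 3), (2, 0)]
before = [(0, 1), (2, 2)]
after = [(0, 2), (2, 2)]  # the first crate moved one cell towards (0, 3)

reward = shaped_reward(before, after, targets)
print(reward)
```

A reward of this kind is dense near the goal but says nothing about irreversible mistakes such as pushing a crate into a corner, which is one reason such agents can struggle on hard-exploration puzzles.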
