The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

Simon Henniger; Gabriel Poesia

arXiv:2602.17831·cs.AI·May 19, 2026

TL;DR

The paper introduces The Token Games (TTG), a framework in which language models challenge each other with puzzles they create themselves, enabling low-cost, dynamic evaluation of reasoning and creativity without human curation.

Contribution

The paper presents a self-play evaluation method based on puzzle duels and Elo ratings, which reduces reliance on human-crafted questions and assesses both puzzle solving and puzzle creation.
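To make the rating step concrete, here is a minimal sketch of the standard Elo update applied to a single duel outcome. The K-factor of 32, the 400-point scale, and the function names are assumptions chosen for illustration; the paper may fit its ratings differently (for example, over all duels at once rather than sequentially).

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of player A against player B (standard Elo)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one duel.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k is an assumed K-factor, not a value taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated models; A wins the duel.
    print(elo_update(1500.0, 1500.0, 1.0))
```

Because each update moves the two ratings by equal and opposite amounts, the total rating in the pool is conserved, which is what lets models be compared relative to each other.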

Findings

01 Ten frontier models were evaluated, with results matching existing benchmarks

02 Creating high-quality puzzles remains a significant challenge for current models

03 The framework enables testing of reasoning, creativity, and task-creation skills

Abstract

Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each…
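The Programming Puzzle format described in the abstract can be illustrated with a small sketch: a puzzle is a function that returns a boolean, and a solution is any input that makes it return True. The specific puzzle and the verify helper below are hypothetical examples, not taken from the paper, and the choice of Python is an assumption.

```python
from typing import Any, Callable


def puzzle(s: str) -> bool:
    """Hypothetical puzzle: find a 5-character palindrome with exactly two 'a's."""
    return len(s) == 5 and s == s[::-1] and s.count("a") == 2


def verify(puzzle_fn: Callable[[Any], bool], candidate: Any) -> bool:
    """Accept a candidate only if the puzzle returns exactly True without raising.

    Treating exceptions as failures is an illustrative choice, not the paper's rule.
    """
    try:
        return puzzle_fn(candidate) is True
    except Exception:
        return False


if __name__ == "__main__":
    print(verify(puzzle, "abcba"))  # a valid solution
    print(verify(puzzle, "hello"))  # not a palindrome
```

Checking a proposed answer only requires running the function, so solutions can be verified mechanically even when the puzzle itself is hard to solve.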

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education