The Token Games: Evaluating Language Model Reasoning with Puzzle Duels
Simon Henniger, Gabriel Poesia

TL;DR
The paper introduces The Token Games, an innovative framework where language models challenge each other with self-created puzzles, enabling cost-effective, dynamic evaluation of reasoning and creativity without human curation.
Contribution
It presents a novel self-play evaluation method using puzzle duels and Elo ratings, reducing reliance on human-crafted questions and enabling comprehensive reasoning assessment.
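As a rough illustration of how duel outcomes can be turned into ratings, the sketch below applies the standard sequential Elo update. The summary here does not say how the paper fits its ratings, so the function names, the K-factor of 32, and the starting rating of 1500 are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one duel.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    k=32 is an assumed K-factor, not a value taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypothetical models start at 1500; model A wins the duel.
print(update_elo(1500.0, 1500.0, 1.0))
```

Because the update is zero-sum, whatever one model gains the other loses, so ratings only carry meaning relative to the other models in the pool.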
Findings
Ten frontier models were evaluated, and the results match existing benchmarks
Creating high-quality puzzles remains a significant challenge for current models
The framework enables testing reasoning, creativity, and task creation skills
Abstract
Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each…
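To make the Programming Puzzles format concrete, here is a minimal sketch: a puzzle is a function returning a boolean, and a solution is any input that makes it return True. The specific puzzle, the `verify` helper, and the candidate answers are hypothetical examples, not taken from the paper.

```python
def puzzle(s: str) -> bool:
    """Hypothetical puzzle: a 5-character palindrome with exactly two 'a's."""
    return len(s) == 5 and s == s[::-1] and s.count("a") == 2


def verify(puzzle_fn, candidate) -> bool:
    """Check a proposed solution by running the puzzle on it.

    Any exception raised by the puzzle counts as a failed attempt.
    """
    try:
        return puzzle_fn(candidate) is True
    except Exception:
        return False


print(verify(puzzle, "abcba"))  # a valid solution
print(verify(puzzle, "hello"))  # not a palindrome
```

Verification only requires executing the puzzle on the proposed input, which is what lets solutions be checked without a human-written answer key. A real harness would also need sandboxing and time limits, which this sketch omits.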
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education
