Sudoku-Bench: Evaluating creative reasoning with Sudoku variants

Jeffrey Seely; Yuki Imajuku; Tianyu Zhao; Edoardo Cetin; Llion Jones

arXiv:2505.16135 · cs.AI · May 23, 2025

TL;DR

Sudoku-Bench is a new benchmark designed to evaluate creative and multi-step logical reasoning in large language models using challenging Sudoku variants that require novel problem-solving strategies.

Contribution

The paper introduces Sudoku-Bench, a curated set of unconventional Sudoku puzzles that effectively assess creative reasoning and provide tools for broad research application.

Findings

01. State-of-the-art LLMs solve fewer than 15% of the puzzles unaided.

02. Sudoku variants resist memorization and require novel logical breakthroughs.

03. The benchmark enables consistent evaluation of reasoning abilities.

Abstract

Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs ("break-ins"). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly…

Code & Models

Repositories

sakanaai/sudoku-bench (official)

Datasets

Kanzoet97/Sumo (dataset, 6 downloads)

Taxonomy

Topics: graph theory and CDMA systems