Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

Justin Waugh

arXiv:2603.02119 · cs.AI · March 3, 2026

TL;DR

Pencil Puzzle Bench is a new benchmark for evaluating large language models' multi-step reasoning and verification abilities using a diverse set of constraint-satisfaction puzzles with deterministic, step-level validation.

Contribution

It introduces a comprehensive puzzle benchmark with verified solutions and intermediate state checks, enabling detailed evaluation of reasoning effort and iterative problem-solving in language models.

Findings

01

GPT-5.2's performance improves 81x as reasoning effort increases.

02

Claude Opus 4.6 improves from 0.3% to 30.0% success with iteration.

03

Longest agentic reasoning attempts exceed 1,200 turns and 14 hours.

Abstract

We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints. This localizes errors to the exact rule violated and provides the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort…
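
To make the idea of step-level verification concrete, here is a minimal sketch in Python. It is not the benchmark's code or API: the toy puzzle (a 3x3 grid in which no digit may repeat in a row or column), the rule names, and the functions `violated_rules` and `check_moves` are all assumptions made for illustration. It only shows how a partially filled board can be checked after every move, how a failure can be traced to a named rule, and how that yields one signal per move.

```python
from typing import List, Optional, Tuple

# Hypothetical toy puzzle, NOT one of the benchmark's varieties:
# fill an n x n grid so that no value repeats in any row or column.
Board = List[List[Optional[int]]]  # None marks an empty cell
Move = Tuple[int, int, int]        # (row, column, value)


def violated_rules(board: Board) -> List[str]:
    """Return names of the rules broken by a (possibly partial) board."""
    n = len(board)
    violations = []
    for i in range(n):
        row = [v for v in board[i] if v is not None]
        col = [board[r][i] for r in range(n) if board[r][i] is not None]
        if len(row) != len(set(row)):
            violations.append(f"row_unique[{i}]")
        if len(col) != len(set(col)):
            violations.append(f"col_unique[{i}]")
    return violations


def check_moves(start: Board, moves: List[Move]) -> List[Tuple[Move, List[str]]]:
    """Apply moves one at a time and record the violations after each."""
    board = [row[:] for row in start]  # work on a copy of the start state
    trace = []
    for r, c, v in moves:
        board[r][c] = v
        trace.append(((r, c, v), violated_rules(board)))
    return trace


start: Board = [[1, None, None], [None, None, None], [None, None, None]]
moves: List[Move] = [(0, 1, 2), (1, 0, 1), (1, 1, 3)]

trace = check_moves(start, moves)
# One signal per move: 1.0 if the board is still consistent, else 0.0.
rewards = [0.0 if broken else 1.0 for _, broken in trace]

for (move, broken), reward in zip(trace, rewards):
    print(move, broken, reward)
```

In this sketch the second move places a second 1 in column 0, so the checker reports `col_unique[0]` from that move onward. An agentic loop could return such a message to the model after each turn, while a training setup could use the per-move values as a dense signal; how the benchmark itself does either is not specified in the text above.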

Code & Models

Datasets

bluecoconut/pencil-puzzle-bench
dataset · 388 downloads

Taxonomy

Topics: Constraint Satisfaction and Optimization · Natural Language Processing Techniques · AI-based Problem Solving and Planning