Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
Justin Waugh

TL;DR
Pencil Puzzle Bench is a new benchmark for evaluating large language models' multi-step reasoning and verification abilities using a diverse set of constraint-satisfaction puzzles with deterministic, step-level validation.
Contribution
It introduces a comprehensive puzzle benchmark with verified solutions and intermediate state checks, enabling detailed evaluation of reasoning effort and iterative problem-solving in language models.
Findings
GPT-5.2's performance improves by 81x as reasoning effort increases.
Claude Opus 4.6 improves from 0.3% to 30.0% success with iteration.
Longest agentic reasoning attempts exceed 1,200 turns and 14 hours.
Abstract
We introduce Pencil Puzzle Bench, a framework for evaluating large language model reasoning through pencil puzzles, a family of constraint-satisfaction problems closely related to NP-complete problems, with deterministic, step-level verification. From a database of 62,231 puzzles across 94 varieties with verified unique solutions, we select a benchmark of 300 puzzles spanning 20 varieties and evaluate 51 models from 11 providers in two modes: direct ask (single-shot) and agentic (multi-turn with iterative verification). A key differentiator of our benchmark is that every intermediate board state can be checked against variety-specific constraints, localizing errors to the exact rule violated, providing the infrastructure for dense, per-move reward signals for process supervision and reinforcement learning. Our evaluation reveals two distinct axes of capability: (1) reasoning effort…
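To make the idea of step-level verification concrete, here is a minimal sketch in Python. It is not the benchmark's actual code or API: the toy rule set (no repeated value in any row or column, as in a Latin square) and the names `check_state` and `replay` are assumptions chosen purely for illustration. It shows how checking every intermediate board state can name the violated rule and yield one reward per move.

```python
from typing import List, Optional, Tuple

Board = List[List[int]]        # 0 marks an empty cell
Move = Tuple[int, int, int]    # (row, column, value)


def check_state(board: Board) -> Optional[str]:
    """Return a description of the first violated rule, or None if the
    partially filled board is still consistent.
    Hypothetical toy rules: no repeated value in any row or column."""
    n = len(board)
    for r in range(n):
        filled = [v for v in board[r] if v != 0]
        if len(filled) != len(set(filled)):
            return f"row {r}: duplicate value"
    for c in range(n):
        filled = [board[r][c] for r in range(n) if board[r][c] != 0]
        if len(filled) != len(set(filled)):
            return f"column {c}: duplicate value"
    return None


def replay(board: Board, moves: List[Move]) -> List[Tuple[float, Optional[str]]]:
    """Apply moves one at a time and verify the board after each one.
    Returns one (reward, violation) pair per move, stopping at the first error."""
    board = [row[:] for row in board]  # work on a copy, leave the input untouched
    trace = []
    for r, c, v in moves:
        board[r][c] = v
        violation = check_state(board)
        trace.append((0.0 if violation else 1.0, violation))
        if violation:
            break
    return trace


start = [[1, 0, 0],
         [0, 0, 0],
         [0, 0, 0]]
moves = [(0, 1, 2), (1, 0, 2), (1, 1, 2)]  # the third move repeats 2 in row 1
trace = replay(start, moves)
for step, (reward, violation) in enumerate(trace, start=1):
    print(step, reward, violation)
```

In this sketch the first two moves each earn a reward of 1.0, and the third is flagged with the specific rule it breaks. That is the kind of dense, per-move signal the abstract describes, although a real variety would need its own, richer constraint checker.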
Taxonomy
Topics: Constraint Satisfaction and Optimization · Natural Language Processing Techniques · AI-based Problem Solving and Planning
