PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
Hengzhi Li, Justin Zhang, Brendon Jiang, Alexander Naehu, Regan Song, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang

TL;DR
PuzzleWorld is a benchmark of 667 multimodal, open-ended puzzlehunt-style problems for evaluating model reasoning. It shows where current models fall short and provides annotations for diagnosing those failures.
Contribution
The paper introduces PuzzleWorld, a benchmark in which each puzzle is annotated with its final solution, a detailed reasoning trace, and cognitive skill labels, and shows that state-of-the-art models struggle with open-ended reasoning tasks.
Findings
Most models achieve only 1-4% final answer accuracy.
The best model solves 18% of puzzles and reaches 40% stepwise accuracy.
Fine-tuning on reasoning traces improves stepwise accuracy from 4% to 11%.
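The findings above report two separate metrics: final answer accuracy and stepwise accuracy. The Python sketch below shows one way they could be computed. It is a minimal illustration, not the paper's evaluation code. The `PuzzleResult` record, its field names, the answer normalization, and the definition of stepwise accuracy as the mean fraction of annotated steps completed are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PuzzleResult:
    """Hypothetical per-puzzle record; not the benchmark's actual schema."""
    predicted_answer: str
    gold_answer: str
    steps_reached: int  # annotated reasoning steps the model completed
    total_steps: int    # annotated reasoning steps in the reference trace


def normalize(answer: str) -> str:
    """Assumed normalization: uppercase, keep only letters and digits."""
    return "".join(ch for ch in answer.upper() if ch.isalnum())


def final_answer_accuracy(results: List[PuzzleResult]) -> float:
    """Fraction of puzzles whose normalized answer matches the gold answer."""
    if not results:
        return 0.0
    correct = sum(
        normalize(r.predicted_answer) == normalize(r.gold_answer)
        for r in results
    )
    return correct / len(results)


def stepwise_accuracy(results: List[PuzzleResult]) -> float:
    """Mean fraction of annotated steps completed (assumed definition)."""
    scored = [r for r in results if r.total_steps > 0]
    if not scored:
        return 0.0
    return sum(r.steps_reached / r.total_steps for r in scored) / len(scored)


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    PuzzleResult("Open Sesame", "OPENSESAME", steps_reached=4, total_steps=4),
    PuzzleResult("LANTERN", "LIGHTHOUSE", steps_reached=2, total_steps=4),
    PuzzleResult("", "COMPASS", steps_reached=0, total_steps=5),
]
```

Under this definition a model earns partial credit for progress through a puzzle even when its final answer is wrong, which is why stepwise accuracy can be well above final answer accuracy.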
Abstract
Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite progress in foundation models, their performance on open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic…
