PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

Hengzhi Li; Justin Zhang; Brendon Jiang; Alexander Naehu; Regan Song; Megan Tjandrasuwita; Chanakya Ekbote; Steven-Shine Chen; Adithya Balachandran; Wei Dai; Rebecca Chang; Paul Pu Liang

arXiv:2506.06211·cs.CL·April 22, 2026

TL;DR

PuzzleWorld is a benchmark of 667 multimodal, open-ended puzzlehunt-style problems for evaluating model reasoning. Results on it expose how far current models remain from solving such puzzles.

Contribution

The paper introduces PuzzleWorld, a benchmark in which each puzzle is annotated with its final solution, detailed reasoning traces, and cognitive skill labels, and shows that state-of-the-art models struggle on open-ended reasoning tasks.

Findings

01

Most models achieve only 1-4% final answer accuracy.

02

The best model solves 18% of puzzles and reaches 40% stepwise accuracy.

03

Fine-tuning on reasoning traces improves stepwise accuracy from 4% to 11%.
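
The findings above report two different metrics, final answer accuracy and stepwise accuracy. The sketch below illustrates one plausible way such metrics could be computed. It is not the authors' evaluation code: the `PuzzleResult` record, its field names, and the definition of stepwise accuracy as the mean fraction of annotated reasoning steps completed per puzzle are all assumptions made for illustration, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class PuzzleResult:
    """One model attempt at one puzzle (hypothetical record layout)."""
    total_steps: int       # number of annotated reasoning steps
    steps_completed: int   # annotated steps the model completed correctly
    final_correct: bool    # whether the final answer matched the solution


def final_answer_accuracy(results):
    """Fraction of puzzles whose final answer is correct."""
    return sum(r.final_correct for r in results) / len(results)


def stepwise_accuracy(results):
    """Mean per-puzzle fraction of annotated steps completed (assumed definition)."""
    return sum(r.steps_completed / r.total_steps for r in results) / len(results)


# Invented toy data: one solved puzzle, three unsolved with partial progress.
results = [
    PuzzleResult(total_steps=5, steps_completed=5, final_correct=True),
    PuzzleResult(total_steps=4, steps_completed=1, final_correct=False),
    PuzzleResult(total_steps=8, steps_completed=2, final_correct=False),
    PuzzleResult(total_steps=10, steps_completed=0, final_correct=False),
]

print(final_answer_accuracy(results))
print(stepwise_accuracy(results))
```

Under this assumed definition, stepwise accuracy can exceed final answer accuracy because a model earns partial credit for progress on puzzles it does not finish, which is consistent with the gap between the two figures reported above.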

Abstract

Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite progress in foundation models, their performance on open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic…

Code & Models

Repositories

MIT-MI/PuzzleWorld
github

Datasets

hzli1202/PuzzleWorld
dataset· 459 dl

Videos

PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts· slideslive