GuessBench: Sensemaking Multimodal Creativity in the Wild

Zifeng Zhu; Shangbin Feng; Herun Wan; Ningnan Wang; Minnan Luo; Yulia Tsvetkov

arXiv:2506.00814·cs.CL·June 9, 2025

Open Access · 4 Reviews

TL;DR

GuessBench is a new benchmark for evaluating vision-language models on their ability to understand and interpret human creativity in noisy, real-world scenarios, using data from a multiplayer Minecraft game.

Contribution

This paper introduces GuessBench, a challenging new benchmark for sensemaking creativity in vision-language models, with curated gameplay data and extensive evaluation of existing models.

Findings

01

GPT-4o achieves 66% accuracy on GuessBench

02

Fine-tuning improves visual perception tasks by 15.36%

03

Performance drops for concepts in underrepresented cultures and languages

Abstract

We propose GuessBench, a novel benchmark that evaluates Vision Language Models (VLMs) on modeling the pervasive, noisy, and pluralistic human creativity. GuessBench sources data from "Guess the Build", an online multiplayer Minecraft minigame where one player constructs a Minecraft build given a concept (e.g. caterpillar) and others try to guess it with natural language hints, presenting a pristine testbed for sensemaking creativity in the wild with VLMs acting as guessers. We curate 1500 images from the actual gameplay and design 2000 problems spanning static and dynamic image settings, natural language hints of varying completeness, and more. Extensive experiments with six open/API VLMs and five reasoning enhancement approaches demonstrate that GuessBench presents a uniquely challenging task in creativity modeling: even the start-of-the-art GPT-4o is incorrect on 34% of instances,…
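The guessing setup described in the abstract can be illustrated with a minimal sketch. The hint format used here (underscores for letters not yet revealed) and the helper names `matches_hint`, `filter_candidates`, and `accuracy` are assumptions made for illustration; they are not the benchmark's actual data format or scoring code.

```python
from typing import Iterable, List


def matches_hint(guess: str, hint: str, hidden: str = "_") -> bool:
    """Return True if `guess` is consistent with a partially revealed hint.

    Assumed format: `hint` has one character per letter of the answer,
    with `hidden` standing in for letters that are not yet revealed.
    """
    guess = guess.strip().lower()
    hint = hint.strip().lower()
    if len(guess) != len(hint):
        return False
    # Every revealed character must agree with the guess at that position.
    return all(h == hidden or h == g for g, h in zip(guess, hint))


def filter_candidates(candidates: Iterable[str], hint: str) -> List[str]:
    """Keep only the candidate answers that fit the hint."""
    return [c for c in candidates if matches_hint(c, hint)]


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Case-insensitive exact-match accuracy (an assumed scoring rule)."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and aligned")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    hint = "c_t______a_"  # hypothetical partial hint for "caterpillar"
    print(filter_candidates(["caterpillar", "centipede", "butterfly"], hint))
    print(accuracy(["Caterpillar", "worm"], ["caterpillar", "snake"]))
```

A guesser could use such a filter to discard candidates that contradict the revealed letters before ranking the remainder against the build image; whether the evaluated models do anything of this kind is not stated in the source.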

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- **Originality:** The paper targets sensemaking creativity "in the wild" using a social game where non-expert builders convey concepts imperfectly. This is a fresh angle: decoding noisy, pluralistic, personalized human creativity. The two-stage static/dynamic design requiring guesses under evolving visual + symbolic hints is well thought out. Figure 1 illustrates the task clearly with partial-letter hints and successive build images.
- **Data curation efforts:** The dataset is described with co

Weaknesses

- **What exactly is “creativity” here?** The paper motivates creativity in the wild and sensemaking creativity, but the operationalization reduces to guessing a noun/phrase under partial info. This aligns more naturally with VQA-style answer prediction and visual abductive reasoning under noisy inputs than with divergent creativity per se. To substantiate the “creativity” claim, please (i) define which creativity facets are intended (e.g., ambiguity resolution, imaginative reconstruction, robust

Reviewer 02 · Rating 2 · Confidence 4

Strengths

Clear motivation to probe compositional/creative reasoning under a constrained, reproducible environment. Evaluation breadth is good (3 commercial + 6 OSS models) with parameters disclosed—this is appreciated for reproducibility. Analysis sections (e.g., language variation, multiviews) are potentially useful diagnostics beyond a single leaderboard.

Weaknesses

1. Overclaiming “in-the-wild.” Calling a controlled video-game environment “in-the-wild” feels overstated. “In-game creative behavior” or “player-generated scenes” would be more accurate and avoids suggesting real-world capture. Please recalibrate the framing.
2. Scale and positioning. At ~1.5k images/~2k problems, the benchmark is small by current VLM standards. That’s not disqualifying, but it argues for positioning this as a diagnostic suite rather than a general benchmark, and for strong le

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Unique framing of creativity in the wild and sensemaking tasks.
- Rigorous evaluation across model families and reasoning modes.
- Thorough analysis of bias, difficulty, and transfer.
- Dataset and metrics well documented.

Weaknesses

- Motivation for Minecraft domain could be further developed. Why not drawn or rendered images?
- The human baseline setup could be described in more detail.
- Contribution feels a bit more incremental than conceptual.

Minor comment:

- Some related work that could be relevant: Villareale, Jennifer, et al. "INNk: A multi-player game to deceive a neural network." Extended Abstracts of the 2020 Annual Symposium on Computer-Human Interaction in Play. 2020.
- Minor typo: “start-of-the-art” → “state-

Reviewer 04 · Rating 6 · Confidence 3

Strengths

1. Novelty and Importance of the Task: The paper's core contribution is shifting the evaluation focus from "creative generation" to "creative understanding." This is a critical and overlooked area. As VLMs are increasingly integrated into collaborative tools, their ability to understand user intent and "half-baked" creative ideas from non-expert users is paramount.
2. Ecological Validity of the Data Source: Using a real and popular online game like Minecraft as a data source is a significant adv

Weaknesses

1. Scale of the Dataset: The core dataset consists of 500 carefully curated build sets, with 424 unique answers (see Table 1). While the authors have clearly prioritized quality and manual curation, this is a relatively small number for a benchmark. This small scale might limit the statistical significance of some findings.
2. Simulation of "Dynamic" Task: The authors acknowledge this limitation, but it is a key one. The "dynamic" task is simulated using three static images from the build proc

Taxonomy

Topics: Digital Storytelling and Education