TL;DR
This paper introduces TurtleSoup-Bench, a bilingual interactive benchmark, and Mosaic-Agent, a novel evaluation agent, to assess the imaginative reasoning capabilities of Large Language Models through dynamic, exploratory puzzles.
Contribution
It presents the first large-scale, bilingual benchmark and an agent specifically designed to evaluate LLMs' imaginative reasoning in a dynamic, hypothesis-driven environment.
Findings
LLMs show significant limitations in imaginative reasoning.
The paper identifies common failure patterns in LLMs' reasoning processes.
A performance gap remains between LLMs and human reasoning capabilities.
Abstract
We investigate the capacity of Large Language Models (LLMs) for imaginative reasoning: the proactive construction, testing, and revision of hypotheses in information-sparse environments. Existing benchmarks, often static or focused on social deduction, fail to capture the dynamic, exploratory nature of this reasoning process. To address this gap, we introduce a comprehensive research framework based on the classic "Turtle Soup" game, integrating a benchmark, an agent, and an evaluation protocol. We present TurtleSoup-Bench, the first large-scale, bilingual, interactive benchmark for imaginative reasoning, comprising 800 turtle soup puzzles sourced from both the Internet and expert authors. We also propose Mosaic-Agent, a novel agent designed to assess LLMs' performance in this setting. To evaluate reasoning quality, we develop a multi-dimensional protocol measuring logical consistency,…