CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
Mete Ismayilzada, Renqing Cuomao, Daniil Yurshevich, Anna Sotnikova, Lonneke van der Plas, Antoine Bosselut

TL;DR
CresOWLve is a new benchmark designed to evaluate large language models' ability to solve real-world, creative problems that require integrating diverse knowledge and thinking strategies.
Contribution
The paper introduces a realistic, knowledge-grounded benchmark for assessing creative problem-solving in LLMs, addressing limitations of earlier benchmarks built on artificial puzzles.
Findings
Models perform better on factual questions than on creative ones, with a performance drop of up to 17%.
Models can retrieve relevant knowledge but struggle with forming creative connections.
CresOWLve remains highly challenging for current LLMs.
Abstract
Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier…
