CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

Mete Ismayilzada; Renqing Cuomao; Daniil Yurshevich; Anna Sotnikova; Lonneke van der Plas; Antoine Bosselut

arXiv:2604.03374·cs.CL·April 7, 2026

TL;DR

CresOWLve is a new benchmark designed to evaluate large language models' ability to solve real-world, creative problems that require integrating diverse knowledge and thinking strategies.

Contribution

The paper introduces a realistic, knowledge-grounded benchmark for assessing creative problem-solving in LLMs, addressing the limitations of earlier benchmarks built on artificially constructed puzzles.

Findings

01 Models perform better on factual questions than on creative ones, with a performance drop of up to 17%.

02 Models can retrieve the relevant knowledge but struggle to form creative connections.

03 CresOWLve remains highly challenging for current LLMs.

Abstract

Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier…

Code & Models

Datasets

mismayil/cresowlve
dataset · 116 downloads