KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Ankit Maloo

arXiv:2604.15760·cs.AI·April 20, 2026

TL;DR

KWBench is a benchmark that evaluates whether large language models can recognize the professional problem embedded in raw inputs without being told what to look for, a capability the paper calls unprompted problem recognition in knowledge work.

Contribution

The paper introduces the first version of KWBench, a benchmark focused on unprompted problem recognition in LLMs, with tasks drawn from diverse professional domains and a scoring rubric that assesses whether a model understands the situation before it attempts a solution.

Findings

01 The top model passes 27.9% of tasks
02 Models agree on only 31.7% of passed tasks
03 Routing across top models covers 50.7% of tasks
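
The three figures above can be read as set operations over per-model pass results. The sketch below is illustrative only: the model names, task IDs, and task count are made-up toy values, and the agreement measure (Jaccard overlap of two models' passed-task sets) is an assumed definition, since this page does not state how the paper computes agreement or which models the routing figure spans.

```python
# Hypothetical pass results: model name -> set of task IDs the model passed.
# Toy values for illustration, not data from the paper.
passes = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5},
    "model_c": {6},
}
n_tasks = 10  # toy benchmark size


def pass_rate(passed, n_tasks):
    """Fraction of all tasks that a single model passed."""
    return len(passed) / n_tasks


def routing_coverage(passes, n_tasks):
    """Fraction of tasks passed by at least one model.

    This is the ceiling an ideal router could reach if it always
    sent each task to a model that passes it.
    """
    covered = set().union(*passes.values())
    return len(covered) / n_tasks


def pairwise_agreement(a, b):
    """Jaccard overlap of two passed-task sets (assumed definition)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


best = max(pass_rate(p, n_tasks) for p in passes.values())
coverage = routing_coverage(passes, n_tasks)
agreement = pairwise_agreement(passes["model_a"], passes["model_b"])
print(f"best={best:.1%} coverage={coverage:.1%} agreement={agreement:.1%}")
```

In this toy setup the routed coverage exceeds the best single pass rate because the models pass different tasks, which is the same pattern the findings report: low overlap between models leaves room for routing to cover more of the benchmark than any one model does.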

Abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording…

Code & Models

Repositories

ankitmaloo/fasteval (GitHub)
