TL;DR
KWBench is a new benchmark that evaluates whether large language models can recognize complex professional problems from raw data alone, without being told what kind of problem they face. Its focus is unprompted problem recognition in knowledge work.
Contribution
The paper introduces KWBench, which it presents as the first benchmark focused on unprompted problem recognition in LLMs. It offers tasks from diverse professional domains and a novel scoring rubric that assesses understanding before solving.
Findings
Top model passes 27.9% of tasks
Models agree on only 31.7% of passed tasks
Routing across top models covers 50.7% of tasks
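The pass-rate and routing figures above can be illustrated with a minimal sketch. It assumes that "routing coverage" means the share of tasks passed by at least one model in the pool (an oracle router); the summary does not define it, so that reading is an assumption. The function names, model names, task IDs, and toy numbers are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set


def pass_rate(passed: Set[str], total_tasks: int) -> float:
    """Fraction of all tasks that a single model passes."""
    return len(passed) / total_tasks


def routing_coverage(passed_by_model: Dict[str, Set[str]], total_tasks: int) -> float:
    """Fraction of tasks passed by at least one model.

    Assumption: an ideal router always picks a model that passes the task,
    so coverage is the size of the union of the per-model passed sets.
    """
    covered = set().union(*passed_by_model.values())
    return len(covered) / total_tasks


# Hypothetical toy data: 8 tasks, 3 models (not the paper's results).
toy = {
    "model_a": {"t1", "t2"},
    "model_b": {"t2", "t3"},
    "model_c": {"t4"},
}
TOTAL = 8

best = max(pass_rate(passed, TOTAL) for passed in toy.values())
coverage = routing_coverage(toy, TOTAL)
print(f"best single model: {best:.1%}, routed coverage: {coverage:.1%}")
```

Under this reading, coverage can exceed the best single-model pass rate only when models pass different tasks, which fits the low agreement reported above.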
Abstract
We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording…
