Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning
Thatchawin Leelawat, Lewis D Griffin

TL;DR
This paper introduces Watson & Holmes, a naturalistic reasoning benchmark adapted from a detective tabletop game, to evaluate AI reasoning in realistic narrative contexts. Results show that AI model performance relative to a human comparison group improved markedly over nine months of 2025.
Contribution
The paper presents a scalable benchmark, adapted from a detective tabletop game, with an automated grading system validated against human assessors, enabling comparison of human and AI reasoning in naturalistic scenarios.
Findings
AI model performance rose from the lower quartile of the human comparison group to approximately the top 5% over nine months of 2025.
A decline in performance was observed on longer cases (1900-4000 words).
Reasoning models excelled at inductive reasoning in the early stages of a case, when evidence was limited.
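To make the "lower quartile" and "top 5%" comparisons concrete, the following is a minimal sketch of how a model's score could be placed within a human score distribution as a percentile rank. It is not the paper's procedure: the function name `percentile_rank`, the tie-handling rule, and the score values are all assumptions chosen for illustration.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(human_scores, model_score):
    """Percentage of the human group scoring below the model.

    Ties are counted as half (an assumed convention, not from the paper).
    """
    ordered = sorted(human_scores)
    if not ordered:
        raise ValueError("human_scores must not be empty")
    below = bisect_left(ordered, model_score)            # humans strictly below
    ties = bisect_right(ordered, model_score) - below    # humans with equal score
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical data for illustration only, not taken from the paper.
human_scores = list(range(1, 101))  # 100 made-up human scores: 1, 2, ..., 100

early_model = percentile_rank(human_scores, 20)  # falls in the lower quartile
late_model = percentile_rank(human_scores, 96)   # falls in the top 5%
print(early_model, late_model)
```

Under this convention, a percentile rank below 25 corresponds to the lower quartile of the comparison group, and a rank of 95 or above corresponds to the top 5%.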
Abstract
Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson & Holmes detective tabletop game as a new benchmark designed to evaluate reasoning performance using incrementally presented narrative evidence, open-ended questions and unconstrained language responses. An automated grading system was developed and validated against human assessors to enable scalable and replicable performance evaluation. Results show a clear improvement in AI model performance over time. Over nine months of 2025, model performance rose from the lower quartile of the human comparison group to approximately the top 5%. Around half of this improvement reflects steady advancement across successive model releases, while the remainder corresponds to a marked step change associated with…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Games · Multimodal Machine Learning Applications
