Watson & Holmes: A Naturalistic Benchmark for Comparing Human and LLM Reasoning

Thatchawin Leelawat; Lewis D Griffin

arXiv:2602.19914 · cs.AI · February 24, 2026

Open Access

TL;DR

This paper introduces Watson & Holmes, a naturalistic reasoning benchmark adapted from a detective tabletop game, to evaluate AI reasoning in realistic narrative contexts. Over nine months of 2025, model performance rose from the lower quartile of the human comparison group to approximately the top 5%.

Contribution

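The paper presents a scalable benchmark adapted from the Watson & Holmes detective tabletop game, using incrementally presented narrative evidence, open-ended questions and unconstrained language responses. An automated grading system, validated against human assessors, enables comparison of human and AI reasoning in naturalistic scenarios.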

Findings

01. AI model performance rose from the lower quartile of the human comparison group to approximately the top 5% over nine months of 2025.

02. Performance declined on longer cases (1900-4000 words).

03. Early-stage reasoning models excelled at inductive reasoning with limited evidence.

Abstract

Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson & Holmes detective tabletop game as a new benchmark designed to evaluate reasoning performance using incrementally presented narrative evidence, open-ended questions and unconstrained language responses. An automated grading system was developed and validated against human assessors to enable scalable and replicable performance evaluation. Results show a clear improvement in AI model performance over time. Over nine months of 2025, model performance rose from the lower quartile of the human comparison group to approximately the top 5%. Around half of this improvement reflects steady advancement across successive model releases, while the remainder corresponds to a marked step change associated with…
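The comparison against the human group can be illustrated with a percentile-rank calculation. The following is a minimal sketch, not the authors' procedure: the function name `percentile_rank`, the tie-handling rule, and every score shown are assumptions invented for illustration, since this summary does not describe how scores were computed or compared.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(human_scores, model_score):
    """Return the percentage of human scores below model_score.

    Ties count as half, a common convention (assumed here, not
    taken from the paper).
    """
    if not human_scores:
        raise ValueError("human_scores must not be empty")
    ordered = sorted(human_scores)
    below = bisect_left(ordered, model_score)
    ties = bisect_right(ordered, model_score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical scores, invented purely for illustration.
human_scores = [12, 18, 25, 31, 40, 44, 52, 60, 67, 75]

# A model scoring 20 beats 2 of 10 humans, placing it in the lower quartile.
print(percentile_rank(human_scores, 20))  # 20.0

# A model scoring above every human sits at the top of the group.
print(percentile_rank(human_scores, 80))  # 100.0
```

Under this convention, "lower quartile" corresponds to a rank below 25 and "top 5%" to a rank of about 95 or higher.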

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Games · Multimodal Machine Learning Applications