WHODUNIT: Evaluation benchmark for culprit detection in mystery stories
Kshitij Gupta

TL;DR
This paper introduces WhoDunIt, a new dataset for testing large language models' deductive reasoning in mystery stories, examining how name variations and prompting styles affect performance.
Contribution
It provides a novel dataset and evaluation framework for assessing LLM deductive reasoning in narrative contexts, including robustness to name substitutions and prompting strategies.
Findings
LLMs perform well on unaltered texts
Accuracy drops with certain name substitutions
Prompting style influences reasoning accuracy
Abstract
We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on…
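The abstract mentions two mechanisms that are easy to picture in code: swapping character names in a story, and picking the majority response over several trials. The following is a minimal sketch of both, not the authors' implementation. The function names `swap_names` and `majority_answer`, the whole-word matching rule, and the lowercase normalization of answers are all assumptions made for illustration.

```python
import re
from collections import Counter


def swap_names(text, mapping):
    """Replace character names in a single pass (hypothetical helper).

    A single regex pass means a two-way swap (A -> B, B -> A) does not
    overwrite itself, which chained str.replace calls would do.
    """
    if not mapping:
        return text
    # Try longer names first so a full name wins over its shorter prefix.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def majority_answer(answers):
    """Return the most frequent answer across trials (hypothetical helper).

    Answers are stripped and lowercased before counting (an assumption).
    On a tie, the answer that appeared first is returned.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]


story = "Alice blamed Bob for the theft."
swapped = swap_names(story, {"Alice": "Bob", "Bob": "Alice"})
culprit = majority_answer(["Bob", " bob ", "Alice"])
```

Here `swapped` holds the story with the two names exchanged, and `culprit` is the answer given in two of the three simulated trials. A real evaluation would replace the hard-coded answer list with model responses.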
Taxonomy
Topics: Advanced Malware Detection Techniques · Gothic Literature and Media Analysis
