WHODUNIT: Evaluation benchmark for culprit detection in mystery stories

Kshitij Gupta

arXiv:2502.07747 · cs.CL · February 12, 2025

TL;DR

This paper introduces WhoDunIt, a new dataset for testing large language models' deductive reasoning in mystery stories, examining how name variations and prompting styles affect performance.

Contribution

It provides a novel dataset and evaluation framework for assessing LLM deductive reasoning in narrative contexts, including robustness to name substitutions and prompting strategies.

Findings

01. LLMs perform well on unaltered texts
02. Accuracy drops with certain name substitutions
03. Prompting style influences reasoning accuracy

Abstract

We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open-domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, evaluated through multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on…
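The two mechanical steps named in the abstract, name substitution and majority response selection over repeated trials, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `substitute_names` and `majority_answer`, the example sentence, and the lowercase normalisation of answers are all assumptions made for illustration.

```python
import re
from collections import Counter


def substitute_names(text, mapping):
    """Replace whole-word character names in a single pass.

    A single regex pass (rather than chained str.replace calls) is what
    lets a swap such as A -> B and B -> A work without clobbering itself.
    """
    if not mapping:
        return text
    # Try longer names first so a full name wins over a shorter prefix.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def majority_answer(answers):
    """Return the most frequent non-empty answer (hypothetical normalisation:
    strip whitespace and lowercase), or None if there are no answers."""
    normalised = [a.strip().lower() for a in answers if a.strip()]
    if not normalised:
        return None
    return Counter(normalised).most_common(1)[0][0]


# Illustrative input only; this sentence is not taken from the dataset.
story = "Holmes suspected Moriarty, but Moriarty blamed Holmes."
swapped = substitute_names(story, {"Holmes": "Moriarty", "Moriarty": "Holmes"})
print(swapped)  # Moriarty suspected Holmes, but Holmes blamed Moriarty.

# Pretend these are the culprit names returned by three independent trials.
trials = ["Moriarty", "moriarty ", "Holmes"]
print(majority_answer(trials))  # moriarty
```

In a real evaluation the `trials` list would hold the model's answers to the same prompt across repeated runs, and the majority answer would be compared against the labelled culprit after applying the same name mapping to the label.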

Code & Models

Repositories

kjgpta/WhoDunIt-Evaluation_benchmark_for_culprit_detection_in_mystery_stories
none · Official

Datasets

kjgpta/WhoDunIt
dataset · 363 dl

Taxonomy

Topics: Advanced Malware Detection Techniques · Gothic Literature and Media Analysis