True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4
Maksym Del, Mark Fishel

TL;DR
This paper introduces a challenging benchmark of long-form detective puzzles to evaluate the deep reasoning capabilities of large language models, revealing current models' significant limitations compared to human performance.
Contribution
It presents a novel, difficult reasoning benchmark with long narratives, demonstrating the current limitations of GPT-3 and GPT-4 in solving complex puzzles.
Findings
GPT-3 barely outperforms random chance with 28% accuracy
GPT-4 achieves only 38% accuracy, below the 47% average human success rate and far below the 80%+ reached by the best human solvers
The benchmark exposes significant gaps in LLMs' reasoning abilities
Abstract
Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities, which is reflected in their performance on the current test tasks. This calls for a more challenging benchmark requiring highly advanced reasoning ability to be solved. In this paper, we introduce such a benchmark, consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles. Puzzles are sourced from the "5 Minute Mystery" platform and include a multiple-choice question for evaluation. Only 47% of humans solve a puzzle successfully on average, while the best human solvers achieve over 80% success rate. We show that GPT-3 models barely outperform random on this benchmark (with 28% accuracy) while state-of-the-art GPT-4 solves only 38% of puzzles. This indicates that there is still a significant gap in the deep reasoning abilities of LLMs and humans and…
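The accuracy figures above are easier to interpret next to a chance baseline. The following is a minimal sketch of how multiple-choice accuracy and the expected accuracy of uniform random guessing could be computed. The `Puzzle` record, its field names, and the toy data are assumptions made for illustration; they are not the benchmark's actual data format or the authors' evaluation code, and the number of answer options per puzzle is not stated in this summary.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Puzzle:
    """Hypothetical record for one multiple-choice puzzle (not the real schema)."""
    num_options: int    # how many answer candidates the puzzle offers
    correct_index: int  # 0-based index of the correct answer


def accuracy(puzzles: Sequence[Puzzle], predictions: Sequence[int]) -> float:
    """Fraction of puzzles where the predicted option index is correct."""
    if not puzzles:
        raise ValueError("need at least one puzzle")
    if len(puzzles) != len(predictions):
        raise ValueError("one prediction per puzzle is required")
    hits = sum(1 for p, pred in zip(puzzles, predictions) if pred == p.correct_index)
    return hits / len(puzzles)


def random_baseline(puzzles: Sequence[Puzzle]) -> float:
    """Expected accuracy of guessing uniformly at random on each puzzle."""
    if not puzzles:
        raise ValueError("need at least one puzzle")
    return sum(1.0 / p.num_options for p in puzzles) / len(puzzles)


# Toy data, invented purely for illustration.
toy_puzzles = [Puzzle(4, 0), Puzzle(4, 1), Puzzle(5, 2), Puzzle(2, 1)]
toy_predictions = [0, 3, 2, 0]

toy_accuracy = accuracy(toy_puzzles, toy_predictions)
toy_baseline = random_baseline(toy_puzzles)
print(f"accuracy={toy_accuracy:.2f}, random baseline={toy_baseline:.2f}")
```

Averaging 1/num_options per puzzle, rather than using a single fixed value, keeps the baseline meaningful if puzzles offer different numbers of answer candidates.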
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
Methods: Multi-Head Attention · Attention Is All You Need · GPT-3 · Cosine Annealing · Linear Layer · Layer Normalization · Adam
