True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4

Maksym Del; Mark Fishel

arXiv:2212.10114·cs.CL·June 5, 2023

TL;DR

This paper introduces a challenging benchmark of long-form detective puzzles to evaluate the deep reasoning capabilities of large language models, revealing current models' significant limitations compared to human performance.

Contribution

It presents a difficult reasoning benchmark of 191 long-form detective puzzles (about 1200 words each on average) and shows that GPT-3 and GPT-4 solve only a minority of them.

Findings

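1. GPT-3 barely outperforms random chance, with 28% accuracy.

2. GPT-4 solves only 38% of puzzles, below the 47% average human success rate and far below the 80%+ success rate of the best human solvers.

3. The benchmark exposes a significant gap between the deep reasoning abilities of LLMs and humans.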

Abstract

Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities, which is reflected in their performance on the current test tasks. This calls for a more challenging benchmark requiring highly advanced reasoning ability to be solved. In this paper, we introduce such a benchmark, consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles. Puzzles are sourced from the "5 Minute Mystery" platform and include a multiple-choice question for evaluation. Only 47% of humans solve a puzzle successfully on average, while the best human solvers achieve over 80% success rate. We show that GPT-3 models barely outperform random on this benchmark (with 28% accuracy) while state-of-the-art GPT-4 solves only 38% of puzzles. This indicates that there is still a significant gap in the deep reasoning abilities of LLMs and humans and…
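The evaluation described above scores a multiple-choice answer per puzzle and compares accuracy against random guessing. A minimal sketch of that comparison follows. It is not the authors' evaluation code: the `Puzzle` class, its field names, and the toy data are hypothetical, and the number of answer options per puzzle is an assumption, since the abstract does not state it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Puzzle:
    """Hypothetical record for one multiple-choice puzzle."""
    num_options: int  # number of answer choices (assumed, not from the paper)
    answer: int       # index of the correct choice


def accuracy(puzzles: Sequence[Puzzle], predictions: Sequence[int]) -> float:
    """Fraction of puzzles whose predicted choice index matches the answer."""
    if not puzzles or len(puzzles) != len(predictions):
        raise ValueError("need one prediction per puzzle, and at least one puzzle")
    correct = sum(p.answer == pred for p, pred in zip(puzzles, predictions))
    return correct / len(puzzles)


def random_baseline(puzzles: Sequence[Puzzle]) -> float:
    """Expected accuracy of guessing uniformly among each puzzle's options."""
    if not puzzles:
        raise ValueError("need at least one puzzle")
    return sum(1.0 / p.num_options for p in puzzles) / len(puzzles)


# Toy data for illustration only; these are not benchmark puzzles or results.
toy_puzzles = [Puzzle(4, 0), Puzzle(5, 2), Puzzle(4, 3), Puzzle(5, 1)]
toy_predictions = [0, 2, 1, 4]

toy_accuracy = accuracy(toy_puzzles, toy_predictions)
toy_baseline = random_baseline(toy_puzzles)
```

Averaging `1 / num_options` per puzzle, rather than assuming one fixed option count, keeps the baseline valid if puzzles offer different numbers of choices. A model is only informative on such a benchmark to the extent that its accuracy exceeds this baseline.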

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education

Methods: Multi-Head Attention · Attention Is All You Need · GPT-3 · Cosine Annealing · Linear Layer · Layer Normalization · Adam