Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models

Philipp Mondorf; Barbara Plank

arXiv:2406.12546 · cs.CL · October 10, 2024

TL;DR

This paper introduces the TruthQuest benchmark to evaluate large language models' ability to perform suppositional reasoning using knights and knaves puzzles, revealing significant challenges and reasoning errors in current models.

Contribution

The paper presents a new benchmark, TruthQuest, for assessing suppositional reasoning in large language models, highlighting their difficulties and error patterns.

Findings

01. Large language models struggle with knights and knaves puzzles.

02. Lower-performing models often fail to understand truth and lies.

03. Proficient models mainly struggle with logical implications of false statements.

Abstract

Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character's identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce TruthQuest, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical…
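
To make the idea of reasoning through hypothetical scenarios concrete, here is a minimal Python sketch of a brute-force knights and knaves solver. It is an illustration only, not the benchmark's generation or evaluation code: the `solve` function, the `Assignment` and `Statement` type aliases, and the two-character puzzle are all hypothetical. Each candidate world supposes an identity for every character and keeps the world only if every statement is true exactly when its speaker is a knight.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical types for this sketch (not taken from the paper or its repo).
Assignment = Dict[str, bool]  # True = knight (truth-teller), False = knave (liar)
Statement = Tuple[str, Callable[[Assignment], bool]]  # (speaker, claim about a world)


def solve(characters: List[str], statements: List[Statement]) -> List[Assignment]:
    """Return every identity assignment consistent with all statements."""
    solutions = []
    # Suppose each possible world in turn: 2 ** len(characters) candidates.
    for values in product([True, False], repeat=len(characters)):
        world = dict(zip(characters, values))
        # A knight's claim must be true in this world; a knave's must be false.
        if all(claim(world) == world[speaker] for speaker, claim in statements):
            solutions.append(world)
    return solutions


# Hypothetical two-character puzzle, made up for illustration.
puzzle = [
    ("A", lambda w: not w["B"]),        # A: "B is a knave."
    ("B", lambda w: w["A"] == w["B"]),  # B: "A and I are of the same kind."
]

solutions = solve(["A", "B"], puzzle)
print(solutions)
```

For this puzzle the only consistent world makes A a knight and B a knave: supposing B is a knight would require A and B to be of the same kind, but a knightly A could not then truthfully call B a knave. Exhaustive enumeration is feasible only for small puzzles, since the number of worlds doubles with each added character.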

Code & Models

Repositories

mainlp/TruthQuest (Official)

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques

Methods: LLaMA