Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models
Philipp Mondorf, Barbara Plank

TL;DR
This paper introduces the TruthQuest benchmark to evaluate large language models' ability to perform suppositional reasoning using knights and knaves puzzles, revealing significant challenges and reasoning errors in current models.
Contribution
The paper presents a new benchmark, TruthQuest, for assessing suppositional reasoning in large language models, highlighting their difficulties and error patterns.
Findings
Large language models struggle with knights and knaves puzzles.
Lower-performing models often fail to understand truth and lies.
Proficient models mainly struggle with logical implications of false statements.
Abstract
Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character's identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but also the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce TruthQuest, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical…
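To make the idea of suppositional reasoning concrete, here is a minimal brute-force sketch of how a knights and knaves puzzle can be checked. It is not the benchmark's own code: the `solve` function, the two-character puzzle, and the encoding of statements as Python functions are all assumptions made for illustration. The sketch supposes each possible assignment of identities in turn and keeps only those in which every knight's statement is true and every knave's statement is false.

```python
from itertools import product


def solve(names, statements):
    """Return every knight/knave assignment consistent with all statements.

    names: list of character names.
    statements: dict mapping a speaker to a function that takes a world
        (name -> True for knight, False for knave) and returns whether
        the speaker's claim holds in that world.
    """
    solutions = []
    # Suppose each possible combination of identities in turn.
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A knight's claim must be true and a knave's claim must be false,
        # so the claim's truth value has to equal the speaker's identity.
        if all(world[s] == claim(world) for s, claim in statements.items()):
            solutions.append(world)
    return solutions


# Hypothetical example puzzle (not taken from the benchmark):
#   A says: "B is a knave."
#   B says: "A and I are both knights."
names = ["A", "B"]
statements = {
    "A": lambda w: not w["B"],
    "B": lambda w: w["A"] and w["B"],
}

solutions = solve(names, statements)
for world in solutions:
    print({n: "knight" if is_knight else "knave" for n, is_knight in world.items()})
```

For this puzzle only one assignment survives: A is a knight and B is a knave. The key step is the handling of a false statement: supposing B is a knave means B's claim must be false, which is the kind of implication the Findings section says even proficient models tend to get wrong.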
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: LLaMA
