Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM Reasoning

Andrew Keenan Richardson; Ryan Othniel Kearns; Sean Moss; Vincent Wang-Mascianica; Philipp Koralus

arXiv:2506.11128 · cs.CL · March 24, 2026

Open Access · 3 Reviews

TL;DR

This paper uses the Erotetic Theory of Reasoning (ETR) to test whether language models' reasoning errors follow human fallacy patterns, evaluating 38 models on 383 programmatically generated problems through a theory-grounded testing pipeline.

Contribution

It introduces PyETR, an open-source framework for generating and analyzing reasoning problems based on a cognitive theory, linking model errors to human fallacy patterns.

Findings

01

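As model capability (Chatbot Arena Elo) increases, a larger share of a model's incorrect answers are ETR-predicted fallacies, while overall correctness shows no correlation with capability.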

02

Reversing premise order significantly reduces fallacy production for many models, mirroring human order effects.

03

PyETR enables contamination-resistant, theory-based reasoning analysis.

Abstract

We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open-source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR-predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model's incorrect answers are ETR-predicted fallacies ($\rho = 0.360$, $p = 0.0265$), while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects. Methodologically, PyETR provides an open-source pipeline for unbounded, synthetic,…
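The two quantities in result (i) can be illustrated with a minimal sketch: a per-model fallacy share (the fraction of incorrect answers that match an ETR-predicted fallacy) and a rank correlation between that share and a capability score. This is not PyETR code and not the authors' analysis script. The `Judged` record, the function names, and the toy numbers are all hypothetical, and the choice of Spearman's rank correlation for $\rho$ is an assumption, since the abstract does not name the statistic.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    """One judged model response (hypothetical record layout)."""
    correct: bool        # the response is logically correct
    etr_predicted: bool  # an incorrect response matches the ETR-predicted fallacy


def fallacy_share(responses):
    """Fraction of incorrect responses that match an ETR-predicted fallacy."""
    wrong = [r for r in responses if not r.correct]
    if not wrong:
        return None  # undefined when the model made no errors
    return sum(r.etr_predicted for r in wrong) / len(wrong)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy, made-up inputs for illustration only; these are NOT the paper's data.
toy_elo = [1100, 1200, 1300]
toy_judgements = [
    [Judged(False, False), Judged(False, False), Judged(True, False)],
    [Judged(False, True), Judged(False, False), Judged(True, False)],
    [Judged(False, True), Judged(False, True), Judged(True, False)],
]
toy_shares = [fallacy_share(js) for js in toy_judgements]
toy_rho = spearman_rho(toy_elo, toy_shares)
```

Restricting the denominator to incorrect answers is what keeps this measure separate from overall accuracy: a model can become more fallacy-like in its errors without getting more or fewer problems right. A significance test for $\rho$ (the reported $p$-value) is outside this sketch.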

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. I resonate with evaluating the human-likeness of LLM reasoning: LLMs are trained on trillions of human-written tokens, so human cognitive biases should be expected to show in their generations. In that sense I find the paper quite interesting.
2. The paper also raises the research question of the predictability of errors, and shows that more capable models are more predictable as well.
3. The authors have done a good job of addressing a bigger audience than the ones

Weaknesses

1. My first and major concern is the scope of the work. I think Section 4 needs more insights. I feel the scope is too limited in its current form.
2. I think the paper could benefit from some anecdotal analysis, i.e., if the authors could give more insights of the form "the reasoning pattern in model X is like Y, and hence …" instead of only reporting a high-level correlation metric.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. This paper introduces a mature method for evaluating human fallacies, which has a solid theoretical foundation and provides good reproducibility based on the open-source PyETR implementation.
2. Extensive experimental validation was conducted on a wide range of models (of different sizes and architectures), and the results are quite comprehensive.
3. Insightful experimental designs explored how human thought patterns manifest in large-scale models.

Weaknesses

1. The moderately low correlation found in the experiments weakens the claims put forward in the paper.
2. The paper lacks theoretical and principled analysis, focusing instead on assessing the similarity between LLMs and humans in reasoning errors and presenting observed experimental results. It does not explore potential causes for this phenomenon or propose solutions to the problem, resulting in limited practical applicability.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

I find the results and analyses to be sound, interesting, and worth sharing.
* interesting and important result about how models interpret disjunctive normal forms in ways similar to humans
* the use of the ETR model is novel and the analyses are sound

I also appreciate that the authors have addressed my previous concerns, copied here:
* The Chatbot Arena may not be the best proxy for "strong" models, given that human preferences on the arena may be prone to surface features like formatting (w

Weaknesses

No substantial concerns

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI)