Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

Yanjie He

arXiv:2604.10511·cs.AI·April 14, 2026

TL;DR

This study evaluates large language models on policy evaluation tasks and finds that their reasoning is strongly shaped by how intuitive the empirical finding is: they struggle with counter-intuitive cases even when they possess the relevant knowledge.

Contribution

The paper introduces a benchmark of 40 policy evaluation cases and analyzes LLM performance, highlighting the influence of intuitiveness and the limitations of current reasoning capabilities.

Findings

01  Chain-of-thought prompting improves performance on obvious cases, but the benefit nearly disappears on counter-intuitive cases

02  Intuitiveness explains more variance in performance than the choice of model or prompting strategy

03  Models' familiarity with the relevant knowledge does not correlate with accuracy on counter-intuitive cases

Abstract

Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness -- whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 2,400 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is nearly eliminated on counter-intuitive ones (interaction…
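The counts in the abstract are consistent with a full factorial design: 4 models × 5 prompting strategies × 40 cases gives 800 cells, and 2,400 trials would then correspond to 3 runs per cell. The sketch below checks that arithmetic. It assumes a balanced design, which the abstract does not state, and the model, strategy, and case labels are hypothetical placeholders because the abstract gives only the counts.

```python
from itertools import product

# Placeholder labels (hypothetical): the abstract gives counts, not names.
models = [f"model_{i}" for i in range(1, 5)]         # 4 frontier LLMs
strategies = [f"strategy_{i}" for i in range(1, 6)]  # 5 prompting strategies
cases = [f"case_{i:02d}" for i in range(1, 41)]      # 40 policy evaluation cases

total_trials = 2400  # figure stated in the abstract

# Every (model, strategy, case) combination is one design cell.
cells = list(product(models, strategies, cases))

# Assumption: a balanced design with the same number of runs in every cell.
runs_per_cell, remainder = divmod(total_trials, len(cells))

# Expand the cells into individual trials, one row per run.
trials = [(m, s, c, run) for (m, s, c) in cells for run in range(runs_per_cell)]

print(len(cells), runs_per_cell, remainder, len(trials))  # 800 3 0 2400
```

A zero remainder only shows that a balanced design is possible with these counts; the actual allocation of trials across cells is described in the paper itself.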
