Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation
Yanjie He

TL;DR
This study evaluates how large language models perform on policy evaluation tasks. It finds that their reasoning is heavily shaped by how intuitive the empirical finding is, and that they struggle with counter-intuitive cases despite possessing the relevant knowledge.
Contribution
The paper introduces a benchmark of 40 empirical policy evaluation cases, each classified by intuitiveness, and evaluates four frontier LLMs across five prompting strategies, showing how intuitiveness shapes performance and where current reasoning capabilities fall short.
Findings
Chain-of-thought prompting dramatically improves performance on obvious cases, but the benefit is nearly eliminated on counter-intuitive cases
Intuitiveness explains more variance in performance than model or prompting strategy
Models' familiarity with the relevant knowledge does not correlate with their accuracy on counter-intuitive cases
Abstract
Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness -- whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 2,400 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is nearly eliminated on counter-intuitive ones (interaction…
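The design counts stated in the abstract can be checked with a short sketch. It enumerates a fully crossed grid of 40 cases, four models, and five prompting strategies, then divides the 2,400 trials across the cells. The labels are hypothetical placeholders, since the abstract does not name the models or strategies. The assumption that trials are spread evenly over a fully crossed design is an inference from the totals, not something the abstract states.

```python
from itertools import product

# Counts taken from the abstract.
N_CASES = 40
N_MODELS = 4
N_STRATEGIES = 5
TOTAL_TRIALS = 2400

# Hypothetical placeholder labels: the abstract does not name models or strategies.
cases = [f"case_{i:02d}" for i in range(1, N_CASES + 1)]
models = [f"model_{i}" for i in range(1, N_MODELS + 1)]
strategies = [f"strategy_{i}" for i in range(1, N_STRATEGIES + 1)]

# Assumption: every case is run with every model and every strategy.
cells = list(product(cases, models, strategies))

# Assumption: trials are distributed evenly over the cells.
repeats, remainder = divmod(TOTAL_TRIALS, len(cells))

# One record per trial, in the long format a trial-level regression would use.
trials = [
    {"case": c, "model": m, "strategy": s, "repeat": r}
    for (c, m, s) in cells
    for r in range(repeats)
]

print(f"cells={len(cells)} repeats_per_cell={repeats} remainder={remainder}")
print(f"total trials={len(trials)}")
```

Under these assumptions the grid has 800 cells and the totals are consistent with three trials per cell. A long-format table like `trials`, with one binary correctness outcome per row, is the usual input shape for a mixed-effects logistic regression.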
