Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

Jin Du; Li Chen; Xun Xian; An Luo; Fangqiao Tian; Ganghua Wang; Charles Doss; Xiaotong Shen; Jie Ding

arXiv:2505.13770 · cs.AI · May 13, 2026

TL;DR

This paper introduces CausalPitfalls, a benchmark that tests whether large language models can avoid common statistical pitfalls in causal inference, such as Simpson's paradox and selection bias, and it finds that current models have significant limitations.

Contribution

It presents a comprehensive, multi-level benchmark with grading rubrics and two evaluation protocols to rigorously assess LLMs' causal reasoning and reliability.

Findings

01 Current LLMs show significant limitations in causal inference tasks.

02 The benchmark enables quantitative measurement of causal reasoning capabilities.

03 Code-assisted prompting improves model performance in explicit statistical analysis.

Abstract

Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired…
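
To make the Simpson's paradox pitfall named in the abstract concrete, here is a minimal sketch in Python. The counts are the widely used kidney-stone textbook example, chosen purely for illustration; they are not data from the paper or from the CausalPitfalls benchmark, and the helper names (`pooled_rate`, `stratum_rates`) are hypothetical. The sketch shows how a treatment can look worse in the pooled data while being better within every stratum.

```python
# Illustrative counts only (classic kidney-stone textbook example),
# NOT taken from the paper or its benchmark.
# Layout: stratum -> treatment -> (successes, trials)
counts = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}


def pooled_rate(treatment):
    """Success rate after collapsing over strata (ignores the confounder)."""
    successes = sum(counts[s][treatment][0] for s in counts)
    trials = sum(counts[s][treatment][1] for s in counts)
    return successes / trials


def stratum_rates(treatment):
    """Success rate computed separately within each stratum."""
    return {s: counts[s][treatment][0] / counts[s][treatment][1] for s in counts}


pooled = {t: pooled_rate(t) for t in ("A", "B")}
by_stratum = {t: stratum_rates(t) for t in ("A", "B")}

# The paradox: A is better in every stratum, yet B looks better when pooled.
a_wins_every_stratum = all(
    by_stratum["A"][s] > by_stratum["B"][s] for s in counts
)
b_wins_pooled = pooled["B"] > pooled["A"]

for s in counts:
    print(f"{s}: A={by_stratum['A'][s]:.3f}  B={by_stratum['B'][s]:.3f}")
print(f"pooled: A={pooled['A']:.3f}  B={pooled['B']:.3f}")
print("A wins in every stratum:", a_wins_every_stratum)
print("B wins in pooled data:", b_wins_pooled)
```

The reversal arises because treatment A was applied mostly to the harder (large-stone) cases, so the stratum variable confounds the pooled comparison. An analysis that draws conclusions directly from the aggregated table would rank the treatments in the wrong order.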

Videos

Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference (SlidesLive)