TL;DR
This paper introduces CausalPitfalls, a benchmark that evaluates how well large language models (LLMs) handle common pitfalls in statistical causal inference, and uses it to expose the limitations of current models.
Contribution
It presents a comprehensive, multi-level benchmark with grading rubrics and two evaluation protocols to rigorously assess LLMs' causal reasoning and reliability.
Findings
Current LLMs show significant limitations in causal inference tasks.
The benchmark enables quantitative measurement of causal reasoning capabilities.
Code-assisted prompting improves model performance in explicit statistical analysis.
Abstract
Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired…
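To make one of the pitfalls named in the abstract concrete, the following is a minimal sketch of Simpson's paradox. The counts are illustrative and are not taken from the paper or its benchmark; the names `counts`, `stratum_rates`, and `pooled_rates` are hypothetical helpers introduced only for this example. It shows how a treatment can look better within every subgroup yet worse once the subgroups are pooled, which is the kind of trap a model reading only aggregate data could fall into.

```python
from fractions import Fraction

# Illustrative (successes, trials) counts per treatment and stratum.
# These numbers are an assumption for demonstration, not data from the paper.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def stratum_rates(table):
    """Success rate of each treatment within each stratum."""
    return {
        treatment: {name: Fraction(s, n) for name, (s, n) in strata.items()}
        for treatment, strata in table.items()
    }


def pooled_rates(table):
    """Success rate of each treatment after ignoring the stratum variable."""
    pooled = {}
    for treatment, strata in table.items():
        successes = sum(s for s, _ in strata.values())
        trials = sum(n for _, n in strata.values())
        pooled[treatment] = Fraction(successes, trials)
    return pooled


per_stratum = stratum_rates(counts)
pooled = pooled_rates(counts)

# Treatment A has the higher success rate in every stratum...
a_wins_every_stratum = all(
    per_stratum["A"][name] > per_stratum["B"][name] for name in counts["A"]
)
# ...yet treatment B has the higher rate once the strata are pooled.
b_wins_pooled = pooled["B"] > pooled["A"]

for treatment in counts:
    for name, r in per_stratum[treatment].items():
        print(f"{treatment} / {name}: {float(r):.3f}")
    print(f"{treatment} / pooled: {float(pooled[treatment]):.3f}")
print("Reversal present:", a_wins_every_stratum and b_wins_pooled)
```

The reversal arises because the two treatments are applied unevenly across strata, so the pooled comparison mixes the treatment effect with the effect of the stratum variable. A causally sound analysis would condition on that variable rather than rely on the aggregate rates.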
