CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
Ayush Sawarni, Jiyuan Tan, Vasilis Syrgkanis

TL;DR
CausalReasoningBenchmark is a real-world benchmark that scores causal identification and estimation separately, so failures in causal inference systems can be traced to one step or the other, and it highlights where current systems fall short.
Contribution
The paper introduces a benchmark with structured identification and estimation tasks, enabling granular evaluation of causal reasoning in automated systems.
Findings
The LLM correctly identifies the high-level strategy in 79% of cases.
Accuracy on the full identification specification drops to 34%.
The bottleneck is in research design details, not computation.
Abstract
Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 132 real-world datasets, curated from 79 peer-reviewed research papers and three widely used causal-inference textbooks. For each query, a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it…
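The two-part output described in the abstract can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the field names, the example values, the exact-match rule for the identification specification, and the interval-coverage rule for the estimate are hypothetical and are not taken from the benchmark's actual schema or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentificationSpec:
    # Hypothetical field names; the benchmark's real schema may differ.
    strategy: str
    treatment: str
    outcome: str
    controls: frozenset = frozenset()
    # Design-specific elements as (name, value) pairs, order-insensitive.
    design_elements: frozenset = frozenset()


@dataclass(frozen=True)
class Estimate:
    point: float
    std_error: float


def score_identification(pred: IdentificationSpec, ref: IdentificationSpec) -> dict:
    """Score the high-level strategy and the full specification separately."""
    return {
        "strategy_match": pred.strategy == ref.strategy,
        # Dataclass equality compares every field (assumed exact-match rule).
        "full_spec_match": pred == ref,
    }


def score_estimation(pred: Estimate, ref: Estimate, z: float = 1.96) -> dict:
    """Assumed rule: the reference point lies inside the predicted z-interval."""
    abs_error = abs(pred.point - ref.point)
    return {
        "abs_error": abs_error,
        "covers_reference": abs_error <= z * pred.std_error,
    }


# Toy, made-up query: the strategy is right but one design detail is wrong.
ref_spec = IdentificationSpec(
    strategy="regression_discontinuity",
    treatment="eligible",
    outcome="earnings",
    design_elements=frozenset({("running_variable", "score"), ("cutoff", "50")}),
)
pred_spec = IdentificationSpec(
    strategy="regression_discontinuity",
    treatment="eligible",
    outcome="earnings",
    design_elements=frozenset({("running_variable", "score"), ("cutoff", "60")}),
)

id_scores = score_identification(pred_spec, ref_spec)
est_scores = score_estimation(Estimate(1.0, 0.5), Estimate(1.5, 0.4))
print(id_scores)
print(est_scores)
```

Keeping the two scoring functions independent mirrors the separation the abstract argues for: a system can match the strategy yet miss a design detail, and that failure stays visible regardless of how close its numerical estimate is.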
