CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Ayush Sawarni; Jiyuan Tan; Vasilis Syrgkanis

arXiv:2602.20571·cs.AI·May 15, 2026

TL;DR

CausalReasoningBenchmark is a real-world benchmark that scores causal identification and causal estimation separately, so that a failure of an automated causal inference system can be traced either to the research design or to its numerical implementation.

Contribution

The paper introduces a benchmark in which every query requires both a structured identification specification and a numerical estimate, so the two steps of causal analysis can be evaluated individually.

Findings

01 The evaluated LLM identifies the correct high-level strategy in 79% of cases.

02 Accuracy on the full identification specification drops to 34%.

03 The bottleneck is in research design details, not computation.

Abstract

Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 132 real-world datasets, curated from 79 peer-reviewed research papers and three widely used causal-inference textbooks. For each query a system must produce (i) a structured identification specification that names the strategy; the treatment, outcome, and control variables; and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it…
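
The two-part output described in the abstract can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class names, field names, toy values, and scoring rules below are hypothetical and are not taken from the benchmark's actual schema or metrics. It shows why a system can match the high-level strategy while still failing the full identification specification, and how an estimate might be checked independently of the design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentificationSpec:
    """Hypothetical container for the structured identification output."""
    strategy: str                            # e.g. "instrumental_variables"
    treatment: str
    outcome: str
    controls: frozenset = frozenset()        # control variable names
    design_elements: frozenset = frozenset() # (name, value) pairs, e.g. the instrument


@dataclass(frozen=True)
class Estimate:
    """Hypothetical container for the numerical output."""
    point: float
    std_error: float


def strategy_match(pred: IdentificationSpec, gold: IdentificationSpec) -> bool:
    # Coarse check: only the named strategy has to agree.
    return pred.strategy == gold.strategy


def full_spec_match(pred: IdentificationSpec, gold: IdentificationSpec) -> bool:
    # Strict check: every field of the specification has to agree.
    return pred == gold


def estimate_within(pred: Estimate, gold: Estimate, z: float = 1.96) -> bool:
    # Assumed tolerance rule (not the paper's metric): the predicted point
    # estimate lies within z reference standard errors of the reference value.
    return abs(pred.point - gold.point) <= z * gold.std_error


# Toy values, invented purely for illustration.
gold = IdentificationSpec(
    strategy="instrumental_variables",
    treatment="d",
    outcome="y",
    controls=frozenset({"x1", "x2"}),
    design_elements=frozenset({("instrument", "z1")}),
)
pred = IdentificationSpec(
    strategy="instrumental_variables",
    treatment="d",
    outcome="y",
    controls=frozenset({"x1", "x2"}),
    design_elements=frozenset({("instrument", "z2")}),  # wrong instrument
)

print(strategy_match(pred, gold))   # strategy agrees
print(full_spec_match(pred, gold))  # design detail differs
print(estimate_within(Estimate(0.12, 0.05), Estimate(0.10, 0.02)))
```

In this toy case the prediction passes the strategy-level check and fails the full-specification check because of a single design element, which mirrors the kind of gap the separate scores are meant to expose.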

Code & Models

Repositories

https://huggingface.co/datasets/CausalReasoningBenchmark
github

Datasets

syrgkanislab/CausalReasoningBenchmark
dataset · 649 downloads