Do Contemporary Causal Inference Models Capture Real-World Heterogeneity? Findings from a Large-Scale Benchmark
Haining Yu, Yizhou Sun

TL;DR
This large-scale benchmark finds that most CATE estimates produced by modern models have higher MSE than trivial predictors on real-world datasets, pointing to significant challenges and a need for methodological improvements in capturing heterogeneity.
Contribution
The paper introduces a novel observational sampling method and new statistical metrics to evaluate CATE models on real-world data, uncovering their limited effectiveness.
Findings
62% of CATE estimates have a higher MSE than a trivial zero-effect predictor
In datasets with at least one useful CATE estimate, 80% still have higher MSE than a constant-effect model
Orthogonality-based models outperform other models only 30% of the time
Abstract
We present unexpected findings from a large-scale benchmark study evaluating Conditional Average Treatment Effect (CATE) estimation algorithms, i.e., CATE models. By running 16 modern CATE models on 12 datasets and 43,200 sampled variants generated through diverse observational sampling strategies, we find that: (a) 62% of CATE estimates have a higher Mean Squared Error (MSE) than a trivial zero-effect predictor, rendering them ineffective; (b) in datasets with at least one useful CATE estimate, 80% still have higher MSE than a constant-effect model; and (c) Orthogonality-based models outperform other models only 30% of the time, despite widespread optimism about their performance. These findings highlight significant challenges in current CATE models and underscore the need for broader evaluation and methodological improvements. Our findings stem from a novel application of…
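To make the two baselines in the abstract concrete, the following is a minimal Python sketch of how a CATE estimate might be compared against a zero-effect predictor and a constant-effect predictor. It rests on assumptions that are not taken from the paper: the true per-unit effects `tau_true` are assumed to be known (as in a simulation), whereas the paper works with real-world data and its own statistical metrics, which are not reproduced here. The constant-effect baseline is assumed to predict a single value `ate_hat` for every unit, and the function names `mse` and `compare_to_baselines` are hypothetical.

```python
import numpy as np


def mse(tau_true, tau_hat):
    """Mean squared error between true and predicted per-unit effects."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_hat = np.asarray(tau_hat, dtype=float)
    return float(np.mean((tau_true - tau_hat) ** 2))


def compare_to_baselines(tau_true, tau_hat, ate_hat):
    """Compare a CATE estimate with two trivial predictors.

    Assumption: tau_true is known, which holds only for simulated data.
    ate_hat is a single average-effect estimate used by the
    constant-effect baseline (hypothetical input, estimated elsewhere).
    """
    tau_true = np.asarray(tau_true, dtype=float)
    model_mse = mse(tau_true, tau_hat)
    # Zero-effect predictor: predicts no treatment effect for anyone.
    zero_mse = mse(tau_true, np.zeros_like(tau_true))
    # Constant-effect predictor: predicts the same effect for everyone.
    const_mse = mse(tau_true, np.full_like(tau_true, ate_hat))
    return {
        "model_mse": model_mse,
        "zero_effect_mse": zero_mse,
        "constant_effect_mse": const_mse,
        "beats_zero_effect": model_mse < zero_mse,
        "beats_constant_effect": model_mse < const_mse,
    }


# Toy illustration with made-up numbers (not data from the paper).
tau_true = [0.1, 0.2, 0.3, 0.4]
noisy_estimate = [1.0, -1.0, 1.0, -1.0]      # high-variance CATE estimate
close_estimate = [0.12, 0.18, 0.33, 0.41]    # accurate CATE estimate

print(compare_to_baselines(tau_true, noisy_estimate, ate_hat=0.25))
print(compare_to_baselines(tau_true, close_estimate, ate_hat=0.25))
```

In this toy case the true effects are small and vary little, so a noisy estimate can lose to both trivial predictors even though it tries to model heterogeneity, while an accurate estimate beats both. This mirrors the kind of comparison behind findings (a) and (b), under the stated assumptions.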
Taxonomy
Topics: Recommender Systems and Techniques
