Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation
George Panagopoulos

TL;DR
This paper systematically compares treatment effect estimation methods using both semi-simulated benchmarks and real-world datasets, revealing gaps between evaluation regimes and emphasizing the importance of real-data validation.
Contribution
The paper provides a large-scale empirical study that highlights discrepancies between counterfactual and observable metrics and shows how they affect model evaluation.
Findings
Model rankings under counterfactual metrics do not reliably agree with rankings under observable metrics.
Rankings on semi-simulated benchmarks do not transfer well to real datasets.
Simple meta-learners with strong base models perform competitively against specialized causal models.
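The difference between the two metric families in the findings above can be illustrated with a minimal sketch. This is not the paper's evaluation code. The function names `pehe` and `uplift_at_k`, the choice of top-k uplift as the observable metric, and the default `k` are all assumptions made for illustration. The counterfactual metric needs the true individual effects, which exist only in simulated or semi-simulated data, while the observable metric uses only outcomes, treatment flags, and the model's ranking.

```python
import numpy as np


def pehe(tau_true, tau_pred):
    """Root PEHE: RMSE between true and predicted individual effects.

    Counterfactual metric: requires tau_true, so it is only computable
    when both potential outcomes are known (simulated data).
    """
    tau_true = np.asarray(tau_true, dtype=float)
    tau_pred = np.asarray(tau_pred, dtype=float)
    return float(np.sqrt(np.mean((tau_true - tau_pred) ** 2)))


def uplift_at_k(y, t, tau_pred, k=0.3):
    """Observable metric (illustrative choice): among the top-k fraction of
    units ranked by predicted effect, mean outcome of treated units minus
    mean outcome of control units.

    Uses only observed outcomes y and binary treatment flags t.
    Returns NaN if the top-k slice lacks treated or control units.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t)
    tau_pred = np.asarray(tau_pred, dtype=float)

    n_top = max(1, int(np.ceil(k * len(y))))
    # Stable sort so that ties keep their original order.
    top = np.argsort(-tau_pred, kind="stable")[:n_top]

    treated = top[t[top] == 1]
    control = top[t[top] == 0]
    if len(treated) == 0 or len(control) == 0:
        return float("nan")
    return float(y[treated].mean() - y[control].mean())
```

One simple way the two can disagree: adding a constant to every predicted effect increases `pehe` but leaves the ranking, and therefore `uplift_at_k`, unchanged. A model can thus look poor under the counterfactual metric and unchanged under the observable one.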
Abstract
Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industrial practice. However, the two communities often evaluate models under markedly different conditions. Methodological work typically relies on semi-simulated benchmarks and metrics that require counterfactual outcomes, whereas real-world applications rely on observable metrics based on ranking or test outcomes. Despite the well-known gap between methodological progress and practical deployment, the relationship between these evaluation regimes has not been examined systematically. We conduct a large-scale empirical study of treatment effect evaluation across standard semi-simulated benchmark families and real-world datasets. Our benchmark covers meta-learners paired with multiple base learners, as well as specialized causal machine learning models. We…
