CausalProfiler: Generating Synthetic Benchmarks for Rigorous and Transparent Evaluation of Causal Machine Learning

Panayiotis Panayiotou; Audrey Poinsot; Alessandro Leite; Nicolas Chesneau; Marc Schoenauer; Özgür Şimşek

arXiv:2511.22842·cs.LG·January 8, 2026

Open Access · 3 Reviews

TL;DR

CausalProfiler is a synthetic benchmark generator that enables rigorous, transparent evaluation of causal machine learning methods across diverse models, data, and causal reasoning levels, addressing limitations of existing benchmarks.

Contribution

It introduces the first random generator of synthetic causal benchmarks with coverage guarantees and transparent assumptions for comprehensive evaluation of Causal ML methods.

Findings

1. Evaluated several state-of-the-art Causal ML methods under diverse conditions.
2. Demonstrated the utility of CausalProfiler in revealing method strengths and limitations.
3. Enabled analysis both within and outside the identification regime.

Abstract

Causal machine learning (Causal ML) aims to answer "what if" questions using machine learning algorithms, making it a promising tool for high-stakes decision-making. Yet, empirical evaluation practices in Causal ML remain limited. Existing benchmarks often rely on a handful of hand-crafted or semi-synthetic datasets, leading to brittle, non-generalizable conclusions. To bridge this gap, we introduce CausalProfiler, a synthetic benchmark generator for Causal ML methods. Based on a set of explicit design choices about the class of causal models, queries, and data considered, the CausalProfiler randomly samples causal models, data, queries, and ground truths constituting the synthetic causal benchmarks. In this way, Causal ML methods can be rigorously and transparently evaluated under a variety of conditions. This work offers the first random generator of synthetic causal benchmarks with…
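The sampling loop the abstract describes (causal model, then data, then query, then ground truth) can be illustrated with a minimal sketch. This is not CausalProfiler's API: the function names `sample_linear_scm`, `simulate`, and `ground_truth_ate` are hypothetical, and the linear-Gaussian model class, edge probability, and weight range are assumptions chosen only to keep the example short.

```python
import numpy as np

def sample_linear_scm(n_nodes, edge_prob, rng):
    """Sample edge weights of a random linear SCM (hypothetical helper).

    weights[i, j] != 0 means an edge i -> j. Keeping only the strict upper
    triangle guarantees a DAG whose node order is already topological.
    """
    mask = np.triu(rng.random((n_nodes, n_nodes)) < edge_prob, k=1)
    values = rng.uniform(-2.0, 2.0, size=(n_nodes, n_nodes))
    return np.where(mask, values, 0.0)

def simulate(weights, n_samples, rng, do=None):
    """Draw samples; `do` maps a node index to a fixed value (hard intervention)."""
    do = do or {}
    n_nodes = weights.shape[0]
    x = np.zeros((n_samples, n_nodes))
    noise = rng.normal(size=(n_samples, n_nodes))
    for j in range(n_nodes):  # visit nodes in topological order
        if j in do:
            x[:, j] = do[j]  # intervention overrides the structural equation
        else:
            x[:, j] = x @ weights[:, j] + noise[:, j]
    return x

def ground_truth_ate(weights, treatment, outcome):
    """Total causal effect of a unit change in `treatment` on `outcome`.

    For a linear SCM this is the (treatment, outcome) entry of (I - W)^{-1},
    i.e. the sum over directed paths of the product of edge weights.
    """
    n_nodes = weights.shape[0]
    return np.linalg.inv(np.eye(n_nodes) - weights)[treatment, outcome]

rng = np.random.default_rng(0)
W = sample_linear_scm(n_nodes=5, edge_prob=0.5, rng=rng)   # causal model
data = simulate(W, n_samples=1000, rng=rng)                # observational data
ate = ground_truth_ate(W, treatment=0, outcome=4)          # query + ground truth
```

A method under evaluation would see only `data` and the query, and its estimate would be scored against `ate`. Repeating the loop over many sampled models gives a distribution of errors rather than a single number on one hand-crafted dataset.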

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- CausalProfiler offers a novel, randomized generator for causal ML evaluation. Unlike static, hand-crafted datasets, it can instantiate diverse causal models and queries. Crucially, it is the first framework to cover all three levels of Pearl’s causal hierarchy (observational, interventional, and counterfactual), allowing for broad algorithm testing against known ground truth.
- The framework operates by sampling from a user-defined "Space of Interest" (e.g., graph types, data distributions).
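For readers unfamiliar with the three levels of Pearl's causal hierarchy mentioned above, the following toy example shows how each level yields a different quantity on the same model. The structural equations (Z confounds X and Y, with coefficients 2 and 3) are an assumption made up for illustration; they do not come from the paper or from CausalProfiler.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
u_z, u_x, u_y = rng.normal(size=(3, n))  # independent exogenous noise terms

def outcome(x, z, u_y):
    """Structural equation for Y in the toy SCM (assumed, not from the paper)."""
    return 2.0 * x + 3.0 * z + u_y

# Toy SCM: Z := U_z, X := Z + U_x, Y := 2X + 3Z + U_y
z = u_z
x = z + u_x
y = outcome(x, z, u_y)

# Level 1 (observational): regression slope of Y on X, biased by the confounder Z.
# In the population this equals 3.5, not the causal coefficient 2.
obs_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Level 2 (interventional): E[Y | do(X=1)] - E[Y | do(X=0)].
# Setting X by hand cuts the Z -> X edge, so the confounding disappears.
ate = (outcome(1.0, z, u_y) - outcome(0.0, z, u_y)).mean()

# Level 3 (counterfactual) for unit 0: "what would Y have been had X been one higher?"
i = 0
u_y_i = y[i] - 2.0 * x[i] - 3.0 * z[i]       # abduction: recover this unit's noise
y_cf = outcome(x[i] + 1.0, z[i], u_y_i)      # action + prediction
```

Because the generating model is known, all three answers are available exactly, which is what makes synthetic benchmarks usable as ground truth at every level.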

Weaknesses

- The evaluation relies entirely on synthetic data. While this provides ground truth, it fails to demonstrate how insights from CausalProfiler generalize to the complexity and "messiness" of real-world applications. The lack of real or semi-synthetic case studies makes it unclear if the framework's conclusions hold in practice.
- The framework's utility is constrained by its initial design assumptions (e.g., graph types, functional forms). It is unclear if the current "space of interest" adequa

Reviewer 02 · Rating 2 · Confidence 5

Strengths

Causal inference is an extremely important problem, and developing comprehensive benchmarks to evaluate the performance of related methods under controlled conditions is essential.

Weaknesses

My main concern with this paper is its lack of novelty. Synthetic datasets have long been used to assess causal inference performance. The paper appears to unify existing methods to create a synthetic data generator. However, it still focuses mainly on synthetic data generation, rather than addressing the major benchmarking issue in causal inference: ensuring the realism of the data and assessing performance on real-world datasets. I also highly suggest that the paper be submitted to a benchmark

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The authors' focus on evaluation and description of the challenges in evaluation are compelling and motivating. With an increase in focus on finding new ways to produce more realistic datasets for evaluation, the authors' choice to focus on more rigorous synthetic evaluation is a smart one. Any evaluation with empirical data is going to be inherently limited due to the reasons the authors discuss, so synthetic data will continue to be used to explore the range of an algorithm's performance. I

Weaknesses

A lot of the rhetoric in the first half of the paper is overly broad and grandiose. This ends up over-selling CausalProfiler so that, when we actually see how it's used in the Experiments section, it seems to fall short of the promises the paper seemed to make. Really, CausalProfiler seems like a flexible and useful tool, but my expectations were set unrealistically high which, rather than making it seem like a powerful tool, only results in a sort of "this is it?" feeling by the end of the pa

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Explainable Artificial Intelligence (XAI) · Advanced Causal Inference Techniques