Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities
Yichen Xu

TL;DR
This paper examines the limitations of current synthetic data generation methods for causal inference and proposes a hybrid approach that improves the preservation of causal effects, especially the average treatment effect (ATE).
Contribution
The paper formalizes the structural mismatch between predictive training objectives and ATE preservation in synthetic data, and introduces a hybrid synthesis framework that improves causal fidelity over fully generative models.
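One standard way to see a mismatch of this kind is sketched below. It is an illustration under assumed notation, not necessarily the paper's own statement: a binary treatment T, covariates X, propensity e(x) = P(T=1 | X=x), outcome regression \mu(x,t) = E[Y | X=x, T=t], and conditional effect \tau(x) = \mu(x,1) - \mu(x,0). Hatted quantities belong to a hypothetical fitted generator, and \hat{P}_X is its covariate law.

```latex
% Assumed notation (not taken from the paper):
%   m(x) = E[Y | X = x] = \mu(x,0) + e(x)\,\tau(x)
%   \hat m(x) := \hat\mu(x,0) + e(x)\,\hat\tau(x)
\begin{align*}
\mu(x,t) &= m(x) + \bigl(t - e(x)\bigr)\,\tau(x), \\
\mathbb{E}\bigl[(\hat\mu(X,T) - \mu(X,T))^2\bigr]
  &= \mathbb{E}\bigl[(\hat m(X) - m(X))^2\bigr]
   + \mathbb{E}\bigl[e(X)\,(1 - e(X))\,(\hat\tau(X) - \tau(X))^2\bigr], \\
\mathbb{E}_{\hat P_X}[\hat\tau(X)] - \mathbb{E}_{P_X}[\tau(X)]
  &= \underbrace{\mathbb{E}_{\hat P_X}[\hat\tau(X)] - \mathbb{E}_{P_X}[\hat\tau(X)]}_{\text{covariate-law error}}
   + \underbrace{\mathbb{E}_{P_X}[\hat\tau(X) - \tau(X)]}_{\text{contrast error}}.
\end{align*}
```

The cross term in the second line vanishes because E[T - e(X) | X] = 0, and E[(T - e(X))^2 | X] = e(X)(1 - e(X)). Squared prediction error therefore sees errors in \tau only through the weight e(X)(1 - e(X)), which is small wherever overlap is weak, while the ATE error in the third line carries the contrast error with no such weight.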
Findings
Hybrid synthesis improves ATE preservation compared to fully generative models.
LLM-based hybrid synthesis often outperforms CTGAN in causal fidelity.
The framework enables synthetic simulation for benchmarking causal estimators.
Abstract
Synthetic tabular data are often evaluated by distributional similarity, privacy distance, or train-on-synthetic-test-on-real predictive performance, but these criteria do not ensure validity for causal inference. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can preserve predictive utility while distorting average treatment effect (ATE) estimates. The failure is structural: ATE preservation requires both a realistic covariate law and an accurate treatment-effect contrast, whereas prediction loss penalizes treatment-effect error only through an overlap-weighted term. We formalize this mismatch through sensitivity and loss-decomposition results, and identify an analogous decomposition in block-level next-token prediction under log loss. Motivated by the tabular causal analysis, we propose a hybrid synthetic-data framework that generates…
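The claim that predictive utility can survive while the ATE is distorted can be checked on a toy example. The sketch below is not the paper's experiment or method. It assumes a hypothetical data-generating process (one Gaussian covariate, a logistic propensity with weak overlap, a constant treatment effect) and a hypothetical "synthesizer" outcome model that gets E[Y | X] right but shrinks the treatment effect. All names and parameter values (`overlap_demo`, `sharpness`, `tau_true`, `tau_synth`) are illustrative assumptions.

```python
import numpy as np

def overlap_demo(n=200_000, sharpness=6.0, tau_true=2.0, tau_synth=0.5, seed=0):
    """Compare excess prediction loss with ATE error for a mis-specified effect.

    Hypothetical setup: X ~ N(0, 1), T ~ Bernoulli(e(X)), Y = X + tau_true*T + noise.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # A steep logistic propensity pushes e(x) toward 0 or 1, i.e. weak overlap.
    e = 1.0 / (1.0 + np.exp(-sharpness * x))
    t = rng.binomial(1, e)
    y = x + tau_true * t + rng.normal(size=n)

    def predict(tau):
        # E[Y | X] = x + e*tau_true is kept correct; only the contrast tau varies.
        return x + e * tau_true + (t - e) * tau

    mse_real = float(np.mean((y - predict(tau_true)) ** 2))
    mse_synth = float(np.mean((y - predict(tau_synth)) ** 2))
    return {
        "mse_real": mse_real,
        "mse_synth": mse_synth,
        "excess_mse": mse_synth - mse_real,
        # Overlap-weighted contrast error: E[e(1-e)] * (tau_synth - tau_true)^2
        "overlap_weighted_term": float(np.mean(e * (1 - e)) * (tau_synth - tau_true) ** 2),
        # With a constant effect, the model-implied ATE is tau itself.
        "ate_error": abs(tau_synth - tau_true),
    }

result = overlap_demo()
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

In this setup the excess squared-error loss should track the overlap-weighted term closely and stay small relative to the noise variance, while the ATE error equals the full gap between `tau_synth` and `tau_true`. Lowering `sharpness` strengthens overlap and makes the same contrast error more visible in the prediction loss.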
