A Technical Exploration of Causal Inference with Hybrid LLM Synthetic Data
Dana Kim, Yichen Xu, and Tiffany Lin

TL;DR
This paper explores how to use Large Language Models to generate synthetic data that preserves causal effects, and proposes a hybrid framework that makes causal inference on synthetic datasets more reliable.
Contribution
The paper introduces a hybrid data generation framework combining model-based covariate synthesis with causal structure preservation, enhancing causal effect estimation in synthetic data.
Findings
State-of-the-art synthetic generators often misestimate causal effects.
The proposed hybrid framework improves causal effect preservation.
Benchmarking shows better causal inference performance with the new method.
Abstract
Large Language Models (LLMs) offer a flexible means to generate synthetic tabular data, yet existing approaches often fail to preserve key causal parameters such as the average treatment effect (ATE). In this technical exploration, we first demonstrate that state-of-the-art synthetic data generators, both GAN- and LLM-based, can achieve high predictive fidelity while substantially misestimating causal effects. To address this gap, we propose a hybrid generation framework that combines model-based covariate synthesis (monitored via distance-to-closest-record filtering) with separately learned propensity and outcome models, thereby ensuring that (W, A, Y) triplets retain their underlying causal structure. We further introduce a synthetic pairing strategy to mitigate positivity violations and a realistic evaluation protocol that leverages unlimited synthetic samples to benchmark…
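The hybrid idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data-generating process, the jitter-based covariate generator (a stand-in for an LLM or GAN), the direction and threshold of the distance-to-closest-record filter, and all function names are assumptions made for illustration. Covariates W are synthesized first, then treatment A is drawn from a separately fitted propensity model and outcome Y from a separately fitted outcome model, so the (W, A, Y) triplets carry the learned causal structure.

```python
import math
import random

random.seed(0)

def simulate_real(n):
    """Hypothetical 'real' data: W confounds A and Y; the true ATE is 2.0."""
    rows = []
    for _ in range(n):
        w = random.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-w))            # true P(A=1 | W)
        a = 1 if random.random() < p else 0
        y = 2.0 * a + 1.5 * w + random.gauss(0.0, 0.5)
        rows.append((w, a, y))
    return rows

def fit_propensity(rows, steps=300, lr=0.5):
    """Logistic regression A ~ W, fitted by plain gradient descent."""
    b0 = b1 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for w, a, _ in rows:
            err = 1.0 / (1.0 + math.exp(-(b0 + b1 * w))) - a
            g0 += err
            g1 += err * w
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return lambda w: 1.0 / (1.0 + math.exp(-(b0 + b1 * w)))

def fit_outcome(rows):
    """Separate least-squares lines Y ~ W per treatment arm, plus residual sd."""
    coefs, resid = {}, []
    for arm in (0, 1):
        pts = [(w, y) for w, a, y in rows if a == arm]
        mw = sum(w for w, _ in pts) / len(pts)
        my = sum(y for _, y in pts) / len(pts)
        sxx = sum((w - mw) ** 2 for w, _ in pts)
        slope = sum((w - mw) * (y - my) for w, y in pts) / sxx
        intercept = my - slope * mw
        coefs[arm] = (intercept, slope)
        resid += [y - (intercept + slope * w) for w, y in pts]
    sigma = math.sqrt(sum(r * r for r in resid) / len(resid))
    return (lambda a, w: coefs[a][0] + coefs[a][1] * w), sigma

def synthesize_covariates(real_w, n, jitter=0.3):
    """Stand-in covariate generator (NOT an LLM): resample real W with noise."""
    return [random.choice(real_w) + random.gauss(0.0, jitter) for _ in range(n)]

def dcr_filter(synth_w, real_w, min_dist):
    """Assumed filter: keep records at least min_dist from every real record."""
    return [s for s in synth_w if min(abs(s - r) for r in real_w) >= min_dist]

def hybrid_generate(real_rows, n, min_dist=1e-4):
    real_w = [w for w, _, _ in real_rows]
    g = fit_propensity(real_rows)                 # propensity model
    q, sigma = fit_outcome(real_rows)             # outcome model
    synth = []
    for w in dcr_filter(synthesize_covariates(real_w, n), real_w, min_dist):
        a = 1 if random.random() < g(w) else 0    # A drawn from fitted propensity
        y = q(a, w) + random.gauss(0.0, sigma)    # Y drawn from fitted outcome
        synth.append((w, a, y))
    return synth, q

real = simulate_real(1000)
synth, q = hybrid_generate(real, 1000)

# Plug-in ATE on the synthetic covariates vs. the confounded difference in means.
adjusted_ate = sum(q(1, w) - q(0, w) for w, _, _ in synth) / len(synth)
treated = [y for _, a, y in synth if a == 1]
control = [y for _, a, y in synth if a == 0]
naive_diff = sum(treated) / len(treated) - sum(control) / len(control)
print("adjusted ATE:", round(adjusted_ate, 3))
print("naive difference in means:", round(naive_diff, 3))
```

Because the outcome model is fitted on its own rather than left implicit in a joint generator, the treatment effect encoded in the synthetic data can be read off directly from the fitted model, and the sketch can draw as many synthetic rows as needed for benchmarking estimators.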
Taxonomy
Topics: Advanced Causal Inference Techniques · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare
