CausalSynth: Generating Structurally Sound Synthetic Data
Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

TL;DR
CausalSynth is a novel framework that generates causally valid synthetic data by decoupling causal structure creation from linguistic realization, using an iterative refinement process with large language models.
Contribution
It introduces a three-phase method combining causal skeleton generation, LLM-based realization, and iterative correction to produce causally consistent and linguistically rich synthetic data.
Findings
Preserved conditional independencies with near-nominal false-positive rates.
Achieved over 96% realizability rate on three causal benchmarks.
Reduced bias from LLM priors through iterative correction.
Abstract
Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects…
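The first phase, ancestral sampling from an SCM, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three-variable DAG, the variable names, the probabilities in the structural equations, and the helper name `sample_skeleton`. The sketch only shows the general technique: visit nodes in topological order and evaluate each structural equation on its already-sampled parents plus independent noise.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy DAG (not from the paper), given as node -> list of parents.
PARENTS = {"smoking": [], "tar": ["smoking"], "cancer": ["tar"]}

# Hypothetical structural equations: each maps (parent values, uniform noise u)
# to a binary value. The probabilities are illustrative assumptions.
EQUATIONS = {
    "smoking": lambda pa, u: int(u < 0.3),
    "tar": lambda pa, u: int(u < (0.8 if pa["smoking"] else 0.1)),
    "cancer": lambda pa, u: int(u < (0.6 if pa["tar"] else 0.05)),
}


def sample_skeleton(rng):
    """Draw one causal skeleton (a full variable assignment) by ancestral sampling."""
    values = {}
    # static_order() yields every node after all of its parents.
    for node in TopologicalSorter(PARENTS).static_order():
        parent_values = {p: values[p] for p in PARENTS[node]}
        # Each node gets its own independent exogenous noise draw.
        values[node] = EQUATIONS[node](parent_values, rng.random())
    return values


rng = random.Random(0)  # fixed seed so the draws are reproducible
skeletons = [sample_skeleton(rng) for _ in range(5)]
for skeleton in skeletons:
    print(skeleton)
```

Because every variable is computed only from its parents and fresh noise, the sampled assignments follow a distribution that factorizes along the DAG. In a pipeline like the one described, each such assignment would then be handed to the LLM realizer as the constraint it must express in text.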
