CausalSynth: Generating Structurally Sound Synthetic Data

Zehua Cheng; Wei Dai; Jiahao Sun; Thomas Lukasiewicz

arXiv:2605.17528·cs.LG·May 19, 2026

TL;DR

CausalSynth is a novel framework that generates causally valid synthetic data by decoupling causal structure creation from linguistic realization, using an iterative refinement process with large language models.

Contribution

The paper introduces a three-phase method that combines causal skeleton generation, LLM-based realization, and iterative correction to produce causally consistent and linguistically rich synthetic data.
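As a rough illustration of how the three phases could fit together, the sketch below wires up a toy loop in Python. Everything in it is an assumption made for illustration: the two-variable graph (`smoker -> cough`), the probabilities, and the function names `sample_skeleton`, `stub_realizer`, `extract`, and `generate` are hypothetical, and `stub_realizer` is a template-based stand-in for an LLM call that deliberately contradicts the skeleton some of the time. It is not the authors' implementation.

```python
import random

def sample_skeleton(rng):
    # Phase 1 (toy): sample parents before children on a hypothetical
    # DAG smoker -> cough. The probabilities are made up.
    smoker = rng.random() < 0.3
    cough = rng.random() < (0.6 if smoker else 0.1)
    return {"smoker": smoker, "cough": cough}

def stub_realizer(skeleton, rng, error_rate=0.3):
    # Phase 2 stand-in for an LLM: renders the skeleton as text, but with
    # probability error_rate flips "cough" to mimic a realization error.
    cough = skeleton["cough"]
    if rng.random() < error_rate:
        cough = not cough
    smoke_txt = "smokes" if skeleton["smoker"] else "does not smoke"
    cough_txt = "reports" if cough else "denies"
    return f"Patient {smoke_txt}; {cough_txt} a cough."

def extract(text):
    # Recover the variable assignment from the generated text.
    return {"smoker": "does not smoke" not in text, "cough": "reports" in text}

def generate(rng, max_rounds=5):
    # Phase 3 (toy): re-realize until the text agrees with the skeleton,
    # or give up after max_rounds and return None for the text.
    skeleton = sample_skeleton(rng)
    for _ in range(max_rounds):
        text = stub_realizer(skeleton, rng)
        if extract(text) == skeleton:
            return skeleton, text
    return skeleton, None

if __name__ == "__main__":
    print(generate(random.Random(0)))
```

The point of the design is that the skeleton, not the realizer, is the source of truth: a realization is accepted only when the variables recovered from the text match the sampled assignment.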

Findings

01. Preserved conditional independencies with near-nominal false-positive rates.

02. Achieved over 96% realizability rate on three causal benchmarks.

03. Reduced bias from LLM priors through iterative correction.

Abstract

Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects…
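Ancestral sampling, as named in the abstract, draws each variable only after all of its parents have been drawn. A minimal sketch follows, assuming a hypothetical linear-Gaussian chain `x -> y -> z`; the `SCM` dictionary layout, the coefficients, and the function name `ancestral_sample` are illustrative choices, not taken from the paper. The topological order comes from the standard-library `graphlib.TopologicalSorter` (Python 3.9+).

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy SCM: variable -> (parents, structural equation).
# Each equation receives the parents' values and an RNG for its noise term.
SCM = {
    "x": ((), lambda pa, rng: rng.gauss(0, 1)),
    "y": (("x",), lambda pa, rng: 2.0 * pa["x"] + rng.gauss(0, 1)),
    "z": (("y",), lambda pa, rng: -1.0 * pa["y"] + rng.gauss(0, 1)),
}

def ancestral_sample(scm, rng):
    # TopologicalSorter takes node -> predecessors, so static_order()
    # yields every parent before any of its children.
    graph = {v: set(parents) for v, (parents, _) in scm.items()}
    values = {}
    for v in TopologicalSorter(graph).static_order():
        parents, equation = scm[v]
        values[v] = equation({p: values[p] for p in parents}, rng)
    return values

if __name__ == "__main__":
    print(ancestral_sample(SCM, random.Random(0)))
```

In this chain, `z` depends on `x` only through `y`, so samples drawn this way satisfy the conditional independence of `x` and `z` given `y` implied by the graph.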
