TabSCM: A Practical Framework for Generating Realistic Tabular Data
Sven Jacob, Bardh Prenkaj, Weijia Shao, Gjergji Kasneci

TL;DR
TabSCM is a practical framework for generating realistic, causally consistent tabular data that outperforms existing methods in fidelity, utility, and privacy, while enabling efficient and interpretable counterfactual analysis.
Contribution
TabSCM introduces a causal-structure-aware generator for tabular data that combines explicit structural equations, topologically ordered modeling, and diffusion models for improved realism and interpretability.
Findings
Matches or surpasses state-of-the-art generators in fidelity, utility, and privacy.
Runs up to 583 times faster than diffusion-only models.
Reduces rule-violation rates and supports causal interventions.
Abstract
Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves causal dependencies. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) found by any causal structure discovery algorithm, TabSCM (i) orients edges to a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns topologically ordered structural assignments. These assignments use conditional diffusion models for continuous child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets, encompassing healthcare, finance, housing, and environment, TabSCM matches or surpasses state-of-the-art GAN,…
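The sampling procedure the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: the three-node graph, the toy observations, the bandwidth, and the hand-written `blood_pressure_assignment` are all hypothetical stand-ins (the paper uses conditional diffusion models and gradient-boosted trees for child nodes). The sketch only shows how root nodes are drawn from a KDE or from categorical frequencies, how child nodes are then filled in topological order, and how an intervention overrides a node's assignment. It performs interventional sampling only and does not reproduce the exact counterfactual queries mentioned in the abstract.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG: each node maps to the list of its parents.
PARENTS = {
    "age": [],
    "smoker": [],
    "blood_pressure": ["age", "smoker"],
}

# Made-up observations used to fit the root-node marginals.
AGE_OBS = [23.0, 35.0, 41.0, 58.0, 62.0]
SMOKER_OBS = ["no", "no", "yes", "no", "yes"]


def sample_kde(observations, bandwidth, rng):
    """Draw from a Gaussian KDE: pick an observation, then add Gaussian noise."""
    return rng.choice(observations) + rng.gauss(0.0, bandwidth)


def sample_categorical(observations, rng):
    """Draw a category with probability equal to its empirical frequency."""
    return rng.choice(observations)


def blood_pressure_assignment(parents, rng):
    """Assumed stand-in for a learned conditional model of a child node."""
    base = 100.0 + 0.5 * parents["age"]
    if parents["smoker"] == "yes":
        base += 10.0
    return base + rng.gauss(0.0, 5.0)


# One structural assignment per node; root nodes ignore their (empty) parents.
ASSIGNMENTS = {
    "age": lambda parents, rng: sample_kde(AGE_OBS, 2.0, rng),
    "smoker": lambda parents, rng: sample_categorical(SMOKER_OBS, rng),
    "blood_pressure": blood_pressure_assignment,
}


def ancestral_sample(rng, interventions=None):
    """Sample one record in topological order; intervened nodes are clamped."""
    interventions = interventions or {}
    record = {}
    # static_order() yields every node after all of its parents.
    for node in TopologicalSorter(PARENTS).static_order():
        if node in interventions:
            record[node] = interventions[node]
        else:
            parent_values = {p: record[p] for p in PARENTS[node]}
            record[node] = ASSIGNMENTS[node](parent_values, rng)
    return record


if __name__ == "__main__":
    rng = random.Random(0)
    print(ancestral_sample(rng))
    print(ancestral_sample(rng, interventions={"smoker": "yes"}))
```

Because every node is generated only after its parents, a clamped value propagates to all of its descendants, which is what makes interventions cheap in this kind of generator.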
