A Comparative Study of Open-Source Libraries for Synthetic Tabular Data Generation: SDV vs. SynthCity
Cristian Del Gobbo

TL;DR
This study compares open-source synthetic tabular data generators from SDV and SynthCity on real-world energy data, evaluating statistical similarity and predictive utility and assessing each library's usability for machine learning tasks.
Contribution
The paper evaluates six open-source synthetic data generators, comparing their performance and usability in a low-data regime on a real-world dataset.
Findings
Bayesian Network from SynthCity achieved the highest data fidelity.
TVAE from SDV performed best on predictive utility in the larger-data scenarios.
SDV offers better documentation and ease of use for practitioners.
Abstract
High-quality training data is critical to the performance of machine learning models, particularly Large Language Models (LLMs). However, obtaining real, high-quality data can be challenging, especially for smaller organizations and early-stage startups. Synthetic data generators provide a promising solution by replicating the statistical and structural properties of real data while preserving privacy and scalability. This study evaluates the performance of six tabular synthetic data generators from two widely used open-source libraries: SDV (Gaussian Copula, CTGAN, TVAE) and SynthCity (Bayesian Network, CTGAN, TVAE). Using a real-world dataset from the UCI Machine Learning Repository, comprising energy consumption and environmental variables from Belgium, we simulate a low-data regime by training models on only 1,000 rows. Each generator is then tasked with producing synthetic…
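The workflow the abstract describes (subsample to a low-data regime, fit a generator, sample synthetic rows, score statistical similarity) can be sketched in plain Python. This is only an illustration under stated assumptions, not the paper's code: the "real" table is randomly generated stand-in data rather than the UCI dataset, the generator is a deliberately naive independent-Gaussian model rather than any SDV or SynthCity synthesizer, the column names and the synthetic sample size are hypothetical, and the similarity score (one minus the two-sample Kolmogorov-Smirnov statistic per column) is an assumed metric that the abstract does not name.

```python
import random
from bisect import bisect_right
from statistics import mean, stdev


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def fit_independent_gaussians(table):
    """Naive stand-in generator: one (mean, std) pair per column, no correlations."""
    return {col: (mean(vals), stdev(vals)) for col, vals in table.items()}


def sample_rows(params, n, rng):
    """Draw n synthetic values per column from the fitted Gaussians."""
    return {
        col: [rng.gauss(mu, sigma) for _ in range(n)]
        for col, (mu, sigma) in params.items()
    }


rng = random.Random(0)
n_real = 5000  # assumption: size of the stand-in "real" table

# Hypothetical stand-in data (NOT the UCI energy dataset).
real = {
    "temperature": [rng.gauss(20.0, 5.0) for _ in range(n_real)],
    "energy_wh": [rng.expovariate(1 / 100.0) for _ in range(n_real)],
}

# Low-data regime: fit on only 1,000 randomly chosen rows.
train_idx = rng.sample(range(n_real), 1000)
train = {col: [vals[i] for i in train_idx] for col, vals in real.items()}

params = fit_independent_gaussians(train)
synthetic = sample_rows(params, n_real, rng)  # sample size is an arbitrary choice

# Per-column similarity in [0, 1]; higher means the marginals match more closely.
scores = {col: 1.0 - ks_statistic(real[col], synthetic[col]) for col in real}
for col, score in scores.items():
    print(f"{col}: similarity = {score:.3f}")
```

The skewed `energy_wh` column scores lower than the roughly normal `temperature` column, because a Gaussian cannot reproduce its shape; this kind of gap is what a fidelity comparison between generators is meant to surface. A per-column score like this ignores relationships between columns, so it would not on its own distinguish a model that captures dependencies from one that does not.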
