SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources
Zheng Li, Yue Zhao, Jialin Fu

TL;DR
The paper introduces SYNC, a Gaussian copula-based framework that generates high-resolution synthetic data (e.g., individual-level records) from aggregated low-resolution sources, for cases where direct collection is difficult or costly.
Contribution
It presents a novel multi-stage framework combining machine learning and statistical methods for synthetic data generation from aggregated data sources.
Findings
SYNC accurately captures dependencies and marginals in simulations.
It effectively generates high-resolution data for feature engineering.
The framework is scalable and adaptable to real-world datasets.
Abstract
A synthetic dataset is a data object that is generated programmatically, and it may be valuable for creating a single dataset from multiple sources when direct collection is difficult or costly. Although it is a fundamental step for many data science tasks, an efficient and standard framework is absent. In this paper, we study a specific synthetic data generation task called downscaling, a procedure to infer high-resolution, harder-to-collect information (e.g., individual level records) from many low-resolution, easy-to-collect sources, and propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula). For given low-resolution datasets, the central idea of SYNC is to fit Gaussian copula models to each of the low-resolution datasets in order to correctly capture dependencies and marginal distributions, and then sample from the fitted models to obtain the…
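The fit-then-sample idea at the core of the abstract can be illustrated with a minimal sketch of a generic Gaussian copula. This is not the authors' implementation and it covers none of the other stages of SYNC; it only shows how a Gaussian copula separates dependence (a correlation matrix of normal scores) from marginals (here, empirical quantiles). The function names `fit_gaussian_copula` and `sample_gaussian_copula`, the toy data, and the choice of empirical marginals are all assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

# Standard normal CDF and inverse CDF, vectorized over NumPy arrays.
_STD_NORMAL = NormalDist()
_norm_ppf = np.vectorize(_STD_NORMAL.inv_cdf)
_norm_cdf = np.vectorize(_STD_NORMAL.cdf)


def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from an (n, d) sample (hypothetical helper)."""
    n, _ = data.shape
    # Rank-transform each column to pseudo-observations strictly inside (0, 1).
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    u = ranks / (n + 1)
    # Map to normal scores; their correlation parameterizes the copula.
    z = _norm_ppf(u)
    return np.corrcoef(z, rowvar=False)


def sample_gaussian_copula(corr, data, n_samples, rng):
    """Draw rows with dependence from `corr` and empirical marginals from `data`."""
    d = corr.shape[0]
    # Correlated standard normals carry the dependence structure.
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    # Push through the normal CDF to get correlated uniforms.
    u = _norm_cdf(z)
    # Invert each empirical marginal with a quantile lookup.
    return np.column_stack([np.quantile(data[:, j], u[:, j]) for j in range(d)])


# Toy data (assumed for illustration): a skewed column and a correlated noisy column.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
toy = np.column_stack([np.exp(x), x + 0.5 * rng.normal(size=2000)])

corr = fit_gaussian_copula(toy)
synthetic = sample_gaussian_copula(corr, toy, 1000, rng)
```

Because the marginals are inverted separately from the dependence step, the skewed first column keeps its shape in `synthetic` while the correlation between the columns is carried by `corr`. In a downscaling setting the inputs would be aggregated rather than individual-level, so this sketch should be read only as the copula building block.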
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · demographic modeling and climate adaptation · Time Series Analysis and Forecasting
