SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources

Zheng Li; Yue Zhao; Jialin Fu

arXiv:2009.09471·stat.AP·September 22, 2020·6 cites

TL;DR

The paper introduces SYNC, a Gaussian copula-based framework that generates high-resolution synthetic data (e.g., individual-level records) from aggregated low-resolution sources, for cases where collecting such data directly is difficult or costly.

Contribution

It presents a novel multi-stage framework combining machine learning and statistical methods for synthetic data generation from aggregated data sources.

Findings

01

SYNC accurately captures dependencies and marginals in simulations.

02

It effectively generates high-resolution data for feature engineering.

03

The framework is scalable and adaptable to real-world datasets.

Abstract

A synthetic dataset is a data object that is generated programmatically, and it may be valuable for creating a single dataset from multiple sources when direct collection is difficult or costly. Although it is a fundamental step for many data science tasks, an efficient and standard framework is absent. In this paper, we study a specific synthetic data generation task called downscaling, a procedure to infer high-resolution, harder-to-collect information (e.g., individual level records) from many low-resolution, easy-to-collect sources, and propose a multi-stage framework called SYNC (Synthetic Data Generation via Gaussian Copula). For given low-resolution datasets, the central idea of SYNC is to fit Gaussian copula models to each of the low-resolution datasets in order to correctly capture dependencies and marginal distributions, and then sample from the fitted models to obtain the…
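The fit-then-sample idea described in the abstract can be illustrated with a generic Gaussian copula. The following is a minimal sketch under stated assumptions, not the SYNC pipeline itself: it covers only the copula step on a single toy table, the function names `fit_gaussian_copula` and `sample_gaussian_copula` are hypothetical, the marginals are handled with empirical quantiles (one possible choice), and the toy data is made up for illustration.

```python
import numpy as np
from scipy import stats


def fit_gaussian_copula(data):
    """Estimate the copula correlation matrix from normal scores of the ranks."""
    n = data.shape[0]
    # Pseudo-observations in (0, 1); dividing by n + 1 avoids exact 0 and 1.
    u = stats.rankdata(data, axis=0) / (n + 1)
    # Map the uniform scores to standard normal scores.
    z = stats.norm.ppf(u)
    return np.corrcoef(z, rowvar=False)


def sample_gaussian_copula(data, corr, n_samples, rng):
    """Draw rows whose dependence follows `corr` and whose marginals follow `data`."""
    d = data.shape[1]
    # Correlated standard normals carry the dependence structure.
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    # Transform to uniforms, then invert each empirical marginal.
    u = stats.norm.cdf(z)
    columns = [np.quantile(data[:, j], u[:, j]) for j in range(d)]
    return np.column_stack(columns)


# Toy data (an assumption for illustration): a skewed variable and a noisy copy.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, size=2000)
y = x + rng.normal(size=2000)
data = np.column_stack([x, y])

corr = fit_gaussian_copula(data)
synthetic = sample_gaussian_copula(data, corr, n_samples=1000, rng=rng)

# Rank correlation is the natural check, since the copula models ranks.
rho_real = stats.spearmanr(data[:, 0], data[:, 1])[0]
rho_synthetic = stats.spearmanr(synthetic[:, 0], synthetic[:, 1])[0]
print(f"Spearman rho, original: {rho_real:.3f}, synthetic: {rho_synthetic:.3f}")
```

Separating the dependence structure (the correlation matrix of the normal scores) from the marginals is what lets each part be estimated on its own. Because the marginals here are inverted with empirical quantiles, sampled values stay within the observed range of each column.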

Code & Models

Repositories

winstonll/SynC

Taxonomy

Topics: Gaussian Processes and Bayesian Inference · demographic modeling and climate adaptation · Time Series Analysis and Forecasting