DEREC-SIMPRO: unlock Language Model benefits to advance Synthesis in Data Clean Room

Tung Sum Thomas Kwok; Chi-hua Wang; Guang Cheng

arXiv:2411.00879 · cs.DB · November 5, 2024

TL;DR

This paper introduces DEREC-SIMPRO, a pipeline and evaluation metrics that enhance multi-table synthetic data generation for data collaboration, addressing privacy concerns and repeated subject issues.

Contribution

It proposes a novel DEREC pre-processing pipeline and SIMPRO evaluation metrics to improve multi-table synthesizers' performance in data collaboration scenarios.

Findings

01

DEREC improves synthetic data fidelity.

02

Multi-table synthesizers outperform single-table methods.

03

The pipeline enables better data collaboration.

Abstract

Data collaboration via Data Clean Room offers value but raises privacy concerns, which can be addressed through synthetic data and multi-table synthesizers. Common multi-table synthesizers fail to perform when subjects occur repeatedly in both tables. This is an urgent yet unresolved problem, since having both tables with repeating subjects is common. To improve performance in this scenario, we present the DEREC 3-step pre-processing pipeline to generalize adaptability of multi-table synthesizers. We also introduce the SIMPRO 3-aspect evaluation metrics, which leverage conditional distribution and large-scale simultaneous hypothesis testing to provide comprehensive feedback on synthetic data fidelity at both column and table levels. Results show that using DEREC improves fidelity, and multi-table synthesizers outperform single-table counterparts in collaboration settings. Together, the…
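The abstract mentions large-scale simultaneous hypothesis testing for column-level fidelity but does not say which procedure SIMPRO uses. As a rough illustration only, the sketch below applies the Benjamini-Hochberg procedure, a standard way to control the false discovery rate when many columns are tested at once. The function name, the column names, and the p-values are hypothetical and are not taken from the paper; each p-value is assumed to come from some per-column test comparing real and synthetic data.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected.

    Standard Benjamini-Hochberg step-up procedure at FDR level alpha.
    This is an illustrative stand-in, not the SIMPRO test itself.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical per-column p-values (real vs. synthetic); made up for illustration.
column_p_values = {"age": 0.001, "income": 0.04, "region": 0.03, "visits": 0.6}

names = list(column_p_values)
flags = benjamini_hochberg([column_p_values[n] for n in names], alpha=0.05)

# Columns whose synthetic distribution is flagged as differing from the real one.
flagged = [n for n, f in zip(names, flags) if f]
print(flagged)
```

With these made-up values only `age` is flagged: `income` and `region` fall below 0.05 individually but do not survive the multiple-testing correction, which is the reason to correct for simultaneous tests when a table has many columns.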

Taxonomy

Topics: Business Process Modeling and Analysis