Two-Stage Data Synthesization: A Statistics-Driven Restricted Trade-off between Privacy and Prediction
Xiaotong Liu, Shao-Bo Lin, Jun Fan, Ding-Xuan Zhou

TL;DR
This paper introduces a two-stage data synthesis method that balances privacy and prediction accuracy: a synthesis-then-hybrid stage that fuses synthetic data with the original data, followed by a kernel ridge regression (KRR)-based synthesis stage. The method is validated through theoretical analysis and real-world applications.
Contribution
The paper proposes a two-stage synthesis strategy that manages the privacy-prediction trade-off by pairing a synthesis-then-hybrid operation with kernel ridge regression (KRR)-based synthesis, a combination presented as novel in data synthesis.
Findings
Achieves a restricted privacy-prediction trade-off.
Guarantees optimal prediction performance.
Demonstrates effectiveness on real-world datasets.
Abstract
Synthetic data have gained increasing attention across various domains, with a growing emphasis on their performance in downstream prediction tasks. However, most existing synthesis strategies focus on maintaining statistical information. Although some studies address prediction performance guarantees, their single-stage synthesis designs make it challenging to balance the privacy requirements that necessitate significant perturbations and the prediction performance that is sensitive to such perturbations. We propose a two-stage synthesis strategy. In the first stage, we introduce a synthesis-then-hybrid strategy, which involves a synthesis operation to generate pure synthetic data, followed by a hybrid operation that fuses the synthetic data with the original data. In the second stage, we present a kernel ridge regression (KRR)-based synthesis strategy, where a KRR model is first…
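The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract is truncated before it finishes describing the second stage. Every concrete choice here is an assumption made for illustration: the Gaussian sampler used as the synthesis operation, the convex combination used as the hybrid operation, the RBF kernel, the use of KRR predictions as synthetic outputs, and all function and parameter names (`synthesize_then_hybrid`, `fit_krr`, `two_stage`, `weight`, `lam`, `gamma`).

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    # Pairwise Gaussian (RBF) kernel between the rows of a and the rows of b.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def fit_krr(x, y, lam=1e-2, gamma=1.0):
    # Standard kernel ridge regression: solve (K + lam * n * I) alpha = y.
    n = x.shape[0]
    k = rbf_kernel(x, x, gamma)
    alpha = np.linalg.solve(k + lam * n * np.eye(n), y)
    return lambda x_new: rbf_kernel(x_new, x, gamma) @ alpha


def synthesize_then_hybrid(x, weight, rng):
    # Synthesis operation (assumed): sample pure synthetic inputs from a
    # Gaussian fitted to the mean and covariance of the original inputs.
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    x_syn = rng.multivariate_normal(mean, cov, size=x.shape[0])
    # Hybrid operation (assumed): convex combination of synthetic and original.
    return weight * x_syn + (1.0 - weight) * x


def two_stage(x, y, weight=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Stage 1: synthesis-then-hybrid on the inputs.
    x_hyb = synthesize_then_hybrid(x, weight, rng)
    # Stage 2 (assumed): fit KRR on the original data, then use its
    # predictions at the hybrid inputs as synthetic outputs.
    predict = fit_krr(x, y)
    return x_hyb, predict(x_hyb)
```

In this sketch, `weight` controls how far the released inputs move away from the originals: `weight=0.0` returns the original inputs unchanged, while larger values apply a stronger perturbation. How the paper actually parameterizes this trade-off is not stated in the text above.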
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Machine Learning and Data Classification · Adversarial Robustness in Machine Learning
