Two-Stage Data Synthesization: A Statistics-Driven Restricted Trade-off between Privacy and Prediction

Xiaotong Liu; Shao-Bo Lin; Jun Fan; Ding-Xuan Zhou

arXiv:2602.08657·cs.LG·February 10, 2026

Open Access

TL;DR

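This paper introduces a two-stage data synthesis method that balances privacy against prediction accuracy. The first stage generates synthetic data and fuses it with the original data (synthesis-then-hybrid); the second stage applies kernel ridge regression (KRR)-based synthesis. The method is supported by theoretical analysis and real-world applications.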

Contribution

The paper proposes a two-stage synthesis strategy that manages the privacy-prediction trade-off by pairing a synthesis-then-hybrid operation with kernel ridge regression (KRR)-based synthesis, a combination presented as novel in data synthesis.

Findings

01. Achieves a restricted privacy-prediction trade-off.

02. Guarantees optimal prediction performance.

03. Demonstrates effectiveness on real-world datasets.

Abstract

Synthetic data have gained increasing attention across various domains, with a growing emphasis on their performance in downstream prediction tasks. However, most existing synthesis strategies focus on maintaining statistical information. Although some studies address prediction performance guarantees, their single-stage synthesis designs make it challenging to balance the privacy requirements that necessitate significant perturbations and the prediction performance that is sensitive to such perturbations. We propose a two-stage synthesis strategy. In the first stage, we introduce a synthesis-then-hybrid strategy, which involves a synthesis operation to generate pure synthetic data, followed by a hybrid operation that fuses the synthetic data with the original data. In the second stage, we present a kernel ridge regression (KRR)-based synthesis strategy, where a KRR model is first…
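The two stages described in the abstract can be illustrated with a minimal sketch. The abstract is truncated and does not specify the synthesis operation, the fusion rule, or how the fitted KRR model is used, so every concrete choice here is an assumption made for illustration only: Gaussian noise as the synthesis operation, a convex combination as the hybrid operation, a Gaussian kernel, and KRR predictions serving as synthetic responses. The names `stage_one`, `stage_two`, `noise_scale`, `mix`, `reg`, and `bandwidth` are hypothetical and do not come from the paper.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dist = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def stage_one(X, noise_scale=0.5, mix=0.5, seed=None):
    """Stage 1 (assumed form): synthesize inputs, then fuse them with the originals."""
    rng = np.random.default_rng(seed)
    # Assumed synthesis operation: additive Gaussian perturbation of the inputs.
    X_syn = X + noise_scale * rng.standard_normal(X.shape)
    # Assumed hybrid operation: convex combination of synthetic and original inputs.
    return mix * X_syn + (1.0 - mix) * X


def fit_krr(X, y, reg=1e-3, bandwidth=1.0):
    """Closed-form KRR coefficients: solve (K + reg * n * I) alpha = y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + reg * n * np.eye(n), y)


def stage_two(X, y, X_hybrid, reg=1e-3, bandwidth=1.0):
    """Stage 2 (assumed form): fit KRR on the original data, then use its
    predictions at the hybrid inputs as synthetic responses."""
    alpha = fit_krr(X, y, reg, bandwidth)
    return gaussian_kernel(X_hybrid, X, bandwidth) @ alpha
```

Under these assumptions, calling `stage_one(X, seed=0)` and then `stage_two(X, y, X_hybrid)` yields a released pair of hybrid inputs and synthetic responses. Larger `noise_scale` or `mix` perturbs the inputs more, which is the kind of knob a privacy-prediction trade-off would act on; the paper's actual operations and guarantees may differ.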

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Machine Learning and Data Classification · Adversarial Robustness in Machine Learning