Synthesizing Realistic Test Data without Breaking Privacy

Laura Plein; Alexi Turcotte; Arina Hallemans; Andreas Zeller

arXiv:2602.05833·cs.LG·February 6, 2026

Open Access

TL;DR

This paper introduces a GAN-inspired method for generating synthetic test data that preserves the statistical properties of the original dataset while reducing vulnerability to membership inference and reconstruction attacks.

Contribution

The authors propose a privacy-preserving data generation approach that pairs a test generator (a fuzzer) with a discriminator and uses the original data only indirectly to produce high-utility synthetic datasets.

Findings

01 High utility of synthetic data demonstrated on four datasets

02 Reduced vulnerability to membership inference and reconstruction attacks

03 Comparable statistical properties to original datasets

Abstract

There is a need for synthetic training and test datasets that replicate the statistical distributions of original datasets without compromising their confidentiality. Much research has leveraged Generative Adversarial Networks (GANs) for synthetic data generation. However, the resulting models are either not accurate enough or remain vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks, because the original data is used in the training process. In this paper, we explore the feasibility of producing a synthetic test dataset with the same statistical properties as the original one while only indirectly leveraging the original data in the generation process. The approach is inspired by GANs, with a generation step and a discrimination step. However, in our approach, we use a test generator (a fuzzer) to produce test data from an input…
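The generation and discrimination steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated and does not specify the fuzzer, its input, or the discriminator. Everything below is an assumption made for illustration, including the function names `generate`, `discriminator_score`, and `synthesize`, the Gaussian "spec" standing in for the fuzzer input, and the choice to let the discriminator see only the mean and standard deviation of the original data as one possible reading of "indirectly leveraging" it.

```python
import random
import statistics


def generate(spec, n, rng):
    """Hypothetical test generator: draw n values from the current spec."""
    return [rng.gauss(spec["mu"], spec["sigma"]) for _ in range(n)]


def discriminator_score(candidate, target_mean, target_std):
    """Hypothetical discriminator: gap between candidate and target statistics.

    It sees only two aggregates of the original data, never its rows
    (an assumption of this sketch). Lower is better.
    """
    mean_gap = abs(statistics.fmean(candidate) - target_mean)
    std_gap = abs(statistics.pstdev(candidate) - target_std)
    return mean_gap + std_gap


def synthesize(target_mean, target_std, n=500, rounds=300, seed=0):
    """Refine the generator spec, keeping only changes the discriminator prefers."""
    rng = random.Random(seed)
    spec = {"mu": 0.0, "sigma": 1.0}  # arbitrary starting spec (assumption)
    best = generate(spec, n, rng)
    best_score = discriminator_score(best, target_mean, target_std)
    for _ in range(rounds):
        # Generation step: perturb the spec and produce a candidate dataset.
        trial = {
            "mu": spec["mu"] + rng.gauss(0.0, 1.0),
            "sigma": max(1e-3, spec["sigma"] + rng.gauss(0.0, 0.5)),
        }
        candidate = generate(trial, n, rng)
        # Discrimination step: accept the candidate only if it scores better.
        score = discriminator_score(candidate, target_mean, target_std)
        if score < best_score:
            spec, best, best_score = trial, candidate, score
    return best, best_score
```

Calling `synthesize(40.0, 12.0)` with hypothetical target statistics returns the best candidate dataset found and its score. The design choice worth noting is that the generator never reads original records; only the discriminator's feedback, computed from aggregates, steers the search. Whether aggregates alone give the privacy properties reported in the paper is not something this sketch establishes.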

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning