Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method
Rémy Chapelle (CESP, EVDG), Bruno Falissard (CESP)

TL;DR
This paper develops and evaluates a multi-step, distance-based synthetic data generation framework that balances privacy protection with data utility, demonstrated on epidemiological data using formal and empirical assessment tools.
Contribution
The paper introduces a refined, multi-step synthetic data generation framework based on classification trees and distance filtering, with novel formal and empirical evaluation methods.
Findings
High privacy protection against attribute disclosure attacks.
Membership disclosure attacks are formally prevented.
Synthetic data retains high distributional similarity with original data.
Abstract
Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing them is recommended to comply with the Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation. Materials and Methods: Our synthesis framework consists of four successive steps,…
Taxonomy
Topics: Data Quality and Management · Data-Driven Disease Surveillance · Health, Environment, Cognitive Aging
