Privacy risk from synthetic data: practical proposals

Gillian M Raab

arXiv:2409.04257 · stat.AP · May 19, 2025 · PSD

Open Access

TL;DR

This paper introduces practical measures of disclosure risk for synthetic data, helping data custodians decide whether to release it and to identify, and possibly exclude, risky records.

Contribution

It proposes and evaluates new disclosure risk measures for synthetic data, with methods implemented in the synthpop R package.

Findings

01. Effective risk measures identified for synthetic data

02. Methods to detect and exclude risky records

03. Insights into disclosure risks from real data sets

Abstract

This paper proposes and compares measures of identity and attribute disclosure risk for synthetic data. Data custodians can use the methods proposed here to inform the decision as to whether to release synthetic versions of confidential data. Different measures are evaluated on two data sets. Insight into the measures is obtained by examining the details of the records identified as posing a disclosure risk. This leads to methods to identify, and possibly exclude, apparently risky records where the identification or attribution would be expected by someone with background knowledge of the data. The methods described are available as part of the synthpop package for R.
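To make the two kinds of risk named in the abstract concrete, here is a minimal Python sketch on toy data. It is an illustration only, not the paper's definitions and not the synthpop API: the function names (`replicated_uniques`, `attribute_disclosure`), the choice of key variables, and the toy records are all assumptions made for this example. Identity risk is approximated as the share of original records that are unique on a set of key variables and whose key combination is also unique in the synthetic data; attribute risk as the share of original records whose key combination appears in the synthetic data with a single target value that equals the true one.

```python
from collections import Counter, defaultdict

def key_of(record, keys):
    """Return the tuple of key-variable values for one record."""
    return tuple(record[k] for k in keys)

def replicated_uniques(original, synthetic, keys):
    """Hypothetical identity-risk measure: original records that are unique
    on `keys` and whose key combination is also unique in the synthetic data."""
    orig_counts = Counter(key_of(r, keys) for r in original)
    syn_counts = Counter(key_of(r, keys) for r in synthetic)
    risky = [r for r in original
             if orig_counts[key_of(r, keys)] == 1
             and syn_counts[key_of(r, keys)] == 1]
    return len(risky) / len(original), risky

def attribute_disclosure(original, synthetic, keys, target):
    """Hypothetical attribute-risk measure: original records whose keys occur
    in the synthetic data with exactly one target value, and that value is
    the record's true value."""
    syn_targets = defaultdict(set)
    for r in synthetic:
        syn_targets[key_of(r, keys)].add(r[target])
    risky = [r for r in original
             if syn_targets.get(key_of(r, keys)) == {r[target]}]
    return len(risky) / len(original), risky

# Toy data, invented for illustration.
original = [
    {"age": "30-39", "sex": "F", "region": "North", "income": "high"},
    {"age": "30-39", "sex": "F", "region": "North", "income": "low"},
    {"age": "40-49", "sex": "M", "region": "South", "income": "high"},
    {"age": "50-59", "sex": "F", "region": "East", "income": "low"},
]
synthetic = [
    {"age": "30-39", "sex": "F", "region": "North", "income": "high"},
    {"age": "40-49", "sex": "M", "region": "South", "income": "high"},
    {"age": "40-49", "sex": "M", "region": "West", "income": "low"},
    {"age": "50-59", "sex": "F", "region": "East", "income": "high"},
]
keys = ["age", "sex", "region"]

identity_rate, identity_risky = replicated_uniques(original, synthetic, keys)
attribute_rate, attribute_risky = attribute_disclosure(
    original, synthetic, keys, target="income")

# One possible response: drop synthetic records whose keys match an
# identity-risky original record before release.
risky_keys = {key_of(r, keys) for r in identity_risky}
filtered = [r for r in synthetic if key_of(r, keys) not in risky_keys]

print(identity_rate, attribute_rate, len(filtered))
```

Inspecting `identity_risky` and `attribute_risky` record by record mirrors the abstract's point that looking at the flagged records themselves is informative: a custodian can judge whether a flagged match is something a person with background knowledge of the data would expect anyway.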

Taxonomy

Topics: Privacy-Preserving Technologies in Data