Practical privacy metrics for synthetic data
Gillian M Raab, Beata Nowok, Chris Dibben

TL;DR
This paper introduces new privacy risk measures for synthetic data, implemented in the R package synthpop, focusing on identity and attribute disclosure risks with practical evaluation on real datasets.
Contribution
It extends the synthpop package to include measures of disclosure risk, specifically RepU and DiSCO, for better privacy assessment of synthetic data.
Findings
RepU and DiSCO effectively measure identity and attribute disclosure risk, respectively.
Some apparent disclosures are due to known data relationships.
The methods help identify and exclude high-risk disclosures.
Abstract
This paper explains how the synthpop package for R has been extended to include functions to calculate measures of identity and attribute disclosure risk for synthetic data that measure risks for the records used to create the synthetic data. The basic function, disclosure, calculates identity disclosure for a set of quasi-identifiers (keys) and attribute disclosure for one variable specified as a target from the same set of keys. The second function, disclosure.summary, is a wrapper for the first and presents summary results for a set of targets. This short paper explains the measures of disclosure risk and documents how they are calculated. We recommend two measures: RepU (replicated uniques) for identity disclosure and DiSCO (Disclosive in Synthetic Correct Original) for attribute disclosure. Both are expressed as a % of the original records and each can be compared to similar…
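The two recommended measures can be illustrated with a minimal Python sketch. This is not the synthpop implementation: the function names rep_u and disco, the toy records, and the exact definitions are assumptions based on the measure names in the abstract. Here RepU is taken to be the percentage of original records whose key combination is unique in both the original and the synthetic data, and DiSCO the percentage of original records whose key combination maps to a single target value in the synthetic data that matches the original record's target.

```python
from collections import Counter, defaultdict


def _key_of(record, keys):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[k] for k in keys)


def rep_u(original, synthetic, keys):
    """Assumed RepU: % of original records whose key combination is
    unique in the original data and also unique in the synthetic data."""
    orig_counts = Counter(_key_of(r, keys) for r in original)
    syn_counts = Counter(_key_of(r, keys) for r in synthetic)
    replicated = sum(
        1
        for r in original
        if orig_counts[_key_of(r, keys)] == 1
        and syn_counts[_key_of(r, keys)] == 1
    )
    return 100.0 * replicated / len(original)


def disco(original, synthetic, keys, target):
    """Assumed DiSCO: % of original records whose key combination appears
    in the synthetic data with exactly one target value (disclosive in
    synthetic) and that value equals the original target (correct)."""
    syn_targets = defaultdict(set)
    for r in synthetic:
        syn_targets[_key_of(r, keys)].add(r[target])
    correct = sum(
        1
        for r in original
        # .get() returns None for keys absent from the synthetic data
        if syn_targets.get(_key_of(r, keys)) == {r[target]}
    )
    return 100.0 * correct / len(original)


# Hypothetical toy data: keys are age and sex, target is income.
original = [
    {"age": 30, "sex": "F", "income": "high"},
    {"age": 30, "sex": "M", "income": "low"},
    {"age": 40, "sex": "F", "income": "low"},
    {"age": 40, "sex": "F", "income": "high"},
]
synthetic = [
    {"age": 30, "sex": "F", "income": "high"},
    {"age": 30, "sex": "M", "income": "high"},
    {"age": 30, "sex": "M", "income": "high"},
    {"age": 40, "sex": "F", "income": "low"},
]

print(rep_u(original, synthetic, ["age", "sex"]))            # 25.0
print(disco(original, synthetic, ["age", "sex"], "income"))  # 50.0
```

In the toy data only the first original record has a key combination that is unique in both datasets, giving 25%. Two original records (the first and third) have keys whose synthetic matches all share a target value equal to their own, giving 50%. Both results are percentages of the original records, matching how the abstract says the measures are expressed.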
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Privacy, Security, and Data Protection · Data Quality and Management
