Health Data in an Open World
Chris Culnane, Benjamin I. P. Rubinstein, Vanessa Teague

TL;DR
This paper demonstrates that individuals in a de-identified open health dataset can be re-identified from a few mundane facts, and examines how reduced precision, statistical perturbation, and related auxiliary datasets affect that risk.
Contribution
It provides an empirical analysis of re-identification risk in an Australian de-identified open health dataset and examines how related datasets increase the accuracy and confidence of re-identification.
Findings
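A few mundane facts often suffice to isolate an individual, and some people can be identified by name from publicly available information.
Reducing the precision of the data, or perturbing it statistically, makes re-identification gradually harder, but at a substantial cost to utility.
Related datasets improve the accuracy and confidence of re-identification.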
Abstract
With the aim of informing sound policy about data sharing and privacy, we describe successful re-identification of patients in an Australian de-identified open health dataset. As in prior studies of similar datasets, a few mundane facts often suffice to isolate an individual. Some people can be identified by name based on publicly available information. Decreasing the precision of the unit-record level data, or perturbing it statistically, makes re-identification gradually harder at a substantial cost to utility. We also examine the value of related datasets in improving the accuracy and confidence of re-identification. Our re-identifications were performed on a 10% sample dataset, but a related open Australian dataset allows us to infer with high confidence that some individuals in the sample have been correctly re-identified. Finally, we examine the combination of the open datasets…
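To make the abstract's central point concrete, the following is a minimal sketch, not the authors' method, of how a handful of known facts can isolate one unit record and how coarsening a field enlarges the set of candidates. The records are entirely synthetic, and the field names (`birth_year`, `sex`, `procedure_year`) and helper functions are hypothetical choices made for illustration only.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

# Entirely synthetic toy records; field names are hypothetical.
RECORDS: List[Record] = [
    {"id": "A", "birth_year": 1971, "sex": "F", "procedure_year": 2010},
    {"id": "B", "birth_year": 1971, "sex": "F", "procedure_year": 2012},
    {"id": "C", "birth_year": 1974, "sex": "F", "procedure_year": 2010},
    {"id": "D", "birth_year": 1971, "sex": "M", "procedure_year": 2010},
    {"id": "E", "birth_year": 1979, "sex": "F", "procedure_year": 2010},
]


def matching_records(records: List[Record], known_facts: Record) -> List[Record]:
    """Return every record consistent with all of the known facts."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in known_facts.items())
    ]


def coarsen_birth_year(records: List[Record]) -> List[Record]:
    """Reduce precision by replacing each birth year with its decade."""
    return [{**r, "birth_year": r["birth_year"] // 10 * 10} for r in records]


# Three mundane facts about one person.
facts = {"birth_year": 1971, "sex": "F", "procedure_year": 2010}

# Against the precise data, the facts single out one record.
exact = matching_records(RECORDS, facts)

# Against the coarsened data, the same person hides among more candidates.
coarse_facts = {**facts, "birth_year": 1970}
coarse = matching_records(coarsen_birth_year(RECORDS), coarse_facts)

print(len(exact), [r["id"] for r in exact])
print(len(coarse), [r["id"] for r in coarse])
```

In this toy data the three facts match exactly one record, while grouping birth years by decade leaves three candidates. Note that a unique match in a sample does not by itself prove a correct re-identification, since the true individual may lie outside the sample; this is why the abstract turns to a related dataset to gain confidence.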
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Ethics in Clinical Research · Data Quality and Management
