Health Data in an Open World

Chris Culnane; Benjamin I. P. Rubinstein; Vanessa Teague

arXiv:1712.05627 · cs.CY · December 18, 2017 · 33 cites

TL;DR

This paper demonstrates that patients in a de-identified Australian open health dataset can be re-identified from a few mundane facts, and examines how data perturbation and auxiliary datasets affect re-identification risk.

Contribution

The paper provides an empirical analysis of re-identification risks in open health data and explores how auxiliary datasets increase the accuracy and confidence of re-identification.

Findings

01. Re-identification is possible from only a few data points.

02. Perturbing the data or reducing its precision makes re-identification gradually harder, but at a substantial cost to utility.

03. Auxiliary datasets significantly improve re-identification confidence.

Abstract

With the aim of informing sound policy about data sharing and privacy, we describe successful re-identification of patients in an Australian de-identified open health dataset. As in prior studies of similar datasets, a few mundane facts often suffice to isolate an individual. Some people can be identified by name based on publicly available information. Decreasing the precision of the unit-record level data, or perturbing it statistically, makes re-identification gradually harder at a substantial cost to utility. We also examine the value of related datasets in improving the accuracy and confidence of re-identification. Our re-identifications were performed on a 10% sample dataset, but a related open Australian dataset allows us to infer with high confidence that some individuals in the sample have been correctly re-identified. Finally, we examine the combination of the open datasets…
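The abstract's two central observations, that a few facts can isolate one record and that coarser data makes isolation harder, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the records are synthetic, and the attribute names, the `coarsen` and `unique_fraction` helpers, and the date-truncation scheme are hypothetical rather than the paper's dataset or method.

```python
from collections import Counter

# Synthetic toy records (hypothetical, NOT taken from the paper's dataset):
# (year_of_birth, gender, event_date as an ISO "YYYY-MM-DD" string)
RECORDS = [
    (1950, "F", "2012-03-14"),
    (1950, "F", "2012-03-20"),
    (1950, "M", "2012-03-14"),
    (1962, "F", "2012-07-02"),
    (1962, "F", "2012-07-09"),
    (1975, "M", "2012-07-02"),
]


def coarsen(record, date_chars):
    """Reduce date precision by keeping only a prefix of the ISO date.

    date_chars=10 keeps the full day, 7 keeps year-month, 4 keeps the year.
    """
    year_of_birth, gender, event_date = record
    return (year_of_birth, gender, event_date[:date_chars])


def unique_fraction(records, date_chars):
    """Fraction of records whose coarsened attributes match no other record."""
    keys = [coarsen(r, date_chars) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def matching_records(records, known_facts, date_chars=10):
    """Records consistent with what an observer knows about one person."""
    target = coarsen(known_facts, date_chars)
    return [r for r in records if coarsen(r, date_chars) == target]


if __name__ == "__main__":
    for label, chars in [("day", 10), ("month", 7)]:
        print(label, unique_fraction(RECORDS, chars))
    # Three known facts leave a single candidate at day precision.
    print(matching_records(RECORDS, (1950, "F", "2012-03-14")))
```

In this toy setting every record is unique at day precision, while truncating dates to the month leaves two candidates for most records. The same truncation also discards the day-level detail an analyst might need, which is the utility cost the abstract refers to.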

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Ethics in Clinical Research · Data Quality and Management