Measuring pattern retention in anonymized data -- where one measure is not enough

Sam Fletcher; Md Zahidul Islam

arXiv:1512.07721 · cs.AI · December 25, 2015

Open Access

TL;DR

This paper introduces new measures to evaluate how well anonymized data retains original patterns, highlighting that prediction accuracy alone is insufficient for comprehensive assessment.

Contribution

The paper proposes a methodology and three measures for evaluating pattern retention in anonymized data, complementing existing accuracy-based metrics.

Findings

1. New measures effectively capture pattern retention
2. Prediction accuracy alone is inadequate for data similarity assessment
3. Methodology enhances evaluation of anonymized data quality

Abstract

In this paper, we explore how modifying data to preserve privacy affects the quality of the patterns discoverable in the data. For any analysis of modified data to be worth doing, the data must be as close to the original as possible. Therein lies a problem -- how does one make sure that modified data still contains the information it had before modification? This question is not the same as asking if an accurate classifier can be built from the modified data. Often in the literature, the prediction accuracy of a classifier made from modified (anonymized) data is used as evidence that the data is similar to the original. We demonstrate that this is not the case, and we propose a new methodology for measuring the retention of the patterns that existed in the original data. We then use our methodology to design three measures that can be easily implemented, each measuring aspects of the…
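The abstract's central claim, that similar prediction accuracy does not imply the same patterns were retained, can be illustrated with a toy sketch. This is not the paper's methodology or any of its three measures. The datasets, the perturbation, and the helper names (`best_stump`, `predict`, `accuracy`) are all hypothetical, and a single-attribute rule stands in for a real classifier.

```python
from collections import Counter

def best_stump(records):
    """Learn the single-attribute rule with the highest training accuracy.

    Each record is a tuple (attr_0, ..., attr_k, label).
    Returns (attribute_index, {attribute_value: predicted_label}).
    """
    best = None
    for i in range(len(records[0]) - 1):
        # Count labels for each value of attribute i.
        by_value = {}
        for r in records:
            by_value.setdefault(r[i], Counter())[r[-1]] += 1
        # Predict the majority label for each value.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(rule[r[i]] == r[-1] for r in records)
        if best is None or correct > best[0]:
            best = (correct, i, rule)
    return best[1], best[2]

def predict(stump, record):
    attr, rule = stump
    return rule[record[attr]]

def accuracy(stump, records):
    return sum(predict(stump, r) == r[-1] for r in records) / len(records)

# Hypothetical original data (a, b, label): a always matches the label,
# b matches it in 7 of 8 records.
original = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 0),
            (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)]

# Hypothetical anonymized copy: attribute a flipped in the first and
# fifth records, so a now matches the label in only 6 of 8 records.
anonymized = [(1, 0, 0), (0, 0, 0), (0, 0, 0), (0, 1, 0),
              (0, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1)]

# Hypothetical held-out records used to score both classifiers.
test = [(0, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 0), (0, 0, 0), (1, 1, 1)]

stump_orig = best_stump(original)
stump_anon = best_stump(anonymized)

acc_orig = accuracy(stump_orig, test)
acc_anon = accuracy(stump_anon, test)
# Fraction of test records on which the two classifiers make the same prediction.
agreement = sum(predict(stump_orig, r) == predict(stump_anon, r) for r in test) / len(test)

print("attribute used (original):  ", stump_orig[0])
print("attribute used (anonymized):", stump_anon[0])
print("accuracy (original):  ", acc_orig)
print("accuracy (anonymized):", acc_anon)
print("prediction agreement: ", agreement)
```

In this sketch the two classifiers score the same accuracy on the held-out records, yet they rely on different attributes and disagree on some individual predictions, which is the kind of difference an accuracy comparison alone would not reveal.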

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Data Mining Algorithms and Applications · Imbalanced Data Classification Techniques