Toward sensitive document release with privacy guarantees

David Sánchez; Montserrat Batet

arXiv:1701.00436·cs.CR·January 3, 2017

TL;DR

This paper introduces a flexible semantic privacy model for automatic text document sanitization, balancing privacy guarantees with data utility, and provides scalable algorithms validated through empirical experiments.

Contribution

The paper proposes the (C, g(C))-sanitization model, which extends previous semantic privacy models with a flexible privacy-utility trade-off and scalable algorithms.

Findings

1. Improved privacy-utility balance in document sanitization
2. Efficient algorithms for scalable implementation
3. Empirical validation demonstrating practical accuracy

Abstract

Privacy has become a serious concern for modern Information Societies. The sensitive nature of much of the data that are exchanged daily or released to untrusted parties requires that responsible organizations undertake appropriate privacy protection measures. Nowadays, many of these data are texts (e.g., emails, messages posted in social media, healthcare outcomes) that, because of their unstructured and semantic nature, constitute a challenge for automatic data protection methods. In fact, textual documents are usually protected manually, in a process known as document redaction or sanitization. To do so, human experts identify sensitive terms (i.e., terms that may reveal identities and/or confidential information) and protect them accordingly (e.g., via removal or, preferably, generalization). To relieve experts from this burdensome task, in a previous work we introduced the…
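
The manual workflow the abstract describes (identify sensitive terms, then remove or generalize them) can be illustrated with a toy Python sketch. This is not the paper's (C, g(C))-sanitization model or its algorithms: the term-to-generalization dictionary, the `sanitize` function, and the `[REDACTED]` placeholder are all hypothetical, and a real system would have to detect risky terms automatically rather than rely on a hand-written list.

```python
import re

# Hypothetical, hand-written map from sensitive terms to safer generalizations.
# A value of None means the term has no acceptable generalization and is removed.
GENERALIZATIONS = {
    "HIV": "infection",
    "lung cancer": "disease",
    "John Smith": None,
}


def sanitize(text, generalizations=GENERALIZATIONS, placeholder="[REDACTED]"):
    """Replace each listed term with its generalization, or with a placeholder."""
    if not generalizations:
        return text

    # Try longer terms first so multi-word terms win over shorter ones.
    terms = sorted(generalizations, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {t.lower(): g for t, g in generalizations.items()}

    def protect(match):
        replacement = lookup[match.group(0).lower()]
        return placeholder if replacement is None else replacement

    return pattern.sub(protect, text)


if __name__ == "__main__":
    print(sanitize("John Smith was treated for lung cancer and HIV."))
```

Generalizing ("lung cancer" to "disease") keeps more of the document's meaning than outright removal, which is why the abstract calls generalization preferable.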
