Toward sensitive document release with privacy guarantees
David Sánchez, Montserrat Batet

TL;DR
This paper introduces a flexible semantic privacy model for automatic text document sanitization, balancing privacy guarantees with data utility, and provides scalable algorithms validated through empirical experiments.
Contribution
It proposes the (C, g(C))-sanitization model, enhancing previous semantic privacy models with flexible privacy-utility trade-offs and scalable algorithms.
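This summary does not say how the model decides which terms need protection. The sketch below is one plausible, simplified reading and not the authors' algorithm. It assumes a term t is too disclosive for a sensitive entity c when its pointwise mutual information with c exceeds the information content of the chosen generalization g(c). The probabilities, term names, and function names are all invented for illustration.

```python
import math

def information_content(p):
    """Information content in bits: IC(x) = -log2 p(x)."""
    return -math.log2(p)

def pmi(p_joint, p_x, p_y):
    """Pointwise mutual information in bits: log2(p(x, y) / (p(x) * p(y)))."""
    return math.log2(p_joint / (p_x * p_y))

def flag_disclosive_terms(terms, c, g_c, prob, joint_with_c):
    """Return the terms whose PMI with c exceeds IC(g(c)).

    Assumption (not from the source summary): the generalization g(c)
    sets the maximum amount of information about c that a term may reveal.
    """
    threshold = information_content(prob[g_c])
    flagged = []
    for t in terms:
        p_joint = joint_with_c.get(t, 0.0)
        if p_joint == 0.0:
            continue  # no co-occurrence with c in the toy data: treated as harmless
        if pmi(p_joint, prob[c], prob[t]) > threshold:
            flagged.append(t)
    return flagged

# Hypothetical occurrence probabilities (in practice estimated from a large corpus).
prob = {
    "HIV": 0.001,
    "chronic disease": 0.02,
    "antiretroviral": 0.0005,
    "hospital": 0.05,
    "patient": 0.1,
}
# Hypothetical joint probabilities of each term co-occurring with "HIV".
joint_with_hiv = {
    "HIV": 0.001,
    "antiretroviral": 0.0004,
    "hospital": 0.0001,
    "patient": 0.0002,
}

document_terms = ["patient", "HIV", "antiretroviral", "hospital"]
flagged = flag_disclosive_terms(
    document_terms, "HIV", "chronic disease", prob, joint_with_hiv
)
print(flagged)
```

Under this reading, a more abstract g(c) has lower information content, so the threshold drops and more terms are flagged. That is one way to picture the privacy-utility trade-off the model is described as offering.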
Findings
Improved privacy-utility balance in document sanitization
Efficient algorithms for scalable implementation
Empirical validation demonstrating practical accuracy
Abstract
Privacy has become a serious concern for modern Information Societies. The sensitive nature of much of the data that are daily exchanged or released to untrusted parties requires that responsible organizations undertake appropriate privacy protection measures. Nowadays, much of these data are texts (e.g., emails, messages posted in social media, healthcare outcomes, etc.) that, because of their unstructured and semantic nature, constitute a challenge for automatic data protection methods. In fact, textual documents are usually protected manually, in a process known as document redaction or sanitization. To do so, human experts identify sensitive terms (i.e., terms that may reveal identities and/or confidential information) and protect them accordingly (e.g., via removal or, preferably, generalization). To relieve experts from this burdensome task, in a previous work we introduced the…
