TL;DR
This paper establishes formal privacy guarantees for text de-identification methods using differential privacy and compares their impact on machine learning task performance, highlighting the robustness of word-by-word replacement strategies.
Contribution
It introduces formal differential privacy guarantees for text transformations and evaluates their utility in NLP tasks, comparing simple redaction and deep learning-based replacements.
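For reference, the textbook definition of ε-differential privacy is reproduced below. The notation is the standard one and is an assumption here, not taken from the paper, whose exact formulation for text transformations may differ: a randomized mechanism $\mathcal{M}$, neighboring inputs $D$ and $D'$ that differ in one record, any set of outputs $S$, and a privacy parameter $\varepsilon \ge 0$.

```latex
\Pr\left[\mathcal{M}(D) \in S\right] \;\le\; e^{\varepsilon} \, \Pr\left[\mathcal{M}(D') \in S\right]
```

A smaller $\varepsilon$ means the output distribution changes less when a single record changes, which corresponds to a stronger guarantee.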
Findings
Word-by-word replacement is the most robust of the compared masking strategies at preserving downstream task performance.
Differential privacy guarantees can be formalized for text de-identification.
Sophisticated replacement methods outperform redaction in privacy-utility trade-offs.
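The difference between simple redaction and word-by-word replacement can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the `REPLACEMENTS` lookup, the `<REDACTED>` placeholder, and the whitespace tokenization are not the paper's method, and a fixed lookup like this carries no formal privacy guarantee on its own.

```python
# Hypothetical lookup of sensitive tokens and benign stand-ins (illustrative only).
REPLACEMENTS = {"alice": "person", "berlin": "city", "555-0199": "number"}
PLACEHOLDER = "<REDACTED>"  # hypothetical redaction marker


def redact(text, sensitive=REPLACEMENTS):
    """Replace every sensitive token with the same placeholder."""
    tokens = text.split()
    return " ".join(PLACEHOLDER if t.lower() in sensitive else t for t in tokens)


def replace_word_by_word(text, mapping=REPLACEMENTS):
    """Swap each sensitive token for a benign alternative, one token at a time."""
    tokens = text.split()
    return " ".join(mapping.get(t.lower(), t) for t in tokens)


utterance = "call alice in berlin at 555-0199"
redacted = redact(utterance)
replaced = replace_word_by_word(utterance)
print(redacted)
print(replaced)
```

Redaction collapses all sensitive tokens into one symbol, while word-by-word replacement keeps a plausible token in each position, so the resulting text stays closer in form to natural language.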
Abstract
Machine Learning approaches to Natural Language Processing tasks benefit from a comprehensive collection of real-life user data. At the same time, there is a clear need for protecting the privacy of the users whose data is collected and processed. For text collections, such as transcripts of voice interactions or patient records, replacing sensitive parts with benign alternatives can provide de-identification. However, how much privacy do such text transformations actually guarantee, and are the resulting texts still useful for machine learning? In this paper, we derive formal privacy guarantees for general text transformation-based de-identification methods on the basis of Differential Privacy. We also measure the effect that different ways of masking private information in dialog transcripts have on a subsequent machine learning task. To this end, we formulate different…
