Unsupervised Text Deidentification

John X. Morris; Justin T. Chiu; Ramin Zabih; Alexander M. Rush

arXiv:2210.11528·cs.CL·October 24, 2022

Open Access · 1 Repo

TL;DR

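This paper introduces an unsupervised text deidentification method that masks words leaking personally identifying information. A trained reidentification model guides the masking, and a K-anonymity-style criterion requires the correct profile to fall to a minimum reidentification rank. Compared with unsupervised baselines, the method deidentifies documents more completely while removing fewer words.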

Contribution

It presents a novel unsupervised deidentification technique that requires no labeled data and masks the words a reidentification model could use to identify the subject of a document.

Findings

01. Outperforms unsupervised baselines in deidentification completeness

02. Removes fewer words while maintaining privacy

03. Eliminates more subtle identifying information than named entity approaches

Abstract

Deidentification seeks to anonymize textual data prior to distribution. Automatic deidentification primarily uses supervised named entity recognition from human-labeled data points. We propose an unsupervised deidentification method that masks words that leak personally-identifying information. The approach utilizes a specially trained reidentification model to identify individuals from redacted personal documents. Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank for the correct profile of the document. To evaluate this approach, we consider the task of deidentifying Wikipedia Biographies, and evaluate using an adversarial reidentification metric. Compared to a set of unsupervised baselines, our approach deidentifies documents more completely while removing fewer words. Qualitatively, we see that the approach eliminates many…
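The rank-based stopping rule described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the word-overlap scorer is a toy stand-in for the trained reidentification model, the greedy word selection is an assumption, and every name here (`overlap_score`, `true_rank`, `margin`, `deidentify`, `PROFILES`) is hypothetical. Rank is counted as the number of profiles scoring at least as high as the correct one, so tied profiles are treated as indistinguishable.

```python
MASK = "<mask>"

# Hypothetical profile collection: each profile is a set of known words.
PROFILES = {
    "ada": {"ada", "lovelace", "english", "mathematician", "writer", "analytical", "engine"},
    "alan": {"alan", "turing", "english", "mathematician", "computer", "scientist"},
    "grace": {"grace", "hopper", "american", "computer", "scientist", "navy"},
    "emmy": {"emmy", "noether", "german", "mathematician", "algebra"},
}

def overlap_score(tokens, profile_vocab):
    """Toy stand-in for a reidentification model: count unmasked tokens found in a profile."""
    return sum(1 for t in tokens if t != MASK and t in profile_vocab)

def score_all(tokens, profiles):
    return {pid: overlap_score(tokens, vocab) for pid, vocab in profiles.items()}

def true_rank(tokens, profiles, true_id):
    """Number of profiles scoring at least as high as the true one (ties count)."""
    scores = score_all(tokens, profiles)
    return sum(1 for s in scores.values() if s >= scores[true_id])

def margin(tokens, profiles, true_id):
    """Mean score of the other profiles minus the true profile's score."""
    scores = score_all(tokens, profiles)
    others = [s for pid, s in scores.items() if pid != true_id]
    return sum(others) / len(others) - scores[true_id]

def deidentify(text, profiles, true_id, k):
    """Greedily mask words until the true profile's rank is at least k."""
    tokens = text.lower().split()
    while true_rank(tokens, profiles, true_id) < k:
        candidates = [i for i, t in enumerate(tokens) if t != MASK]
        if not candidates:
            break  # nothing left to mask

        def after_masking(i):
            # Evaluate the document as it would look with token i masked.
            trial = tokens[:i] + [MASK] + tokens[i + 1:]
            return (true_rank(trial, profiles, true_id), margin(trial, profiles, true_id))

        # Prefer the mask that raises the rank most; break ties by margin.
        best = max(candidates, key=after_masking)
        tokens[best] = MASK
    return tokens

redacted = deidentify(
    "Ada Lovelace was an English mathematician known for the analytical engine",
    PROFILES, "ada", k=3,
)
print(" ".join(redacted))
```

In this toy setting the loop stops as soon as two other profiles match the redacted text at least as well as the correct one, which is why a generic word such as "mathematician" can survive while the name and other distinguishing words are masked.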

Code & Models

Repositories

jxmorris12/unsupervised-text-deidentification (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Wikis in Education and Collaboration