Bootstrapping Text Anonymization Models with Distant Supervision

Anthi Papadopoulou; Pierre Lison; Lilja Øvrelid; Ildikó Pilán

arXiv:2205.06895·cs.CL·May 17, 2022·1 cites

TL;DR

This paper introduces a method to train text anonymization models using distant supervision from knowledge graphs, eliminating the need for manual labeling and enabling scalable privacy-preserving text processing.

Contribution

It presents a novel approach that leverages knowledge graphs for automatic annotation to bootstrap text anonymization models without manual data labeling.

Findings

01. Effective training of anonymization models using knowledge graph annotations.

02. Challenges due to noise and incompleteness in knowledge graphs.

03. Multiple valid anonymization solutions can exist for the same text.

Abstract

We propose a novel method to bootstrap text anonymization models based on distant supervision. Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be publicly available about various individuals. This knowledge graph is employed to automatically annotate text documents including personal data about a subset of those individuals. More precisely, the method determines which text spans ought to be masked in order to guarantee $k$-anonymity, assuming an adversary with access to both the text documents and the background information expressed in the knowledge graph. The resulting collection of labeled documents is then used as training data to fine-tune a pre-trained language model for text anonymization. We illustrate this approach using a knowledge graph extracted from Wikidata and short biographical…
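
To make the $k$-anonymity criterion in the abstract concrete, here is a minimal Python sketch. It is not the authors' algorithm: the toy knowledge graph, the attribute strings, the helper names `candidates` and `spans_to_mask`, and the greedy selection strategy are all assumptions made for illustration. The idea is that each text span is linked to one attribute in the knowledge graph, and spans are masked until at least k individuals in the graph remain consistent with what is still visible.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical toy knowledge graph: person -> publicly known attributes.
KnowledgeGraph = Dict[str, FrozenSet[str]]


def candidates(kg: KnowledgeGraph, visible: Set[str]) -> List[str]:
    """Return every individual whose known attributes include all visible ones."""
    return [person for person, attrs in kg.items() if visible <= attrs]


def spans_to_mask(kg: KnowledgeGraph, span_attrs: List[str], k: int) -> List[str]:
    """Greedily choose spans to mask until at least k individuals match.

    The greedy heuristic is an assumption of this sketch; it is not
    guaranteed to return the smallest possible set of masked spans.
    """
    visible = set(span_attrs)
    masked: List[str] = []
    while visible and len(candidates(kg, visible)) < k:
        # Mask the span whose removal enlarges the candidate set the most.
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(visible),
                   key=lambda a: len(candidates(kg, visible - {a})))
        visible.remove(best)
        masked.append(best)
    return masked


# Invented example data, not taken from Wikidata.
toy_kg: KnowledgeGraph = {
    "alice": frozenset({"born:Oslo", "occupation:physicist", "award:X"}),
    "bob": frozenset({"born:Oslo", "occupation:physicist"}),
    "carol": frozenset({"born:Oslo", "occupation:painter"}),
}
alice_spans = ["born:Oslo", "occupation:physicist", "award:X"]

print(spans_to_mask(toy_kg, alice_spans, k=2))  # ['award:X']
```

With k=2, masking the award alone leaves two matching individuals. A different selection strategy could return a different but equally valid set of spans, which is consistent with the finding that several valid anonymization solutions can exist for the same text.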

Code & Models

Repositories

anthipapa/textanonymization (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection