Bootstrapping Text Anonymization Models with Distant Supervision
Anthi Papadopoulou, Pierre Lison, Lilja Øvrelid, Ildikó Pilán

TL;DR
This paper introduces a method to train text anonymization models using distant supervision from knowledge graphs, eliminating the need for manual labeling and enabling scalable privacy-preserving text processing.
Contribution
It presents a novel approach that leverages knowledge graphs for automatic annotation to bootstrap text anonymization models without manual data labeling.
Findings
Effective training of anonymization models using knowledge graph annotations.
Challenges due to noise and incompleteness in knowledge graphs.
Multiple valid anonymization solutions can exist for the same text.
Abstract
We propose a novel method to bootstrap text anonymization models based on distant supervision. Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be publicly available about various individuals. This knowledge graph is employed to automatically annotate text documents including personal data about a subset of those individuals. More precisely, the method determines which text spans ought to be masked in order to guarantee k-anonymity, assuming an adversary with access to both the text documents and the background information expressed in the knowledge graph. The resulting collection of labeled documents is then used as training data to fine-tune a pre-trained language model for text anonymization. We illustrate this approach using a knowledge graph extracted from Wikidata and short biographical…
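The masking step described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a toy, hand-written knowledge graph (`KG`), treats each text span as an exact fact string, and uses a simple greedy heuristic. The names `KG`, `matching_individuals`, and `spans_to_mask` are hypothetical and chosen only for this illustration. The idea shown is that spans are masked until the spans left visible are compatible with at least k individuals in the knowledge graph.

```python
from typing import Dict, List, Set

# Hypothetical toy knowledge graph: individual -> publicly known facts.
# A real setup would draw these from a resource such as Wikidata.
KG: Dict[str, Set[str]] = {
    "alice": {"physicist", "born in Oslo", "born 1970"},
    "bob": {"physicist", "born in Oslo", "born 1982"},
    "carol": {"physicist", "born in Bergen", "born 1970"},
    "dave": {"painter", "born in Oslo", "born 1970"},
}


def matching_individuals(kg: Dict[str, Set[str]], visible: Set[str]) -> Set[str]:
    """Return the individuals whose known facts include every visible span."""
    return {name for name, facts in kg.items() if visible <= facts}


def spans_to_mask(kg: Dict[str, Set[str]], spans: List[str], k: int) -> List[str]:
    """Greedily mask spans until at least k individuals match what remains.

    Assumption: a greedy heuristic for illustration only; it does not
    guarantee a minimal set of masked spans.
    """
    visible = set(spans)
    masked: List[str] = []
    while visible and len(matching_individuals(kg, visible)) < k:
        # Mask the span whose removal yields the largest candidate set.
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(
            sorted(visible),
            key=lambda s: len(matching_individuals(kg, visible - {s})),
        )
        visible.remove(best)
        masked.append(best)
    return masked


if __name__ == "__main__":
    document_spans = ["physicist", "born in Oslo", "born 1970"]
    print(spans_to_mask(KG, document_spans, k=2))
    print(spans_to_mask(KG, document_spans, k=3))
```

In this toy graph, masking any one of the three spans is enough for k = 2, so the greedy choice is only one of several valid answers. The masked spans would then serve as labels for fine-tuning a sequence-labeling model, a step not shown here.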
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Hate Speech and Cyberbullying Detection
