Named Entity Swapping for Metadata Anonymization in a Text Corpus

Jan Greve; Lukas Sablica

arXiv:2505.21128·stat.AP·May 28, 2025

Open Access

TL;DR

This paper presents a named entity swapping technique for anonymizing a text corpus. It aims to prevent large language models from identifying the metadata associated with each text while maintaining data utility.

Contribution

It introduces a novel, flexible anonymization method using embedding-based entity swapping that balances data utility and privacy in text datasets.

Findings

01 Effective in preventing company name disclosure in earnings call transcripts

02 Allows calibration of anonymization level to balance utility and privacy

03 Utilizes embeddings for flexible and customizable anonymization

Abstract

This work introduces an anonymization scheme for a corpus of texts to safeguard metadata from disclosure. It specifically aims to prevent large language models from identifying metadata associated with texts, thereby avoiding their influence on query responses. The core mechanism is called named entity swapping, a technique inspired by data swapping in statistical disclosure control. Our method randomly selects pairs of semantically similar substrings from different texts based on the similarity of their embedding vectors and interchanges some named entities between them. This prevents certain combinations of named entities from being uniquely associated with the metadata of individual texts. Our approach offers two key advantages. First, it enables users to determine the optimal level of anonymization that balances data utility and data risk through a calibration of several key…
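
To make the swapping step concrete, here is a minimal Python sketch of the idea described in the abstract: pick pairs of similar substrings from different texts and interchange a named entity between them. It is not the authors' implementation. The bag-of-words `embed` function is a toy stand-in for a real embedding model, the entity annotations are assumed to be given in advance, and the names `swap_named_entities`, `threshold`, and `n_swaps`, as well as the fictional companies "Acme" and "Globex", are hypothetical choices made for illustration.

```python
import math
import random
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def swap_named_entities(substrings, threshold=0.5, n_swaps=1, seed=0):
    """Swap one entity between randomly chosen pairs of similar substrings.

    substrings: list of dicts with keys "doc", "text", "entities".
    threshold and n_swaps are hypothetical calibration knobs.
    Returns new dicts; the input list is left unchanged.
    """
    rng = random.Random(seed)
    vecs = [embed(s["text"]) for s in substrings]

    # Candidate pairs: different source texts, both contain an entity,
    # and the embeddings are at least `threshold` similar.
    candidates = [
        (i, j)
        for i in range(len(substrings))
        for j in range(i + 1, len(substrings))
        if substrings[i]["doc"] != substrings[j]["doc"]
        and substrings[i]["entities"]
        and substrings[j]["entities"]
        and cosine(vecs[i], vecs[j]) >= threshold
    ]
    rng.shuffle(candidates)

    out = [dict(s) for s in substrings]
    used, done = set(), 0
    for i, j in candidates:
        if done == n_swaps:
            break
        if i in used or j in used:
            continue  # each substring takes part in at most one swap
        a = rng.choice(out[i]["entities"])
        b = rng.choice(out[j]["entities"])
        out[i]["text"] = out[i]["text"].replace(a, b)
        out[j]["text"] = out[j]["text"].replace(b, a)
        out[i]["entities"] = [b if e == a else e for e in out[i]["entities"]]
        out[j]["entities"] = [a if e == b else e for e in out[j]["entities"]]
        used.update((i, j))
        done += 1
    return out


corpus = [
    {"doc": "call_A", "text": "Acme reported strong quarterly revenue growth",
     "entities": ["Acme"]},
    {"doc": "call_B", "text": "Globex reported strong quarterly revenue growth",
     "entities": ["Globex"]},
    {"doc": "call_B", "text": "The weather was pleasant", "entities": []},
]
swapped = swap_named_entities(corpus, threshold=0.5, n_swaps=1, seed=0)
for s in swapped:
    print(s["doc"], "|", s["text"])
```

In this sketch, raising `threshold` restricts swaps to closer matches, which should disturb the text less, while raising `n_swaps` performs more swaps and so weakens the link between entity combinations and a single text's metadata. How the paper's own calibration parameters are defined is not visible in the truncated abstract, so these two knobs are only illustrative.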

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Access Control and Trust