RAT-Bench: A Comprehensive Benchmark for Text Anonymization
Nataša Krčo, Zexi Yao, Matthieu Meeus, Yves-Alexandre de Montjoye

TL;DR
RAT-Bench is a new benchmark for evaluating text anonymization tools based on re-identification risk, revealing current limitations and guiding future improvements for privacy-preserving language data handling.
Contribution
The paper introduces RAT-Bench, a comprehensive benchmark for assessing text anonymization tools' effectiveness against re-identification risks across multiple languages and domains.
Findings
LLM-based anonymizers offer better privacy-utility trade-offs than NER-based tools.
Current tools struggle with non-standard and indirect identifiers.
LLM anonymizers perform well across languages but are computationally intensive.
Abstract
Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report…
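To make the notion of re-identification risk concrete, the sketch below scores a single anonymized text by how many people in a reference population still match the attributes an attacker inferred correctly. It is only an illustration under stated assumptions: the toy `POPULATION`, the attribute names, and the `1 / k` scoring rule are hypothetical and are not taken from the paper, whose exact metric is not given in the truncated abstract above.

```python
from typing import Dict, List

# Hypothetical toy population; a real benchmark would draw on demographic statistics.
POPULATION: List[Dict[str, str]] = [
    {"age_band": "30-39", "state": "CA", "occupation": "nurse"},
    {"age_band": "30-39", "state": "CA", "occupation": "teacher"},
    {"age_band": "30-39", "state": "NY", "occupation": "nurse"},
    {"age_band": "40-49", "state": "CA", "occupation": "nurse"},
]


def anonymity_set_size(inferred: Dict[str, str], population: List[Dict[str, str]]) -> int:
    """Count the people who match every attribute the attacker inferred correctly."""
    return sum(
        all(person.get(key) == value for key, value in inferred.items())
        for person in population
    )


def reidentification_risk(inferred: Dict[str, str], population: List[Dict[str, str]]) -> float:
    """Assumed scoring rule: 1 / k, where k is the anonymity set size.

    Returns 0.0 when nobody in the population matches the inferred attributes.
    """
    k = anonymity_set_size(inferred, population)
    return 1.0 / k if k > 0 else 0.0


if __name__ == "__main__":
    # Attributes that survived anonymization and were recovered by the attacker.
    leaked = {"age_band": "30-39", "state": "CA"}
    print(anonymity_set_size(leaked, POPULATION))
    print(reidentification_risk(leaked, POPULATION))
```

Under this assumed rule, each additional indirect identifier that survives anonymization can shrink the anonymity set, which is why removing names alone may leave the risk high.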
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Authorship Attribution and Profiling
