RAT-Bench: A Comprehensive Benchmark for Text Anonymization

Nataša Krčo; Zexi Yao; Matthieu Meeus; Yves-Alexandre de Montjoye

arXiv:2602.12806·cs.CL·February 16, 2026

TL;DR

RAT-Bench is a benchmark that evaluates text anonymization tools by re-identification risk rather than by their ability to remove specific identifiers, exposing the limitations of current tools.

Contribution

The paper introduces RAT-Bench, a comprehensive benchmark for assessing text anonymization tools' effectiveness against re-identification risks across multiple languages and domains.

Findings

1. LLM-based anonymizers offer a better privacy-utility trade-off than NER-based tools.

2. Current tools struggle with direct identifiers written in non-standard ways and with indirect identifiers.

3. LLM-based anonymizers perform well across languages but are computationally intensive.

Abstract

Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report…
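
To illustrate why indirect identifiers matter for re-identification, here is a minimal Python sketch. It is not the paper's method or metric (the abstract above is truncated before the reported quantity is named). The toy population, the attribute names (`sex`, `birth_year`, `zip3`), and the 1/k risk definition are all assumptions chosen for illustration: the more attributes an attacker infers from anonymized text, the smaller the set of people consistent with them.

```python
from typing import Dict, List

Person = Dict[str, str]

# Hypothetical toy population; a real analysis would use demographic statistics.
TOY_POPULATION: List[Person] = [
    {"sex": "F", "birth_year": "1985", "zip3": "021"},
    {"sex": "F", "birth_year": "1985", "zip3": "100"},
    {"sex": "M", "birth_year": "1985", "zip3": "021"},
    {"sex": "F", "birth_year": "1990", "zip3": "021"},
]


def anonymity_set_size(population: List[Person], inferred: Dict[str, str]) -> int:
    """Count people consistent with every attribute the attacker inferred."""
    return sum(
        all(person.get(attr) == value for attr, value in inferred.items())
        for person in population
    )


def reidentification_risk(population: List[Person], inferred: Dict[str, str]) -> float:
    """Assumed risk definition: 1/k for an anonymity set of size k, 0 if empty."""
    k = anonymity_set_size(population, inferred)
    return 1.0 / k if k else 0.0


if __name__ == "__main__":
    # Each additional inferred attribute narrows the set of candidates.
    for inferred in (
        {"sex": "F"},
        {"sex": "F", "birth_year": "1985"},
        {"sex": "F", "birth_year": "1985", "zip3": "021"},
    ):
        print(sorted(inferred), reidentification_risk(TOY_POPULATION, inferred))
```

In this toy setting, a text from which only sex can be inferred leaves three candidates, while a text leaking all three attributes singles out one person even though no name appears.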

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Authorship Attribution and Profiling