The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization

Ildikó Pilán; Pierre Lison; Lilja Øvrelid; Anthi Papadopoulou; David Sánchez; Montserrat Batet

arXiv:2202.00443 · cs.CL · July 4, 2022 · 5 cites

Open Access · 2 Repos · 2 Datasets

TL;DR

This paper introduces TAB, a comprehensive benchmark and evaluation framework for text anonymization, featuring an annotated corpus of court cases and metrics to assess privacy protection and utility preservation.

Contribution

It provides the first open-source, annotated corpus specifically designed for evaluating text anonymization methods, along with tailored evaluation metrics.

Findings

01 Baseline models demonstrate varying effectiveness in privacy protection.

02 The benchmark enables systematic comparison of anonymization techniques.

03 Evaluation metrics reveal trade-offs between privacy and utility.

Abstract

We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared to previous work,…
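To make the annotation layers named in the abstract more concrete, here is a minimal sketch of how one annotated mention might be represented and used to mask a text. It is an illustration under stated assumptions, not the corpus's actual schema or the authors' method: the `Mention` class, its field names, the label values ("PERSON", "direct", "quasi", "none"), the `mask` helper, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for one annotated span (not the real TAB schema)."""
    start: int            # character offset where the span begins (inclusive)
    end: int              # character offset where the span ends (exclusive)
    category: str         # semantic category (illustrative label)
    identifier_type: str  # "direct", "quasi", or "none" (illustrative values)
    confidential: Optional[str] = None  # confidential attribute, if any
    entity_id: Optional[str] = None     # shared by co-referring mentions


def mask(text: str, mentions: List[Mention], placeholder: str = "***") -> str:
    """Replace every mention flagged as a direct or quasi identifier.

    Assumes the spans do not overlap.
    """
    to_mask = [m for m in mentions if m.identifier_type in ("direct", "quasi")]
    # Work right to left so earlier offsets stay valid after each replacement.
    for m in sorted(to_mask, key=lambda m: m.start, reverse=True):
        text = text[:m.start] + placeholder + text[m.end:]
    return text


# Invented example sentence; offsets were counted by hand for this string.
text = "Mr Smith lodged the application in Oslo in 1998."
mentions = [
    Mention(3, 8, "PERSON", "direct", entity_id="e1"),
    Mention(35, 39, "LOC", "quasi", entity_id="e2"),
    Mention(43, 47, "DATETIME", "none", entity_id="e3"),
]
masked = mask(text, mentions)
print(masked)  # Mr *** lodged the application in *** in 1998.
```

Keeping the identifier type separate from the semantic category lets a masking decision depend on re-identification risk rather than on the entity label alone, which is the kind of distinction a privacy-oriented annotation scheme needs.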

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Privacy, Security, and Data Protection · Freedom of Expression and Defamation