PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

Pritesh Jha

arXiv:2604.15776·cs.CL·April 20, 2026

TL;DR

PIIBench is a unified benchmark corpus for PII detection that consolidates ten public datasets under a single annotation scheme. Every baseline system evaluated on it scores below 0.14 span-level F1.

Contribution

The paper introduces PIIBench, a large standardized PII corpus, together with a normalization pipeline and baseline evaluations. It targets the fragmentation of existing PII detection resources across incompatible annotation schemes.

Findings

01 All evaluated systems achieved span-level F1 below 0.14.

02 The best system (Presidio) had an F1 of 0.1385 and zero recall on most entity types.

03 PIIBench is more challenging than existing single-source PII datasets.
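
For readers unfamiliar with the metric behind these findings, the following is a minimal sketch of span-level F1 over BIO-tagged sequences. It assumes exact-match scoring (a predicted span counts only if its boundaries and entity type both match a gold span); the page does not state the paper's matching criterion. The function names and the `EMAIL` and `PERSON` labels are hypothetical and used for illustration only.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged exact-match span F1 (an assumed scoring rule)."""
    tp = fp = fn = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two gold spans is recovered.
gold = [["B-EMAIL", "I-EMAIL", "O", "B-PERSON"]]
pred = [["B-EMAIL", "I-EMAIL", "O", "O"]]
score = span_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under this rule a system with zero recall on an entity type contributes only false negatives for that type, which is consistent with how a low overall F1 can coexist with reasonable precision on a few types.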

Abstract

We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and annotated financial-domain text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near-absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution.…
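
The three pipeline stages named in the abstract (label mapping, frequency-based suppression, source-stratified splitting) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the label map entries, the `min_count` threshold, the choice to map unmapped or suppressed labels to `O`, and the record layout (`{"source": ..., "tags": [...]}`) are all hypothetical.

```python
import random
from collections import Counter, defaultdict

# Hypothetical label map; the paper's actual 80+ variants are not listed here.
LABEL_MAP = {
    "PER": "PERSON",
    "person_name": "PERSON",
    "EMAIL_ADDRESS": "EMAIL",
    "email": "EMAIL",
}


def normalize_tag(tag, label_map):
    """Map a source-specific BIO tag to a canonical one (unmapped -> "O", assumed)."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    canonical = label_map.get(label)
    return f"{prefix}-{canonical}" if canonical else "O"


def suppress_rare(sequences, min_count):
    """Replace entity types with fewer than min_count mentions by "O"."""
    counts = Counter(
        tag[2:] for seq in sequences for tag in seq["tags"] if tag.startswith("B-")
    )
    keep = {label for label, n in counts.items() if n >= min_count}
    for seq in sequences:
        seq["tags"] = [t if t == "O" or t[2:] in keep else "O" for t in seq["tags"]]
    return sequences


def stratified_split(sequences, seed=0):
    """Split 80/10/10 within each source so every split keeps the source mix."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for seq in sequences:
        by_source[seq["source"]].append(seq)
    splits = {"train": [], "validation": [], "test": []}
    for items in by_source.values():
        rng.shuffle(items)
        n_train = len(items) * 8 // 10  # integer arithmetic avoids float rounding
        n_val = len(items) // 10
        splits["train"] += items[:n_train]
        splits["validation"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits
```

Splitting inside each source group, rather than over the pooled corpus, is one straightforward way to preserve the source distribution across train, validation, and test.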

Code & Models

Repositories

pritesh-2711/pii-bench
github

Datasets

Pritesh-2711/pii-bench
dataset · 347 downloads