PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
Pritesh Jha

TL;DR
PIIBench is a unified benchmark corpus for PII detection that consolidates ten publicly available datasets under a single annotation scheme and proves difficult for the existing detection systems evaluated on it.
Contribution
The paper introduces PIIBench, a standardized PII corpus of 2,369,883 annotated sequences, together with a label normalization pipeline and baseline evaluations. It addresses the fragmentation of existing PII detection benchmarks, whose annotation schemes are mutually incompatible.
Findings
All evaluated systems achieved span-level F1 below 0.14.
The best system (Presidio) had an F1 of 0.1385 and zero recall on most entity types.
PIIBench is more challenging than existing single-source PII datasets.
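The span-level F1 reported in these findings can be illustrated with a short sketch. It assumes exact-match scoring (a predicted span counts only if its boundaries and entity type both match a gold span) with micro-averaging over all sequences; the summary does not state which variant the paper uses, and the function names `bio_to_spans` and `span_f1` are hypothetical.

```python
from typing import List, Set, Tuple

# A span is (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of typed spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # the current span continues
        if label is not None:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # Assumption: a stray I- tag with no matching B- opens a new span.
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1 over all sequences."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)   # spans found with correct boundaries and type
        fp += len(p - g)   # predicted spans with no exact gold match
        fn += len(g - p)   # gold spans the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this scoring, a system that detects only a few entity types can keep reasonable precision while its recall, and therefore its F1, stays low, which is consistent with the zero-recall observation above.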
Abstract
We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial-domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near-absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution.…
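The three pipeline stages named in the abstract (label mapping, frequency-based suppression, and source-stratified splitting) might look roughly like the sketch below. It is not the authors' implementation: the entries in `LABEL_MAP`, the `min_count` threshold, the function names, and the choice to rewrite suppressed or malformed tags as `O` are all illustrative assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List

# Hypothetical excerpt of a source-label -> canonical-label map.
LABEL_MAP: Dict[str, str] = {
    "EMAIL_ADDRESS": "EMAIL",
    "email": "EMAIL",
    "PER": "PERSON",
    "PHONE_NUM": "PHONE",
}


def normalize_tag(tag: str, label_map: Dict[str, str]) -> str:
    """Map one source tag to a canonical BIO tag; malformed tags become 'O'."""
    prefix, _, label = tag.partition("-")
    if prefix not in ("B", "I") or not label:
        return "O"
    # Assumption: labels absent from the map are kept unchanged.
    return f"{prefix}-{label_map.get(label, label)}"


def suppress_rare(sequences: List[List[str]], min_count: int) -> List[List[str]]:
    """Rewrite tags of entity types with fewer than min_count mentions as 'O'."""
    counts = Counter(t[2:] for seq in sequences for t in seq if t.startswith("B-"))
    keep = {label for label, n in counts.items() if n >= min_count}
    return [[t if t == "O" or t[2:] in keep else "O" for t in seq] for seq in sequences]


def stratified_split(sources: List[str], seed: int = 0) -> Dict[str, List[int]]:
    """Split example indices 80/10/10 within each source to preserve source mix."""
    by_source: Dict[str, List[int]] = defaultdict(list)
    for idx, source in enumerate(sources):
        by_source[source].append(idx)
    rng = random.Random(seed)
    splits: Dict[str, List[int]] = {"train": [], "validation": [], "test": []}
    for idxs in by_source.values():
        rng.shuffle(idxs)
        n_train = int(len(idxs) * 0.8)
        n_val = int(len(idxs) * 0.1)
        splits["train"] += idxs[:n_train]
        splits["validation"] += idxs[n_train:n_train + n_val]
        splits["test"] += idxs[n_train + n_val:]  # remainder goes to test
    return splits
```

Splitting inside each source, rather than over the pooled corpus, is what keeps the per-source proportions the same in all three partitions.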
