TL;DR
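GLiNER2-PII is a small (0.3B-parameter) PII detection model that labels entities as character-level spans. It is trained on a multilingual synthetic corpus, reports the highest span-level F1 among five systems on the SPY benchmark, and is publicly available for research and deployment.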
Contribution
The paper introduces a small multilingual PII detection model trained on a synthetic corpus, which sidesteps both the scarcity of shareable annotated data and the privacy risks of collecting real PII at scale.
Findings
Achieves the highest span-level F1 on the SPY benchmark among the five systems compared.
Detects 42 PII entity types at character-span resolution across multiple languages.
Publicly released on Hugging Face for community use.
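For readers unfamiliar with the metric, span-level F1 can be sketched as follows. This is a minimal illustration that assumes exact matching on (start, end, label); the matching rule actually used by the SPY benchmark is not stated here, and the `Span` type, the label names, and the example text are all hypothetical.

```python
from __future__ import annotations

from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type (hypothetical names, not the paper's taxonomy)


def span_f1(gold: list[Span], pred: list[Span]) -> dict[str, float]:
    """Exact-match span-level precision, recall, and F1 for one text.

    Assumption: a prediction counts as correct only if its start, end,
    and label all match a gold span exactly.
    """
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


text = "Contact Ana Ruiz at ana@example.org."
gold = [Span(8, 16, "person"), Span(20, 35, "email")]
# The predicted email span swallows the trailing period, so it does not match.
pred = [Span(8, 16, "person"), Span(20, 36, "email")]
scores = span_f1(gold, pred)
```

Under exact matching, a boundary that is off by a single character counts as both a false positive and a false negative, which is why span-level scores tend to be stricter than token-level ones.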
Abstract
Reliable detection of personally identifiable information (PII) is increasingly important across modern data-processing systems, yet the task remains difficult: PII spans are heterogeneous, locale-dependent, context-sensitive, and often embedded in noisy or semi-structured documents. We present GLiNER2-PII, a small 0.3B-parameter model adapted from GLiNER2 and designed to recognize a broad taxonomy of 42 PII entity types at character-span resolution. Training such systems, however, is constrained by the scarcity of shareable annotated data and the privacy risks associated with collecting real PII at scale. To address this challenge, we construct a multilingual synthetic corpus of 4,910 annotated texts using a constraint-driven generation pipeline that produces diverse, realistic examples across languages, domains, formats, and entity distributions. On the challenging SPY benchmark,…
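To make "character-span resolution" concrete, the sketch below shows how character-offset predictions might be consumed downstream for redaction. It does not call the model: the spans are hard-coded, the `redact` helper is hypothetical, and the label names are placeholders rather than entries from the paper's 42-type taxonomy. It also assumes the spans do not overlap.

```python
from __future__ import annotations


def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) character span with a [LABEL] tag.

    Assumption: spans are non-overlapping, with an inclusive start and an
    exclusive end offset into `text`.
    """
    out = text
    # Work right to left so that earlier offsets remain valid after each edit.
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label.upper()}]" + out[end:]
    return out


text = "Call Li Wei on +1 202 555 0143."
spans = [(5, 11, "person"), (15, 30, "phone_number")]  # hypothetical output
redacted = redact(text, spans)
print(redacted)
```

Because the offsets index characters rather than tokens, the redaction is independent of any tokenizer and leaves surrounding punctuation untouched.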
