Re-identification of De-identified Documents with Autoregressive Infilling

Lucas Georges Gabriel Charpentier; Pierre Lison

arXiv:2505.12859·cs.CL·May 20, 2025

TL;DR

This paper introduces a novel re-identification method that uses autoregressive infilling to recover masked personal information in de-identified documents, revealing potential privacy vulnerabilities.

Contribution

The paper presents a RAG-inspired approach that pairs a retriever with an infilling model to reverse de-identification, and reports recovering up to 80% of masked spans in evaluations spanning biographies, court rulings, and clinical notes.

Findings

1. Up to 80% of masked spans successfully recovered.

2. Re-identification accuracy improves with more background knowledge.

3. Effective across diverse document types like biographies, court rulings, and clinical notes.

Abstract

Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are…
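
The retrieve-then-infill loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the `[MASK]` token, the word-overlap retriever, the regex-based `toy_infill` stand-in for an infilling model, the choice to fill one span per iteration (leftmost first), and the stopping condition (no mask left) are all hypothetical choices made for the example.

```python
import re
from typing import Callable, List

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def tokenize(text: str) -> set:
    """Lowercased word set used for the toy overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(masked_text: str, background: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank background passages by word overlap with the text."""
    query = tokenize(masked_text.replace(MASK, " "))
    ranked = sorted(background, key=lambda p: len(query & tokenize(p)), reverse=True)
    return ranked[:k]


def toy_infill(text: str, passages: List[str]) -> str:
    """Stand-in for an infilling model (assumption): reuse the capitalized
    words that follow the same two-word left context in a retrieved passage."""
    left = text.split(MASK, 1)[0]
    context = re.findall(r"\w+", left)[-2:]
    if context:
        pattern = (
            "(?i:" + r"\W+".join(re.escape(w) for w in context) + ")"
            + r"\W+([A-Z]\w*(?:\s[A-Z]\w*)*)"
        )
        for passage in passages:
            match = re.search(pattern, passage)
            if match:
                return match.group(1)
    return "UNKNOWN"  # fallback so the loop always makes progress


def reidentify(
    masked_text: str,
    background: List[str],
    infill: Callable[[str, List[str]], str],
    k: int = 2,
) -> str:
    """Alternate retrieval and infilling until no masked span remains."""
    text = masked_text
    while MASK in text:
        passages = retrieve(text, background, k)   # step 1: retrieve
        guess = infill(text, passages)             # step 2: infill one span
        text = text.replace(MASK, guess, 1)        # fill the leftmost mask
    return text


# Invented example data, for illustration only.
background = [
    "The applicant, Jane Doe, lodged a complaint in 2001.",
    "Jane Doe was born in Oslo and later moved abroad.",
    "Rain is expected tomorrow.",
]
masked = "The applicant, [MASK], was born in [MASK]."
result = reidentify(masked, background, toy_infill)
print(result)
```

On this invented example the loop fills the name first and the birthplace second. Because retrieval is rerun on the partially filled text, each recovered span can contribute to the next query; a real system would replace both toy components with learned models.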

Taxonomy

Topics: Topic Modeling · Authorship Attribution and Profiling · Advanced Graph Neural Networks