Paradox of De-identification: A Critique of HIPAA Safe Harbour in the Age of LLMs
Lavender Y. Jiang, Xujin Chris Liu, Kyunghyun Cho, Eric K. Oermann

TL;DR
This paper critiques HIPAA Safe Harbor's de-identification approach, highlighting its inadequacy against modern LLMs that can infer identities from latent correlations in clinical notes, and discusses the implications for patient privacy.
Contribution
It formalizes, via a causal graph, the correlations between identity and quasi-identifiers that current de-identification standards ignore, and empirically demonstrates that LLMs can re-identify patients from de-identified notes.
Findings
LLMs can re-identify patients from de-identified notes
Diagnosis alone can predict patient neighborhoods
Safe Harbor de-identification is insufficient against modern AI models
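To make the second finding concrete, here is a minimal sketch of why a quasi-identifier such as diagnosis can leak location. It is not the paper's method, which uses LLMs on real clinical notes. The records, the diagnosis and neighborhood labels, and both function names are hypothetical, and accuracy is measured in-sample on the same toy data for simplicity.

```python
from collections import Counter, defaultdict

# Hypothetical, fully synthetic (diagnosis, neighborhood) pairs.
# These are illustrative assumptions, not data from the paper.
records = [
    ("asthma", "north"), ("asthma", "north"), ("asthma", "south"),
    ("lead_poisoning", "east"), ("lead_poisoning", "east"),
    ("diabetes", "south"), ("diabetes", "south"), ("diabetes", "north"),
]

def majority_baseline_accuracy(rows):
    """Accuracy of always guessing the single most common neighborhood."""
    counts = Counter(neighborhood for _, neighborhood in rows)
    return counts.most_common(1)[0][1] / len(rows)

def diagnosis_conditioned_accuracy(rows):
    """Accuracy of guessing the most common neighborhood per diagnosis."""
    by_diagnosis = defaultdict(Counter)
    for diagnosis, neighborhood in rows:
        by_diagnosis[diagnosis][neighborhood] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_diagnosis.values())
    return correct / len(rows)

baseline = majority_baseline_accuracy(records)
conditioned = diagnosis_conditioned_accuracy(records)
print(f"baseline={baseline:.3f} diagnosis-only={conditioned:.3f}")
```

Any gap between the two numbers reflects information about location carried by diagnosis alone, which is the kind of latent correlation that removing explicit identifiers does not address.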
Abstract
Privacy is a human right that sustains patient-provider trust. Clinical notes capture a patient's private vulnerability and individuality, which are used for care coordination and research. Under HIPAA Safe Harbor, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data, focusing on the removal of explicit identifiers while ignoring the latent information found in correlations between identity and quasi-identifiers, which can be captured by modern LLMs. We first formalize these correlations using a causal graph, then validate it empirically through individual re-identification of patients from scrubbed notes. The paradox of de-identification is further shown through a diagnosis ablation: even when all other information is removed, the model can predict the patient's neighborhood based on diagnosis alone. This…
Taxonomy
Topics: Patient Dignity and Privacy · Privacy-Preserving Technologies in Data · Privacy, Security, and Data Protection
