Beyond De-Identification: A Structured Approach for Defining and Detecting Indirect Identifiers in Medical Texts
Ibrahim Baroud, Lisa Raithel, Sebastian Möller, Roland Roller

TL;DR
This paper presents a structured schema for identifying indirect identifiers in medical texts, aiming to enhance privacy protection by improving de-identification techniques for unstructured clinical data.
Contribution
It introduces a nine-category schema for indirect identifiers, annotates 100 MIMIC-III discharge summaries with it (6,199 annotations in total), and proposes baseline models to detect these identifiers in medical texts.
Findings
Annotated 6,199 indirect identifiers across 100 MIMIC-III discharge summaries
Proposed baseline models for identifying indirect identifiers
Plans to release the annotation guidelines, annotation spans, and corresponding MIMIC-III document IDs for future research
Abstract
Sharing sensitive texts for scientific purposes requires appropriate techniques to protect the privacy of patients and healthcare personnel. Anonymizing textual data is particularly challenging due to the presence of diverse unstructured direct and indirect identifiers. To mitigate the risk of re-identification, this work introduces a schema of nine categories of indirect identifiers designed to account for different potential adversaries, including acquaintances, family members and medical staff. Using this schema, we annotate 100 MIMIC-III discharge summaries and propose baseline models for identifying indirect identifiers. We will release the annotation guidelines, annotation spans (6,199 annotations in total) and the corresponding MIMIC-III document IDs to support further research in this area.
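Because the planned release consists of annotation spans and document IDs rather than the texts themselves, anyone using it would presumably need to map the spans back onto their own copy of the MIMIC-III notes. The following is a minimal sketch of that step, not the authors' format: the `SpanAnnotation` fields, the character-offset convention, and the `CATEGORY_A`/`CATEGORY_B` labels are all assumptions made for illustration, and the example sentence is synthetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanAnnotation:
    # All field names are hypothetical; the released format may differ.
    doc_id: str    # identifier of the source document
    start: int     # character offset, inclusive
    end: int       # character offset, exclusive
    category: str  # placeholder label; the nine category names are not listed in the abstract


def resolve_spans(texts, annotations):
    """Pair each annotation with the text it covers.

    Annotations whose document is missing or whose offsets fall outside
    the document are skipped rather than raising an error.
    """
    resolved = []
    for ann in annotations:
        text = texts.get(ann.doc_id)
        if text is None or not (0 <= ann.start < ann.end <= len(text)):
            continue
        resolved.append((ann, text[ann.start:ann.end]))
    return resolved


# Synthetic example text, not taken from MIMIC-III.
texts = {"doc-1": "Patient works as a firefighter and lives alone."}
annotations = [
    SpanAnnotation("doc-1", 19, 30, "CATEGORY_A"),
    SpanAnnotation("doc-1", 40, 400, "CATEGORY_B"),  # offsets out of range, skipped
    SpanAnnotation("doc-2", 0, 5, "CATEGORY_A"),     # unknown document, skipped
]

for ann, covered in resolve_spans(texts, annotations):
    print(ann.category, repr(covered))
```

Skipping spans that do not fit is one possible design choice; a stricter pipeline might raise an error instead, since a mismatch could indicate that the local text differs from the version that was annotated.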
Taxonomy
Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Topic Modeling
