Beyond De-Identification: A Structured Approach for Defining and Detecting Indirect Identifiers in Medical Texts

Ibrahim Baroud; Lisa Raithel; Sebastian Möller; Roland Roller

arXiv:2502.13342 · cs.CL · February 20, 2025

Open Access

TL;DR

This paper presents a structured schema for identifying indirect identifiers in medical texts, aiming to enhance privacy protection by improving de-identification techniques for unstructured clinical data.

Contribution

It introduces a nine-category schema for indirect identifiers, annotates 100 MIMIC-III discharge summaries with it, and proposes baseline models for detecting these identifiers in medical texts.

Findings

01 Annotated 100 MIMIC-III discharge summaries, yielding 6,199 indirect-identifier annotations

02 Proposed baseline models for identifying indirect identifiers

03 Plans to release the annotation guidelines, annotation spans, and corresponding MIMIC-III document IDs

Abstract

Sharing sensitive texts for scientific purposes requires appropriate techniques to protect the privacy of patients and healthcare personnel. Anonymizing textual data is particularly challenging due to the presence of diverse unstructured direct and indirect identifiers. To mitigate the risk of re-identification, this work introduces a schema of nine categories of indirect identifiers designed to account for different potential adversaries, including acquaintances, family members and medical staff. Using this schema, we annotate 100 MIMIC-III discharge summaries and propose baseline models for identifying indirect identifiers. We will release the annotation guidelines, annotation spans (6,199 annotations in total) and the corresponding MIMIC-III document IDs to support further research in this area.
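
Because the planned release consists of annotation spans plus document IDs rather than the texts themselves, a consumer would need to re-attach the spans to locally obtained documents. The following is a minimal sketch of that idea, not the authors' format: the `Span` fields, the `mask_spans` helper, the document ID, and the category names `CATEGORY_A` and `CATEGORY_B` are all hypothetical placeholders, and the example sentence is invented rather than taken from MIMIC-III.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A stand-off annotation; all field names are assumptions for illustration."""
    doc_id: str  # hypothetical document identifier
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # placeholder category name, not one of the paper's nine


def mask_spans(text: str, spans: list[Span]) -> str:
    """Replace each annotated span with a [LABEL] placeholder.

    Assumes the spans do not overlap.
    """
    out = text
    # Work from right to left so earlier offsets stay valid after each edit.
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:s.start] + f"[{s.label}]" + out[s.end:]
    return out


# Toy text invented for illustration; it is not MIMIC-III data.
text = "Patient works as a pilot and lives alone."
spans = [
    Span("doc-0", 19, 24, "CATEGORY_A"),  # covers "pilot"
    Span("doc-0", 29, 40, "CATEGORY_B"),  # covers "lives alone"
]
masked = mask_spans(text, spans)
print(masked)
```

Storing offsets instead of text keeps the shared artifact free of patient content, at the cost of requiring that every user's copy of a document is character-identical to the one that was annotated.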

Taxonomy

Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Topic Modeling