Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse

Xavier Tannier; Perceval Wajsbürt; Alice Calliger; Basile Dura; Alexandre Mouchet; Martin Hilka; Romain Bey

arXiv:2303.13451·cs.CL·March 24, 2023·1 cites

Open Access

TL;DR

This paper presents a hybrid NLP algorithm for de-identifying clinical documents, achieving high accuracy to facilitate research access while protecting patient privacy.

Contribution

It introduces a systematic pseudonymization system combining deep learning and rules, with detailed implementation insights and publicly shared code.

Findings

01

Overall F1-score of 0.99 for de-identification

02

Analysis of factors affecting system performance

03

Guidelines and open-source code provided

Abstract

This study addresses the de-identification of clinical reports, which is needed to give researchers access to data while protecting patient privacy. It highlights the difficulties of sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system that merges the results of a deep learning model with those of manual rules. Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, or rule addition. We…
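The hybrid idea described in the abstract, merging the output of a learned model with that of manual rules, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the two regex rules, the labels, the hard-coded model span, the "keep the longest span on overlap" policy, and the function names (rule_spans, merge_spans, mask) are all assumptions made for illustration. The sketch also replaces entities with placeholder tags, whereas a real pseudonymization step would substitute plausible surrogate values.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)

# Hypothetical hand-written rules; a real system would cover many more cases.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return every span matched by one of the regex rules."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Union of both sources; when spans overlap, keep the longest one.

    This overlap policy is an assumption, not the one used in the paper.
    """
    candidates = sorted(model + rules, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end, label in candidates:
        # Accept a candidate only if it does not overlap an accepted span.
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append((start, end, label))
    return sorted(kept)


def mask(text: str, spans: List[Span]) -> str:
    """Replace each span (sorted, non-overlapping) with a <LABEL> tag."""
    parts = []
    cursor = 0
    for start, end, label in spans:
        parts.append(text[cursor:start])
        parts.append(f"<{label}>")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Mme Dupont seen on 12/03/2021, tel 01 23 45 67 89."
# Stand-in for the deep learning model's predictions (hard-coded here).
model_spans = [(4, 10, "LASTNAME")]
masked = mask(text, merge_spans(model_spans, rule_spans(text)))
```

In this toy setup the model contributes the name span and the rules contribute the date and phone number. Taking the union of the two sources favors recall, which is usually the priority when the goal is to avoid leaking identifying information.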

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Data Quality and Management · Topic Modeling