Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse
Xavier Tannier, Perceval Wajsbürt, Alice Calliger, Basile Dura, Alexandre Mouchet, Martin Hilka, Romain Bey

TL;DR
This paper presents a hybrid NLP algorithm for de-identifying clinical documents, achieving high accuracy to facilitate research access while protecting patient privacy.
Contribution
It introduces a systematic pseudonymization system combining deep learning and rules, with detailed implementation insights and publicly shared code.
Findings
Overall F1-score of 0.99 for de-identification
Analysis of factors affecting system performance
Guidelines and open-source code provided
Abstract
The objective of this study is to address the critical issue of de-identification of clinical reports in order to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents according to 12 types of identifying entities, and built a hybrid system that merges the results of a deep learning model with those of manual rules. Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, or rule addition. We…
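To make the idea of a hybrid system concrete, here is a minimal Python sketch of merging entity spans from a statistical model with spans from hand-written rules, then masking them. It is not the authors' implementation: the `Span` type, the two regular expressions, the "rules win on overlap" policy, and the `<LABEL>` placeholder format are all assumptions chosen for illustration, and the model output is supplied by hand rather than predicted.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    """A detected identifying entity (hypothetical structure)."""
    start: int
    end: int
    label: str


# Hypothetical rules; a real system would cover many more entity types.
RULES = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Collect every span matched by the hand-written rules."""
    return [
        Span(m.start(), m.end(), label)
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(model_spans: List[Span], rules: List[Span]) -> List[Span]:
    """Union of both sources; on overlap the rule span is kept (assumption)."""
    merged = list(rules)
    for s in model_spans:
        overlaps = any(s.start < r.end and r.start < s.end for r in rules)
        if not overlaps:
            merged.append(s)
    return sorted(merged)


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Replace each span with a placeholder tag such as <NAME>."""
    out, cursor = [], 0
    for s in sorted(spans):
        if s.start < cursor:  # skip spans overlapping one already masked
            continue
        out.append(text[cursor:s.start])
        out.append(f"<{s.label}>")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)


text = "Patient Jean Dupont seen on 12/03/2021, tel 01 23 45 67 89."
name_start = text.index("Jean Dupont")
# Stand-in for the deep learning model's predictions (hand-written here).
model_spans = [Span(name_start, name_start + len("Jean Dupont"), "NAME")]
print(pseudonymize(text, merge_spans(model_spans, rule_spans(text))))
```

In this sketch the rules catch rigidly formatted entities such as dates and phone numbers, while the model is assumed to supply context-dependent ones such as names. Other merge policies (for example keeping the longest span, or trusting the model first) are equally plausible.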
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Data Quality and Management · Topic Modeling
