Benchmarking Modern Named Entity Recognition Techniques for Free-text Health Record De-identification
Abdullah Ahmed, Adeel Abbasi, Carsten Eickhoff

TL;DR
This paper evaluates various deep learning methods for automatically identifying and removing protected health information from electronic health records to facilitate data sharing for research while maintaining privacy.
Contribution
It systematically compares multiple NER techniques on EHR data, highlighting the effectiveness of BiLSTM-CRF and the impact of character embeddings and transformers.
Findings
BiLSTM-CRF is the best-performing encoder/decoder combination for de-identification
Character embeddings and CRFs tend to improve precision at the cost of recall
Transformers alone underperform as context encoders
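The precision/recall trade-off in these findings is easier to read with the metrics spelled out. The following is a minimal sketch of entity-level precision and recall over BIO-tagged sequences. It is not the authors' evaluation code; the function names and the label names (NAME, DATE, HOSPITAL) are illustrative assumptions, and stray I- tags with no preceding B- tag are simply ignored.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) entity spans in a BIO tag sequence."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues_span = tag.startswith("I-") and label == tag[2:]
        if not continues_span:
            if label is not None:
                spans.add((start, i, label))
            # Only a B- tag opens a new span; stray I- tags are ignored (assumption).
            start, label = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans


def precision_recall(gold_tags, pred_tags):
    """Entity-level precision and recall with exact span and label matching."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: a cautious tagger that finds one of three PHI spans.
gold = ["B-NAME", "I-NAME", "O", "B-DATE", "O", "B-HOSPITAL"]
pred = ["B-NAME", "I-NAME", "O", "O", "O", "O"]
precision, recall = precision_recall(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every predicted span is correct (precision 1.0) but two of the three PHI spans are missed (recall 1/3). For de-identification a missed span means PHI left in the record, which is why a gain in precision bought with recall matters.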
Abstract
Electronic Health Records (EHRs) have become the primary form of medical data-keeping across the United States. Federal law restricts the sharing of any EHR data that contains protected health information (PHI). De-identification, the process of identifying and removing all PHI, is crucial for making EHR data publicly available for scientific research. This project explores several deep learning-based named entity recognition (NER) methods to determine which method(s) perform better on the de-identification task. We trained and tested our models on the i2b2 training dataset, and qualitatively assessed their performance using EHR data collected from a local hospital. We found that 1) BiLSTM-CRF represents the best-performing encoder/decoder combination, 2) character-embeddings and CRFs tend to improve precision at the price of recall, and 3) transformers alone under-perform as context…
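To make the link between NER and de-identification concrete, here is a minimal sketch of the removal step, assuming some NER model has already produced one BIO tag per token. The `redact` function, the placeholder format, and the label names are hypothetical and are not taken from the paper.

```python
def redact(tokens, tags):
    """Replace each tagged PHI span with a single [LABEL] placeholder."""
    output = []
    previous_label = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Non-PHI tokens pass through unchanged.
            output.append(token)
            previous_label = None
            continue
        label = tag[2:]
        # Emit one placeholder per span: on B- tags, or when the label changes.
        if tag.startswith("B-") or label != previous_label:
            output.append(f"[{label}]")
        previous_label = label
    return " ".join(output)


# Hypothetical tokens and tags, standing in for the output of an NER model.
tokens = ["Mr.", "John", "Smith", "was", "admitted", "on", "03/14/2019", "."]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "O", "B-DATE", "O"]
print(redact(tokens, tags))  # Mr. [NAME] was admitted on [DATE] .
```

Any token the tagger labels "O" is copied through verbatim, so missed PHI survives in the output. That is the reason recall carries so much weight in this task.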
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
