Named Entity Recognition in Unstructured Medical Text Documents
Cole Pearson, Naeem Seliya, Rushit Dave

TL;DR
This study evaluates OpenNLP and spaCy for named entity recognition to de-identify sensitive information in independent medical examination reports. Both toolkits reach an f-measure above 0.9, with spaCy performing best.
Contribution
It compares the performance of two NLP toolkits for PII removal in medical texts and finds that spaCy, trained with a 70-30 split, gives the better results.
Findings
Both platforms achieve high de-identification performance (f-measure > 0.9)
spaCy trained with a 70-30 split performs best
Effective PII removal in medical reports using NER tools
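To make the reported metric concrete, here is a minimal sketch of an entity-level f-measure and a 70-30 document split. It is not the authors' evaluation code. It assumes exact-match scoring on (start, end, label) spans, and the function names `f_measure`, `score_entities`, and `split_70_30` are hypothetical.

```python
import random


def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_entities(gold, predicted):
    """Exact-match scoring: a prediction counts only if the span and label
    both match a gold entity (an assumption, not the paper's stated rule)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return f_measure(tp, fp, fn)


def split_70_30(documents, seed=0):
    """Shuffle documents and split them 70-30 into two lists."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    cut = round(len(docs) * 0.7)
    return docs[:cut], docs[cut:]
```

With one correct entity, one spurious entity, and one missed entity, precision and recall are both 0.5, so the f-measure is 0.5. A score above 0.9 therefore means that both missed and spurious PII spans are rare.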
Abstract
Physicians provide expert opinion to legal courts on the medical state of patients, including determining if a patient is likely to have permanent or non-permanent injuries or ailments. An independent medical examination (IME) report summarizes a physician's medical opinion about a patient's health status based on the physician's expertise. IME reports contain private and sensitive information (Personally Identifiable Information or PII) that needs to be removed or randomly encoded before further research work can be conducted. In our study, the IME is an orthopedic surgeon from a private practice in the United States. The goal of this research is to perform named entity recognition (NER) to identify and subsequently remove/encode PII from IME reports prepared by the physician. We apply the NER toolkits of OpenNLP and spaCy, two freely available natural language processing…
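The remove/encode step described in the abstract can be sketched as follows. This is an illustration only, not the authors' pipeline. It assumes the NER model has already produced non-overlapping character spans as (start, end, label) tuples (with spaCy these would correspond to `ent.start_char`, `ent.end_char`, and `ent.label_` for each entity in `doc.ents`). The function name `redact` and the sample sentence are made up.

```python
def redact(text, spans):
    """Replace each detected entity span with a bracketed label placeholder.

    spans: iterable of (start, end, label) character offsets, assumed
    non-overlapping (hypothetical input format for this sketch).
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # keep text before the entity
        pieces.append(f"[{label}]")        # encode the entity as its label
        cursor = end
    pieces.append(text[cursor:])           # keep the remaining tail
    return "".join(pieces)


# Made-up example sentence; offsets are computed rather than hard-coded.
sample = "John Smith was examined on 03/14/2019."
name, date = "John Smith", "03/14/2019"
sample_spans = [
    (sample.index(name), sample.index(name) + len(name), "PERSON"),
    (sample.index(date), sample.index(date) + len(date), "DATE"),
]
print(redact(sample, sample_spans))
```

Replacing spans with their label rather than deleting them keeps the sentence readable for later research use. A random-encoding variant would substitute a generated token for the bracketed label.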
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
