Robust Benchmarking for Machine Learning of Clinical Entity Extraction
Monica Agrawal, Chloe O'Connell, Yasmin Fatemi, Ariel Levy, David Sontag

TL;DR
This paper critically evaluates current clinical entity extraction systems, revealing their brittleness and proposing a new annotation framework for more robust benchmarking and future improvements.
Contribution
It introduces a reformulated annotation framework that accounts for vocabulary inconsistencies and evaluation limitations, enabling more reliable benchmarking of clinical entity extraction methods.
Findings
Normalization accuracy is high for common concepts (95.3%)
Normalization accuracy is much lower for concepts unseen in training data (69.3%)
Jaccard similarity between annotators: 0.73
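For readers unfamiliar with the agreement metric, here is a minimal sketch of Jaccard similarity between two annotators. The summary does not say which sets were compared, so this assumes each annotator produces a set of concept identifiers for the same span of text; the identifiers (`CUI_A` and so on) and the function name `jaccard` are hypothetical placeholders, not taken from the paper.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|.

    Returns 1.0 when both sets are empty (a convention chosen here,
    not something stated in the source).
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical concept identifiers assigned by two annotators to the same text.
annotator_1 = {"CUI_A", "CUI_B", "CUI_C"}
annotator_2 = {"CUI_A", "CUI_B", "CUI_D"}

score = jaccard(annotator_1, annotator_2)
print(f"{score:.2f}")  # 2 shared identifiers out of 4 distinct ones -> 0.50
```

A score of 0.73 would therefore mean that, on average, roughly three quarters of the identifiers chosen by either annotator were chosen by both, under whatever set definition the authors used.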
Abstract
Clinical studies often require understanding elements of a patient's narrative that exist only in free text clinical notes. To transform notes into structured data for downstream use, these elements are commonly extracted and normalized to medical vocabularies. In this work, we audit the performance of and indicate areas of improvement for state-of-the-art systems. We find that high task accuracies for clinical entity normalization systems on the 2019 n2c2 Shared Task are misleading, and underlying performance is still brittle. Normalization accuracy is high for common concepts (95.3%), but much lower for concepts unseen in training data (69.3%). We demonstrate that current approaches are hindered in part by inconsistencies in medical vocabularies, limitations of existing labeling schemas, and narrow evaluation techniques. We reformulate the annotation framework for clinical entity…
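The gap between the two accuracy figures comes from scoring predictions separately depending on whether the gold concept appeared in training data. The sketch below shows one way such a stratified accuracy could be computed; it is not the authors' evaluation code. The function name `stratified_accuracy`, the toy labels, and the simple seen/unseen split are assumptions, and the summary does not specify how "common" concepts were defined.

```python
from collections import defaultdict


def stratified_accuracy(gold, pred, train_concepts):
    """Accuracy split by whether the gold concept occurs in the training set."""
    buckets = defaultdict(lambda: [0, 0])  # stratum -> [correct, total]
    for g, p in zip(gold, pred):
        stratum = "seen" if g in train_concepts else "unseen"
        buckets[stratum][0] += int(g == p)
        buckets[stratum][1] += 1
    return {stratum: correct / total for stratum, (correct, total) in buckets.items()}


# Toy data with hypothetical concept labels.
train_concepts = {"A", "B"}
gold = ["A", "A", "B", "C", "D"]
pred = ["A", "A", "B", "C", "B"]

result = stratified_accuracy(gold, pred, train_concepts)
print(result)  # {'seen': 1.0, 'unseen': 0.5}
```

Reporting the strata side by side, rather than a single pooled accuracy, is what exposes the kind of brittleness the abstract describes: a pooled score is dominated by the frequent, already-seen concepts.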
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Machine Learning in Healthcare
