Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions
Kuleen Sasse, Shinjitha Vadlakonda, Richard E. Kennedy, John D. Osborne

TL;DR
This study demonstrates that synthetic training data generated by a large language model significantly enhances disease entity normalization performance, especially for out-of-distribution data, with limited impact on recognition accuracy.
Contribution
The paper introduces a novel approach of using LLM-generated synthetic mentions to improve disease normalization, showing substantial gains in out-of-distribution scenarios.
Findings
Synthetic data improved Disease Entity Normalization (DEN) top-1 accuracy by 3-9 points.
Out-of-distribution (OOD) DEN accuracy increased by 20-55 points.
Disease Entity Recognition (DER) performance showed only limited improvement.
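To make the two DEN metrics above concrete, here is a minimal sketch of how top-1 accuracy could be computed overall and on an out-of-distribution subset. It is an illustration, not the authors' evaluation code. It assumes that a mention counts as OOD when its gold concept never appears in the training data, which this summary does not confirm. The function names and the concept identifiers (C0000001 and so on) are hypothetical.

```python
from typing import Sequence, Set


def top1_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of mentions whose top-ranked predicted concept equals the gold concept."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def ood_top1_accuracy(
    gold: Sequence[str], predicted: Sequence[str], train_concepts: Set[str]
) -> float:
    """Top-1 accuracy restricted to mentions whose gold concept is absent from training.

    ASSUMPTION: "out of distribution" here means "gold concept unseen in training".
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = [(g, p) for g, p in zip(gold, predicted) if g not in train_concepts]
    if not pairs:
        return 0.0
    return top1_accuracy([g for g, _ in pairs], [p for _, p in pairs])


# Hypothetical concept identifiers, for illustration only.
gold = ["C0000001", "C0000002", "C0000003", "C0000004"]
predicted = ["C0000001", "C0000002", "C0000009", "C0000004"]
train_concepts = {"C0000001", "C0000002"}

overall = top1_accuracy(gold, predicted)
ood = ood_top1_accuracy(gold, predicted, train_concepts)
print(f"overall top-1: {overall:.2f}, OOD top-1: {ood:.2f}")
```

In this toy example three of four mentions are normalized correctly overall, while only one of the two mentions with unseen gold concepts is correct, which shows how an OOD score can sit well below the overall score.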
Abstract
Background: Machine learning methods for clinical named entity recognition and entity normalization systems can utilize both labeled corpora and Knowledge Graphs (KGs) for learning. However, infrequently occurring concepts may have few mentions in training corpora and lack detailed descriptions or synonyms, even in large KGs. For Disease Entity Recognition (DER) and Disease Entity Normalization (DEN), this can result in fewer high quality training examples relative to the number of known diseases. Large Language Model (LLM) generation of synthetic training examples could improve performance in these information extraction tasks. Methods: We fine-tuned a LLaMa-2 13B Chat LLM to generate a synthetic corpus containing normalized mentions of concepts from the Unified Medical Language System (UMLS) Disease Semantic Group. We measured overall and Out of Distribution (OOD) performance for…
Taxonomy
Topics: Topic Modeling
