Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions

Kuleen Sasse; Shinjitha Vadlakonda; Richard E. Kennedy; John D. Osborne

arXiv:2410.07951·cs.CL·October 11, 2024

Open Access

TL;DR

This study demonstrates that synthetic training data generated by a large language model significantly enhances disease entity normalization performance, especially for out-of-distribution data, with limited impact on recognition accuracy.

Contribution

The paper introduces a novel approach of using LLM-generated synthetic mentions to improve disease normalization, showing substantial gains in out-of-distribution scenarios.

Findings

01

Synthetic data improved DEN top-1 accuracy by 3-9 points.

02

Out-of-distribution DEN accuracy increased by 20-55 points.

03

Limited improvement observed for DER performance.
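
The DEN figures above are top-1 accuracies, reported overall and on out-of-distribution mentions. The following is a minimal sketch of how such numbers could be computed, not the authors' evaluation code. It assumes that each mention comes with a gold concept ID and a ranked candidate list, and that "out of distribution" means the gold concept never appears in the training corpus; both the function names and that OOD definition are illustrative assumptions, and the concept IDs are placeholders.

```python
from typing import Sequence, Set


def top1_accuracy(gold: Sequence[str], ranked: Sequence[Sequence[str]]) -> float:
    """Fraction of mentions whose first-ranked candidate equals the gold concept ID."""
    if len(gold) != len(ranked):
        raise ValueError("gold and ranked must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for g, cands in zip(gold, ranked) if cands and cands[0] == g)
    return hits / len(gold)


def ood_top1_accuracy(
    gold: Sequence[str],
    ranked: Sequence[Sequence[str]],
    train_concepts: Set[str],
) -> float:
    """Top-1 accuracy restricted to mentions whose gold concept is unseen in training.

    Assumption: OOD is defined by concept ID membership in the training set.
    """
    kept = [(g, c) for g, c in zip(gold, ranked) if g not in train_concepts]
    if not kept:
        return 0.0
    return top1_accuracy([g for g, _ in kept], [c for _, c in kept])


# Placeholder concept IDs, for illustration only.
gold_ids = ["C_A", "C_B", "C_C", "C_D"]
predictions = [["C_A", "C_X"], ["C_C", "C_B"], ["C_C"], ["C_D"]]
seen_in_training = {"C_A", "C_B"}

overall = top1_accuracy(gold_ids, predictions)
ood = ood_top1_accuracy(gold_ids, predictions, seen_in_training)
```

A gain "in points" is then the difference between two such accuracies multiplied by 100, for example between a model trained with and without the synthetic mentions.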

Abstract

Background: Machine learning methods for clinical named entity recognition and entity normalization systems can utilize both labeled corpora and Knowledge Graphs (KGs) for learning. However, infrequently occurring concepts may have few mentions in training corpora and lack detailed descriptions or synonyms, even in large KGs. For Disease Entity Recognition (DER) and Disease Entity Normalization (DEN), this can result in fewer high quality training examples relative to the number of known diseases. Large Language Model (LLM) generation of synthetic training examples could improve performance in these information extraction tasks. Methods: We fine-tuned a LLaMa-2 13B Chat LLM to generate a synthetic corpus containing normalized mentions of concepts from the Unified Medical Language System (UMLS) Disease Semantic Group. We measured overall and Out of Distribution (OOD) performance for…

Taxonomy

Topics: Topic Modeling