Identifying and Extracting Rare Disease Phenotypes with Large Language Models
Cathy Shyr, Yan Hu, Paul A. Harris, Hua Xu

TL;DR
This study evaluates ChatGPT's ability to extract rare disease phenotypes from unstructured text in zero- and few-shot settings, comparing it to traditional fine-tuning methods, and establishes a benchmark for this task.
Contribution
The study introduces novel prompts for ChatGPT to extract rare disease phenotypes and benchmarks its performance against fine-tuned models, highlighting both the potential and the limitations of prompt-based extraction.
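To make the zero- versus few-shot distinction concrete, the sketch below assembles an extraction prompt with or without worked examples. It is a minimal illustration, not the prompts used in the paper: the function name `build_prompt`, the instruction wording, and the sample sentences are all hypothetical.

```python
from typing import List, Sequence, Tuple

# A labeled example: (input text, list of entity mentions found in it).
Example = Tuple[str, List[str]]


def build_prompt(text: str, entity_type: str,
                 examples: Sequence[Example] = ()) -> str:
    """Build an NER prompt; wording is illustrative, not the authors' prompt.

    With no examples this is a zero-shot prompt; with one or more
    examples it becomes a one-shot or few-shot prompt.
    """
    lines = [
        f"Extract every {entity_type} mentioned in the text below.",
        "Return one mention per line, copied exactly from the text.",
        "If there are none, return the word NONE.",
    ]
    # Each demonstration uses the same layout as the final query.
    for ex_text, ex_mentions in examples:
        lines.append("")
        lines.append(f"Text: {ex_text}")
        lines.append("Mentions:")
        lines.extend(ex_mentions or ["NONE"])
    lines.append("")
    lines.append(f"Text: {text}")
    lines.append("Mentions:")
    return "\n".join(lines)


# Hypothetical inputs, for illustration only.
zero_shot = build_prompt("The patient presented with ataxia.", "sign")
one_shot = build_prompt(
    "The patient presented with ataxia.",
    "sign",
    examples=[("She developed progressive muscle weakness.",
               ["muscle weakness"])],
)
```

The resulting string would then be sent to a chat model; that call is left out here because the summary does not specify the model version or API settings.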
Findings
Fine-tuning BioClinicalBERT outperforms ChatGPT overall (F1 of 0.689 vs. 0.472 in the zero-shot and 0.591 in the few-shot setting).
ChatGPT matches or exceeds fine-tuned models for certain entities in one-shot settings.
Prompt engineering can enhance ChatGPT's entity extraction performance with minimal labeled data.
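The F1 scores above combine precision and recall over extracted entities. The sketch below shows one common way to compute them, assuming exact-match scoring in which a prediction counts only if its span and type both match a gold annotation. The matching criterion, the `entity_f1` helper, and the sample annotations are assumptions for illustration and may differ from the paper's evaluation protocol.

```python
from typing import Set, Tuple

# An entity is (start offset, end offset, entity type).
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-type matching."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: two of three predictions match the gold set.
gold = {(0, 5, "sign"), (10, 20, "rare disease"), (30, 35, "symptom")}
pred = {(0, 5, "sign"), (10, 20, "rare disease"), (40, 45, "sign")}
precision, recall, f1 = entity_f1(gold, pred)
```

Here two true positives out of three predictions and three gold entities give precision, recall, and F1 of 2/3 each.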
Abstract
Rare diseases (RDs) are collectively common and affect 300 million people worldwide. Accurate phenotyping is critical for informing diagnosis and treatment, but RD phenotypes are often embedded in unstructured text and time-consuming to extract manually. While natural language processing (NLP) models can perform named entity recognition (NER) to automate extraction, a major bottleneck is the development of a large, annotated corpus for model training. Recently, prompt learning emerged as an NLP paradigm that can lead to more generalizable results without any (zero-shot) or few labeled samples (few-shot). Despite growing interest in ChatGPT, a revolutionary large language model capable of following complex human prompts and generating high-quality responses, none have studied its NER performance for RDs in the zero- and few-shot settings. To this end, we engineered novel prompts aimed at…
Taxonomy
Topics: Genomics and Rare Diseases · Topic Modeling · Biomedical Text Mining and Ontologies
Methods: None
