# Automating clinical phenotyping using natural language processing

**Authors:** Linea Schmidt, Susanne Ibing, Florian Borchert, Julian Hugo, Allison A. Marshall, Jellyana Peraza, Judy H. Cho, Erwin P. Böttinger, Bernhard Y. Renard, Ryan C. Ungaro

PMC · DOI: 10.1038/s43856-025-01337-0 · 2026-01-14

## TL;DR

This study compares rule-based NLP and GPT-4 for extracting Crohn’s disease features from clinical notes, showing high accuracy and potential to automate chart reviews.

## Contribution

To the authors’ knowledge, this is the first study to explore LLM-based phenotyping of Crohn’s disease sub-phenotypes, using sentence-level datasets and a direct comparison with rule-based methods.

## Key findings

- GPT-4 achieved F1 scores of at least 0.90 for disease behavior and 0.82 for age at diagnosis at the note level.
- Combining rule-based and LLM approaches improved precision and enabled prioritization of chart reviews.
- Performance was comparable to that of human experts, with no statistical evidence of a difference.
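
As a rough illustration of the second finding, the sketch below combines a rule-based call and an LLM call for each note. The summary does not state the exact combination rule, so the AND-combination here (positive only when both methods agree) is an assumption, as are the names `NoteCall`, `combine`, and `review_queue`. Requiring agreement trades recall for precision, and notes where the two methods disagree are natural candidates for manual chart review.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NoteCall:
    """Hypothetical container for the two per-note predictions."""
    note_id: str
    rule_positive: bool  # output of the rule-based method
    llm_positive: bool   # output of the LLM-based method


def combine(call: NoteCall) -> Tuple[bool, bool]:
    """Return (ensemble_label, needs_review) for one note.

    Assumed rule: the ensemble is positive only if both methods are
    positive; a disagreement marks the note for manual review.
    """
    ensemble_label = call.rule_positive and call.llm_positive
    needs_review = call.rule_positive != call.llm_positive
    return ensemble_label, needs_review


def review_queue(calls: List[NoteCall]) -> List[str]:
    """List the IDs of notes on which the two methods disagree."""
    return [c.note_id for c in calls if combine(c)[1]]
```

Under this assumed rule, a reviewer would read only the notes returned by `review_queue` instead of every chart.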

## Abstract

Real-world studies based on electronic health records often require manual chart review to derive patients’ clinical phenotypes, a labor-intensive task with limited scalability. Here, we developed and compared computable phenotyping based on rules using the spaCy framework and a Large Language Model (LLM), GPT-4, for sub-phenotyping of patients with Crohn’s disease, considering age at diagnosis and disease behavior.
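
To make the idea of rule-based phenotyping concrete, here is a minimal sentence-level sketch. It is not the authors' spaCy pipeline: it swaps in plain regular expressions from the Python standard library, and the keyword lists, the negation cues, and the function name `label_sentence` are illustrative assumptions rather than validated clinical rules.

```python
import re
from typing import Set

# Illustrative keyword patterns (assumptions, not the study's rules).
FEATURE_PATTERNS = {
    "stricture": re.compile(r"\b(?:strictures?|narrowing)\b", re.IGNORECASE),
    "fistula": re.compile(r"\bfistula(?:s|e)?\b", re.IGNORECASE),
    "perianal": re.compile(r"\bperianal\b", re.IGNORECASE),
}

# A negation cue counts only if no '.' or ';' separates it from the mention.
NEGATION = re.compile(r"\b(?:no|without|negative for|denies)\b[^.;]*$", re.IGNORECASE)


def label_sentence(sentence: str) -> Set[str]:
    """Return the features mentioned in a sentence and not negated."""
    labels = set()
    for label, pattern in FEATURE_PATTERNS.items():
        for match in pattern.finditer(sentence):
            preceding = sentence[: match.start()]
            if not NEGATION.search(preceding):
                labels.add(label)
    return labels
```

For example, `label_sentence("No evidence of fistula.")` returns an empty set because the mention follows a negation cue. A production system would need far richer handling of negation, uncertainty, and history than this sketch provides.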

For the rule-based approach, we used the spaCy framework; for the LLM-based approach, we used GPT-4. The underlying data included 49,572 clinical notes and 2,204 radiology reports from 584 patients with Crohn’s disease. A test set of 280 clinical texts was labeled at the sentence level, in addition to patient-level ground-truth data. The algorithms were evaluated on recall, precision, specificity, and F1 score.
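
The four evaluation metrics named above follow from the standard confusion-matrix definitions. The sketch below computes them for binary labels (1 = feature present, 0 = absent). The function name `binary_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration, not details taken from the paper.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute recall, precision, specificity, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    def safe_div(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    return {
        "recall": safe_div(tp, tp + fn),
        "precision": safe_div(tp, tp + fp),
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }
```

F1 is the harmonic mean of precision and recall, so it does not reward true negatives; this is why specificity is reported alongside it.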

Overall, we observe similar or better performance with GPT-4 than with the rules. At the note level, the F1 score is at least 0.90 for disease behavior and 0.82 for age at diagnosis; at the patient level, it is at least 0.66 for disease behavior and 0.71 for age at diagnosis.

To our knowledge, this is the first study to explore computable phenotyping algorithms based on clinical narrative text for these complex tasks, where prior inter-annotator agreements ranged from 0.54 to 0.98. We found no statistical evidence of a difference from the performance of human experts on this task. Our findings underline the potential of LLMs for computable phenotyping and may support large-scale cohort analyses from electronic health records and streamline chart review processes in the future.

Doctors and researchers often need to group patients by specific medical features (called “phenotypes”) to study a disease and improve care. Much of this information is in free-text clinical notes rather than in tabular data. For example, for patients with Crohn’s disease, the free-text notes can include important details describing the disease course over time, such as bowel narrowings (strictures), abnormal openings (fistulas), problems with the area around the anus, and age at diagnosis. Prior studies show that relying only on structured data, such as codes used to describe particular diseases, often misses these types of details. Reading notes by hand is more accurate but slow and costly. Natural language processing (NLP) is a computational method to automatically read and extract this information from clinical text. We built two NLP approaches and created new sentence-level datasets to test them. Both approaches found complications well, and combining them gave the most balanced results. This method could save time in research and help clinicians flag people who may need extra care.

Schmidt et al. develop sentence-level datasets for Crohn’s phenotypes and compare rule-based NLP with GPT-4 to extract disease behavior and age at diagnosis from EHR notes. Both methods achieve high recall on notes; GPT-4 perfectly identifies age at diagnosis and simple ensembles improve precision and enable chart-review prioritization.

## Linked entities

- **Diseases:** Crohn's disease (MONDO:0005011)

## Full-text entities

- **Diseases:** LLM (MESH:D007806), digestive diseases (MESH:D004066), Strictures (MESH:D003251), fistulas (MESH:D005402), infection (MESH:D007239), carotid stenosis (MESH:D016893), CD (MESH:D003424), IBD (MESH:D015212), disease (MESH:D004194), abscess (MESH:D000038), inflammation (MESH:D007249), Perianal disease (MESH:D000694)
- **Chemicals:** GPT-4 (-), luminal (MESH:D010634)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12873203/full.md

---
Source: https://tomesphere.com/paper/PMC12873203