# Extracting TNFi switching reasons and trajectories from real-world data using large language models

**Authors:** Brenda Y Miao, Marie Binvignat, Augusto Garcia-Agundez, Maxim Bravo, Christopher Yk Williams, Claire Q Miao, Ahmed Alaa, Vivek Rudrapatna, Atul J Butte, Gabriela Schmajuk, Jinoos Yazdany

PMC · DOI: 10.1093/jamiaopen/ooaf132 · JAMIA Open · 2025-11-11

## TL;DR

This study shows that large language models such as GPT-4 can extract tumor necrosis factor inhibitor (TNFi) switching patterns and the reasons for switching from real-world clinical notes, with micro-F1 scores of 0.75 to 0.83.

## Contribution

Demonstrates that LLMs, including locally deployable open-source models, can automate chart review for TNFi switching patterns and reasons in real-world EHR data from a single academic medical center.

## Key findings

- GPT-4 achieved micro-F1 scores of 0.75 (stopped drug), 0.80 (started drug), and 0.83 (reason for switching) when extracting TNFi switches from clinical notes.
- Lack of efficacy was the most common reason for switching TNFi treatments (56.9%), followed by adverse events (13.5%) and insurance/cost (10.8%).
- Among the eight open-source models benchmarked, Starling-7B-beta and Llama-3-8B performed most competitively.

## Abstract

This study evaluated whether large language models (LLMs) can automate chart review to identify tumor necrosis factor inhibitor (TNFi) switching patterns and reasons for switching in a large real-world cohort.

We conducted an observational study using de-identified electronic health record (EHR) data from 2012 to 2023 at a single academic medical center (University of California, San Francisco). TNFi medication orders and linked clinical notes were extracted, requiring at least 6 months of follow-up to identify treatment switches, defined as a change from one TNFi to another at consecutive encounters. Using GPT-4, we extracted which TNFi was stopped and started and classified the reason for switching. Performance was benchmarked against eight open-source LLMs, structured EHR data, and expert annotation.
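The switch definition above (a change from one TNFi to another at consecutive encounters, with at least 6 months of follow-up) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Order` record, the `find_switches` function, and the 183-day threshold measured from first to last order are all hypothetical choices made for illustration, since the abstract does not specify how follow-up was computed.

```python
from dataclasses import dataclass
from datetime import date
from itertools import groupby

# Assumption: "6 months" is approximated as 183 days between a patient's
# first and last TNFi order. The paper's exact rule is not stated here.
MIN_FOLLOWUP_DAYS = 183


@dataclass(frozen=True)
class Order:
    """Hypothetical TNFi medication order record."""
    patient_id: str
    encounter_date: date
    drug: str


def find_switches(orders):
    """Return (patient_id, stopped_drug, started_drug, switch_date) tuples.

    A switch is recorded when two consecutive orders for the same patient
    name different drugs. Patients with too little follow-up are skipped.
    """
    switches = []
    ordered = sorted(orders, key=lambda o: (o.patient_id, o.encounter_date))
    for patient_id, group in groupby(ordered, key=lambda o: o.patient_id):
        history = list(group)
        followup = (history[-1].encounter_date - history[0].encounter_date).days
        if followup < MIN_FOLLOWUP_DAYS:
            continue
        # Compare each order with the one immediately before it.
        for prev, curr in zip(history, history[1:]):
            if prev.drug != curr.drug:
                switches.append(
                    (patient_id, prev.drug, curr.drug, curr.encounter_date)
                )
    return switches


example_orders = [
    Order("A", date(2020, 1, 1), "adalimumab"),
    Order("A", date(2020, 3, 1), "adalimumab"),
    Order("A", date(2020, 9, 1), "etanercept"),
    Order("B", date(2021, 1, 1), "infliximab"),
    Order("B", date(2021, 2, 1), "adalimumab"),  # under 183 days of follow-up
]
example_switches = find_switches(example_orders)
```

In this toy input, patient A contributes one switch (adalimumab to etanercept), while patient B is excluded for insufficient follow-up. Structured detection of this kind only says that a switch occurred; the reason for it is what the study extracts from the linked clinical notes.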

A total of 9187 patients (mean [SD] age, 39.9 [19.0] years; 57.1% female) received ≥1 TNFi with sufficient follow-up. We identified 3104 TNFi switches among 2112 patients. GPT-4 achieved micro-F1 scores of 0.75 (stopped drug), 0.80 (started drug), and 0.83 (reason). Among open-source models, Starling-7B-beta and Llama-3-8B performed most competitively. The most common reason identified by GPT-4 was lack of efficacy (56.9%), followed by adverse events (13.5%) and insurance/cost (10.8%).
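For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives across all classes before computing precision and recall. The sketch below assumes each note may carry a set of labels (for example, more than one reason for switching); whether the paper scored the task as multi-label is an assumption, and the `micro_f1` function and the toy labels are illustrative only, not data from the study.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-example label sets.

    gold and pred are equal-length sequences of iterables of labels.
    Counts are pooled across all examples and classes before computing
    precision and recall.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, pred):
        g, p = set(gold_labels), set(pred_labels)
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but absent from gold
        fn += len(g - p)   # gold labels the prediction missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy annotations (not from the paper).
gold_reasons = [
    {"lack of efficacy"},
    {"adverse event"},
    {"insurance/cost"},
    {"lack of efficacy", "adverse event"},
]
pred_reasons = [
    {"lack of efficacy"},
    {"lack of efficacy"},
    {"insurance/cost"},
    {"lack of efficacy"},
]
score = micro_f1(gold_reasons, pred_reasons)
```

Here the pooled counts are 3 true positives, 1 false positive, and 2 false negatives, giving precision 0.75, recall 0.60, and micro-F1 of 2/3. When every example has exactly one gold label and one predicted label, micro-F1 reduces to plain accuracy.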

Both GPT-4 and locally deployable LLMs effectively extracted complex treatment trajectories and rationale from clinical notes, supporting their broader utility in scalable EHR review and real-world evidence generation.

## Full-text entities

- **Chemicals:** GPT-4 (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12605798/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12605798/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/PMC12605798/full.md

---
Source: https://tomesphere.com/paper/PMC12605798