# P-1966. Classification of Injection Drug Use by a Large Language Model Using Hospital Admission Notes

**Authors:** Edward C Traver, Seyed M Shams, Ishan Kumar Vaish, Jasmine Stevens, Meghan Derenoncourt, Hannah E Flores, Elana S Rosenthal, Sarah Kattakuzhy

PMC · DOI: 10.1093/ofid/ofaf695.2133 · Open Forum Infectious Diseases · 2026-01-11

## TL;DR

This pilot study tests whether a large language model can identify people who inject drugs from hospital admission notes, so that they can be linked to clinical interventions.

## Contribution

Demonstrates zero-shot classification of injection drug use from hospital admission notes using a large language model (LLaMA 3.3, 70B parameters), evaluated against human chart review.

## Key findings

- On a balanced sample of 100 encounters, the LLM achieved a sensitivity of 0.68 (95% CI 0.54-0.79) and a specificity of 0.80 (95% CI 0.67-0.89) in classifying injection drug use.
- Estimated positive predictive value fell sharply as injection drug use prevalence decreased: 0.77 at 50%, 0.27 at 10%, and 0.03 at 1%.
- The authors plan to seek better performance by refining the prompt, evaluating other LLMs, and examining additional data such as ID consultation notes.

## Abstract

People who inject drugs (PWID) are at higher risk for severe bacterial infectious diseases (ID), which drive expensive hospitalizations. Identification of PWID allows for linkage to clinical interventions, such as multidisciplinary ID-addiction treatment teams, which improve clinical outcomes. Yet injection drug use (IDU) is often captured only in the text of clinical notes and is not easily queried. We sought to demonstrate text-based IDU classification by a large language model (LLM), a type of artificial intelligence.

Figure 1. Workflow of the Classification and Labeling Procedures and Full Text of the Prompt. IDU, injection drug use; LLM, large language model; LLaMA 3.3 is the LLM used.

Figure 2. Confusion matrix of LLM labeling performance compared to human classification (treated as the ground truth). IDU, injection drug use; LLM, large language model.

Hospital encounters at an academic medical center between 2018 and 2022 were included if they featured ICD codes for both acute infections and opioid use. Encounters were reviewed by trained research assistants and classified as “IDU” or “non-IDU” based on clinical notes. A balanced sample of 100 encounters was selected randomly for the LLM classification. The hospital admission note was extracted from the electronic medical record (Epic). A zero-shot prompt instructed the LLM (LLaMA 3.3; Meta, 70B parameters) to label each encounter as “IDU” or “non-IDU” (Figure 1). LLM labels were compared to human classifications. Positive and negative predictive values (PPV, NPV) were estimated for varying IDU prevalence. 95% confidence intervals were estimated with the Wilson-Brown method.
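The interval calculation named in the methods can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "Wilson-Brown" refers to the standard Wilson score interval for a binomial proportion, and the function name `wilson_interval` is hypothetical. The counts passed in are the true-positive and true-negative counts reported in the results.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: this is the method the abstract calls "Wilson-Brown".
    z defaults to the two-sided 95% normal quantile.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Sensitivity: 34 of 50 IDU encounters labeled correctly.
sens_lo, sens_hi = wilson_interval(34, 50)
# Specificity: 40 of 50 non-IDU encounters labeled correctly.
spec_lo, spec_hi = wilson_interval(40, 50)

print(f"sensitivity 95% CI: {sens_lo:.2f}-{sens_hi:.2f}")
print(f"specificity 95% CI: {spec_lo:.2f}-{spec_hi:.2f}")
```

Under that assumption, the rounded bounds agree with the intervals reported in the abstract (0.54-0.79 and 0.67-0.89).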

Figure 3. LLM labeling compared to human classification. Error bars, 95% CI.

Figure 4. Estimated PPV and NPV of LLM labeling compared to classification in theoretical cohorts of varying IDU prevalence. IDU, injection drug use; LLM, large language model; NPV, negative predictive value; PPV, positive predictive value. Error bars, 95% CI.

Of the 50 IDU and 50 non-IDU encounters, the LLM labeling yielded 34 true positives, 16 false negatives, 40 true negatives, and 10 false positives (Figure 2). Sensitivity was 0.68 (95% CI 0.54-0.79); specificity 0.80 (95% CI 0.67-0.89; Figure 3). Accuracy of the LLM label was 0.74; F1-score 0.72. Estimates of PPV with IDU prevalence of 50%, 10%, and 1% were 0.77, 0.27, and 0.03; estimates of NPV were 0.71, 0.96, and >0.99 (Figure 4).
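The arithmetic behind these figures can be checked with a short sketch. The function names (`summary_metrics`, `predictive_values`) are illustrative, not from the paper, and the PPV and NPV formulas are the usual Bayes' rule adjustment of sensitivity and specificity for an assumed prevalence, which is presumably how the theoretical cohorts were evaluated.

```python
def summary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Sensitivity, specificity, accuracy, and F1 from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def predictive_values(sens: float, spec: float, prevalence: float) -> tuple[float, float]:
    """PPV and NPV at a given prevalence, via Bayes' rule."""
    ppv = sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = spec * (1 - prevalence) / (spec * (1 - prevalence) + (1 - sens) * prevalence)
    return ppv, npv


# Counts reported in the abstract (Figure 2).
m = summary_metrics(tp=34, fn=16, tn=40, fp=10)
print({name: round(value, 2) for name, value in m.items()})

# Prevalence values used in the abstract: 50%, 10%, and 1%.
for prevalence in (0.50, 0.10, 0.01):
    ppv, npv = predictive_values(m["sensitivity"], m["specificity"], prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.3f}")
```

At 1% prevalence, false positives from the 99% of encounters without IDU far outnumber true positives, which is why PPV collapses to about 0.03 even though specificity is 0.80.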

In this small pilot study, an LLM demonstrated moderate performance on identifying PWID. The performance would likely limit usability in screening cohorts with real-world prevalence of IDU (1-10%). Future work will seek improved performance by refining the LLM prompt, evaluating other LLMs, and examining additional data (eg, ID consultation notes). Additional validation is needed with larger, distinct datasets. LLMs hold promise to identify hospitalized PWID to improve health outcomes.

All Authors: No reported disclosures

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12793114/full.md

---
Source: https://tomesphere.com/paper/PMC12793114