# Sensitivity and Specificity of Natural Language Processing Systems for Identification of Hospitalized People Who Use Drugs

**Authors:** Leah Benrubi, Taisuke Sato, Leo K Westgard, Kyle Zollo-Venecek, Brindet Socrates, Benjamin Sweigart, Jessica P Ridgway, Joji Suzuki, Yoelkys Morales, David Goodman-Meza, Alysse G Wurcel

PMC · DOI: 10.1093/ofid/ofaf370 · Open Forum Infectious Diseases · 2025-06-23

## TL;DR

This study shows that natural language processing (NLP) identifies more hospitalizations of people who use drugs (PWUD) than ICD-10 codes alone, but the gain in sensitivity comes at the cost of lower specificity (more false positives).

## Contribution

The study evaluates two NLP tools (Regular Expression and Open Health NLP Toolkit) as a supplement to ICD-10 codes for identifying hospitalizations of people who use drugs, using manual chart review as the gold standard.

## Key findings

- ICD-10 codes alone had low sensitivity (43%) but high specificity (99%) for identifying PWUD hospitalizations.
- Adding NLP increased sensitivity to as much as 94%, but specificity fell to 46%.
- The most practical model (Regular Expression or ICD-10 codes) achieved 74% sensitivity and 87% specificity.

## Abstract

People who use drugs (PWUD) often lack access to optimal harm reduction and substance use disorder treatment tools. Tracking the epidemiology of acute care utilization by PWUD is crucial to improving systems of care. Chart reviews and International Classification of Diseases (ICD) codes are the most common systems of identifying hospitalizations of PWUD but are limited by high labor costs and inaccuracy. This study evaluates whether natural language processing (NLP) enhances the sensitivity and specificity of ICD-10 codes in identifying hospitalizations of PWUD.

We analyzed admissions at Tufts Medical Center between 2018 and 2023. Two NLP tools (Regular Expression and Open Health NLP Toolkit) were developed to identify PWUD and were compared with ICD-10 algorithms. The NLP and ICD-10 algorithms were applied to all admissions, and demographic and hospitalization-related data were extracted. The research team manually reviewed notes written during 790 hospitalizations of PWUD as the gold standard. We calculated sensitivity, specificity, and net reclassification indices.
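As a rough illustration of how a regular-expression tool can flag an admission, the sketch below marks a note as positive when it matches a keyword pattern, then classifies the admission either by the share of its notes that were flagged or by an "NLP or ICD-10" rule. The patterns, function names, and the exact way the threshold is applied are assumptions made for illustration; this summary does not give the study's actual expressions or pipeline.

```python
import re

# Hypothetical keyword patterns, chosen only for illustration.
# The study's real expressions are not listed in this summary.
DRUG_USE_PATTERN = re.compile(
    r"\b(injection drug use|ivdu|pwid|heroin|fentanyl|opioid use disorder)\b",
    re.IGNORECASE,
)


def note_is_flagged(note_text: str) -> bool:
    """Return True if the note contains any of the illustrative keywords."""
    # A plain keyword match ignores negation ("denies heroin use"),
    # which is one way a regex tool can produce false positives.
    return DRUG_USE_PATTERN.search(note_text) is not None


def fraction_flagged(notes: list[str]) -> float:
    """Share of an admission's notes that were flagged (0.0 if no notes)."""
    if not notes:
        return 0.0
    return sum(note_is_flagged(n) for n in notes) / len(notes)


def nlp_positive(notes: list[str], min_fraction: float = 0.0) -> bool:
    """Admission is NLP-positive if at least one note is flagged and the
    flagged share meets the (assumed) threshold, e.g. 0.5 for >=50%."""
    share = fraction_flagged(notes)
    return share > 0 and share >= min_fraction


def regex_or_icd(notes: list[str], has_icd10_code: bool) -> bool:
    """Combined rule: positive if either the regex tool or ICD-10 flags it."""
    return has_icd10_code or nlp_positive(notes)
```

Raising `min_fraction` makes the rule stricter, which is the kind of adjustment that trades sensitivity for specificity.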

ICD-10 codes alone demonstrated low sensitivity (43%) but high specificity (99%). Adding NLP systems improved sensitivity up to 94%, though specificity decreased to 46%. Threshold adjustments (eg, notes flagged ≥50%) revealed a trade-off between sensitivity (47%) and specificity (96%). The most practical model—Regular Expression or ICD-10 codes—resulted in a sensitivity of 74% and specificity of 87%.
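The trade-off reported above follows from how the two metrics are defined: sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP), both measured against the manual chart review. The sketch below computes them and shows why an "either tool flags it" rule can only hold or raise sensitivity while holding or lowering specificity. The labels are a made-up toy example, not study data, and the function name is hypothetical.

```python
def sensitivity_specificity(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Compute (sensitivity, specificity) of predictions against a gold standard.

    Assumes the gold standard contains at least one positive and one negative.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels for eight admissions (illustrative only, NOT study data).
gold = [True, True, True, True, False, False, False, False]
icd = [True, False, False, False, False, False, False, False]
nlp = [True, True, True, False, True, False, False, False]

# "Either tool flags it" rule: element-wise OR of the two classifiers.
combined = [i or n for i, n in zip(icd, nlp)]

icd_sens, icd_spec = sensitivity_specificity(icd, gold)
comb_sens, comb_spec = sensitivity_specificity(combined, gold)

print(f"ICD-10 only: sensitivity={icd_sens:.2f}, specificity={icd_spec:.2f}")
print(f"ICD-10 or NLP: sensitivity={comb_sens:.2f}, specificity={comb_spec:.2f}")
```

Because the OR rule never removes a positive call, every true positive found by ICD-10 codes is kept while any false positive from the NLP tool is added, which matches the direction of the changes in the reported results.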

NLP is an innovative tool that can create functional, cost-effective, and accurate systems of identifying hospitalized PWUD. These findings support further development of NLP technologies to improve health care equity for PWUD.

This study investigates the use of natural language processing (NLP) in identifying the hospitalization encounters of people who use drugs. When compared with ICD-10 codes alone, NLP improved sensitivity but decreased specificity. NLP is a promising avenue for improving health care for hospitalized people who use drugs.

## Full-text entities

- **Diseases:** substance use disorder (MESH:D019966)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12264426/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12264426/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/PMC12264426/full.md

---
Source: https://tomesphere.com/paper/PMC12264426