# Identifying Early Signals From Emerging Public Health Events Using Natural Language Processing

**Authors:** Kelly S. Peterson, Christian Dalton, Andrea Kalvesmaki, JoAnn Vuong, Colton Gordon, Senthil Nachimuthu, Mary Jo Pugh, Makoto M. Jones

PMC · DOI: 10.1155/ipid/6176855 · Interdisciplinary Perspectives on Infectious Diseases · 2026-03-06

## TL;DR

This paper explores using natural language processing (NLP) to detect early signs of emerging public health threats by extracting three nonspecific early signals from clinical documents.

## Contribution

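The study uses rule-based and transformer models to extract early signals (public health authority communication, zoonotic exposure mentions, and other pathogen exposure mentions) from U.S. Department of Veterans Affairs emergency department documents.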
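To make the idea of a rule-based exposure extractor concrete, the following is a minimal sketch, not the authors' implementation. The term list, the negation cues, the three-token lookback window, and the function name `find_zoonotic_mentions` are all assumptions chosen for illustration; the summary does not describe the actual rules used.

```python
import re

# Hypothetical term lists for illustration only; not taken from the paper.
ZOONOTIC_TERMS = ["tick", "dog", "cat", "cattle", "rabbit", "bat"]
NEGATION_CUES = ["no", "denies", "denied", "without"]


def find_zoonotic_mentions(note: str) -> list[str]:
    """Return zoonotic terms mentioned in a note, skipping simply negated ones."""
    found = []
    # Split into rough sentences so negation cues do not leak across sentences.
    for sentence in re.split(r"(?<=[.!?])\s+", note.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, tok in enumerate(tokens):
            # Accept the singular form or a simple plural ending in "s".
            base = tok[:-1] if tok.endswith("s") and tok[:-1] in ZOONOTIC_TERMS else tok
            if base not in ZOONOTIC_TERMS:
                continue
            # Look back up to three tokens for a negation cue.
            window = tokens[max(0, i - 3):i]
            if any(cue in window for cue in NEGATION_CUES):
                continue
            found.append(base)
    return found
```

For a note such as "Patient reports a tick bite last week. Denies contact with dogs.", this sketch keeps the tick mention and drops the negated dog mention. A real pipeline would need far richer rules or a trained model to handle context such as family history, hypotheticals, and abbreviations.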

## Key findings

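- Positive predictive values for the three early signals ranged from 0.615 to 1.0 against manual text review, which the authors consider acceptable for initial feasibility exploration.
- Target concepts were extracted from over 33 million emergency department visits, and the distributions of extracted exposures generally matched expectations for the identified pathogen.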
- Automated NLP methods proved feasible for scaling biosurveillance efforts.
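
For readers unfamiliar with the metric reported above, positive predictive value (PPV) is the share of cases flagged by the system that a reviewer confirms as correct. The sketch below shows the calculation; the function name and the counts in the usage note are hypothetical and are not figures from the paper.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = confirmed flagged cases / all flagged cases."""
    if true_positives < 0 or false_positives < 0:
        raise ValueError("counts must be non-negative")
    flagged = true_positives + false_positives
    if flagged == 0:
        # PPV is undefined when nothing was flagged.
        raise ValueError("no flagged cases to evaluate")
    return true_positives / flagged
```

As an invented example, if reviewers confirmed 45 of 50 flagged documents, `positive_predictive_value(45, 5)` gives 0.9. PPV says nothing about cases the system missed, so it does not measure sensitivity.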

## Abstract

Timely detection of emerging public health threats is challenging because the surveillance infrastructure is not yet tuned to the emerging threat. We attempt to identify three nonspecific early signals that might be common across emerging events: public health authority communication, zoonotic exposure mentions, and other pathogen exposure mentions.

Data from U.S. Department of Veterans Affairs emergency department visits between 2004 and 2024 were used to construct training and validation sets from reportable or emerging infectious diseases identified by historical diagnoses and laboratories. Not all early signal types were extracted using the same method. Rule-based and transformer models were applied in a way that minimized developer and chart reviewer time. We then extracted cases from historic documents among selected diseases.

Positive predictive values for public health authority communication, zoonotic exposure, and other pathogen exposure ranged from 0.615 to 1.0. Target concepts were extracted from over 33 million emergency department visits. Distributions of extracted exposures generally matched expectations for the identified pathogen.

Automated natural language processing methods allow surveillance scaling to large amounts of clinical documents to identify relevant cases. Initial validation compared to manual text review shows that accuracy is acceptable for initial feasibility exploration in biosurveillance efforts.

## Full-text entities

- **Diseases:** fever (MESH:D005334), gastrointestinal distress (MESH:D012128), dengue (MESH:D003715), infection by Leptospira (MESH:D007922), Q fever (MESH:D011778), NND (MESH:D004194), Zika (MESH:D000071243), infectious disease (MESH:D003141), tularemia (MESH:D014406), COVID-19 (MESH:D000086382), rabies (MESH:D011818), infection (MESH:D007239)
- **Chemicals:** Water (MESH:D014867)
- **Species:** Canis lupus familiaris (dog, subspecies) [taxon 9615], H5N1 subtype (serotype) [taxon 102793], Oryctolagus cuniculus (domestic rabbit, species) [taxon 9986], Ixodida (ticks, order) [taxon 6935], Bos taurus (bovine, species) [taxon 9913], Felis catus (cat, species) [taxon 9685], Homo sapiens (human, species) [taxon 9606], Giraffa camelopardalis (giraffe, species) [taxon 9894], Ostreidae (oysters, family) [taxon 6563], Cercopithecidae (monkey, family) [taxon 9527]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12964168/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12964168/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/PMC12964168/full.md

---
Source: https://tomesphere.com/paper/PMC12964168