# Identification and validation of respiratory virus immunization using natural language processing

**Authors:** Kevin A. Wilson, John J. Riddles, Andrew C. Hill, Elizabeth A. Bassett, Mengshi Zhou, Michelle Barron, Catia Chavez, Rahul Shrivastava, Anil Battalahalli, Daniel Chacreton, Ethan Moran, Elizabeth Rowley, Zachary A. Weber, Lawrence Reichle, Sarah W. Ball, Amanda B. Payne, Jennifer DeCuir, Ruth Link-Gelles, Toan C. Ong

PMC · DOI: 10.3389/fdgth.2026.1733630 · 2026-02-02

## TL;DR

The paper introduces a rule-based NLP algorithm to detect COVID-19, influenza, and RSV immunizations in electronic health record text. Against manual review, recall was 92–97% and precision 57–91%; against structured data, precision was 71–78% but recall only 9–12%.

## Contribution

A novel rule-based NLP algorithm was developed and validated for identifying respiratory virus immunizations in unstructured EHR text.

## Key findings

- The algorithm achieved high recall (97% for COVID-19) when compared to manual review but low recall (9% for COVID-19) when compared to structured data.
- Against manual review, influenza reached 91% precision and 92% recall, while RSV reached 96% recall but only 57% precision; against structured data, recall for both was low (12% for influenza, 10% for RSV).
- The algorithm can augment structured immunization records by extracting data from narrative EHR text.

## Abstract

Electronic health record (EHR)-based research often relies on structured data elements, such as ICD-10-CM and CPT codes, to identify clinical diagnoses and procedures. However, some information, such as the administration of immunizations, may be captured more reliably in the text-based narrative sections of the patient's record. We developed a rule-based natural language processing (NLP) algorithm to identify the administration of immunizations for COVID-19, influenza, and RSV using a combination of synthetic and publicly available data.

After applying standard NLP processing techniques to clean and standardize the text, we implemented a multi-stage, rule-based algorithm. We applied a dictionary of general keywords to identify potential immunizations, and a set of specific keywords, which leveraged grammatical dependencies in the text, to increase accuracy. We implemented additional rules to account for negation and immunization recommendations. The algorithm was applied to a sample of 20,000 patients from the study population. We measured performance by conducting a manual review of 400 individual notes and assessing concurrence with structured data, using precision and recall as evaluation metrics.
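The staged rules described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every keyword list is a hypothetical example, and sentence-level co-occurrence stands in for the grammatical-dependency matching the paper reports, which this summary does not detail.

```python
import re

# All keyword lists below are hypothetical examples, not the paper's dictionaries.
GENERAL = re.compile(r"\b(?:vaccin\w*|immuniz\w*|shot|booster)\b")
SPECIFIC = {
    "covid-19": re.compile(r"\b(?:covid(?:-19)?|sars-cov-2)\b"),
    "influenza": re.compile(r"\b(?:influenza|flu)\b"),
    "rsv": re.compile(r"\b(?:rsv|respiratory syncytial virus)\b"),
}
NEGATION = re.compile(r"\b(?:no|not|never|denies|declined|refused)\b")
RECOMMENDATION = re.compile(r"\b(?:recommend\w*|advised?|should|eligible|offered)\b")


def normalize(text: str) -> str:
    """Lowercase the note and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def detect_immunizations(note: str) -> set[str]:
    """Return the vaccines a note appears to record as administered.

    Simplification: a vaccine counts only if it shares a sentence with a
    general immunization keyword and that sentence contains no negation
    or recommendation cue.
    """
    found = set()
    for sentence in re.split(r"[.!?;]+", normalize(note)):
        # Stage 1: general keywords flag a potential immunization.
        if not GENERAL.search(sentence):
            continue
        # Stage 3: drop negated mentions and recommendations.
        if NEGATION.search(sentence) or RECOMMENDATION.search(sentence):
            continue
        # Stage 2: specific keywords decide which vaccine is meant.
        for vaccine, pattern in SPECIFIC.items():
            if pattern.search(sentence):
                found.add(vaccine)
    return found
```

Under these assumptions, `detect_immunizations("Patient received flu shot today. Declined COVID-19 vaccine.")` keeps the influenza mention and discards the declined COVID-19 one. Sentence-level filtering is deliberately coarse; a dependency parse would let the negation cue attach to a specific vaccine rather than the whole sentence.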

In the first evaluation, which compared the algorithm's output with manual clinical review of an independent test dataset, precision was 71% and recall was 97% for COVID-19 immunization; 91% and 92% for influenza; and 57% and 96% for RSV. In a second evaluation using structured data (i.e., ICD-10-CM, CPT, and CVX codes) as the gold standard, precision was 72% and recall was 9% for COVID-19 immunization; 71% and 12% for influenza; and 78% and 10% for RSV.
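To make the two metrics concrete, here is a small sketch of how precision and recall follow from the overlap between algorithm-flagged immunizations and a reference set. The identifiers and counts are invented toy values, not study data; they only illustrate how precision can stay high while recall collapses when the reference set is much larger than what appears in the text.

```python
def precision_recall(predicted: set, reference: set) -> tuple[float, float]:
    """Compute (precision, recall) of predicted items against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy data (hypothetical): the algorithm flags 4 immunizations from text, 3 of them correct.
nlp_flagged = {"p1", "p2", "p3", "p4"}

# The reference holds those 3 plus 27 immunizations never mentioned in any note.
structured = {"p1", "p2", "p3"} | {f"s{i}" for i in range(27)}

precision, recall = precision_recall(nlp_flagged, structured)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of four flagged items are correct, so precision is 0.75, while only three of thirty reference items are found, so recall is 0.10.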

We demonstrated the effectiveness of NLP methods in identifying immunizations from EHR. High precision and recall for COVID-19 and influenza immunizations suggest that the algorithm can effectively identify immunization references when they are present in the text; however, low recall when compared to the structured data suggests that there are many more immunizations in the structured data not present in the text. Thus, the algorithm has specialized utility for augmenting immunization records using text data from individual notes; however, the algorithm's extensibility and generalizability can serve as a framework for future EHR-based research.

## Linked entities

- **Diseases:** COVID-19 (MONDO:0100096), influenza (MONDO:0005812)

## Full-text entities

- **Diseases:** COVID-19 (MESH:D000086382), Influenza (MESH:D007251)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12908168/full.md

---
Source: https://tomesphere.com/paper/PMC12908168