# Computable phenotypes to identify respiratory viral infections in the All of Us research program

**Authors:** Bennett J. Waxse, Fausto Andres Bustos Carrillo, Tam C. Tran, Huan Mo, Emily E. Ricotta, Joshua C. Denny

PMC · DOI: 10.1038/s41598-025-02183-9 · Scientific Reports · 2025-05-28

## TL;DR

The paper develops computable phenotypes that identify respiratory viral infections from EHR data in the All of Us Research Program, enabling large-scale studies of host genetics, health disparities, and clinical outcomes.

## Contribution

An integrated approach that combines virus-specific ICD codes, prescriptions, and laboratory results within 90-day episodes to identify respiratory viral infections in EHR data.

## Key findings

- Integrated phenotypes identified infections more effectively than individual data components.
- Identified infections showed the expected seasonal patterns, matching CDC data.
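- With laboratory results as the reference standard, sensitivity varied by virus (8–67%); PPV was high for most viruses (90–97%) but lower for influenza and SARS-CoV-2 (69–70%), improving when additional ICD codes were included.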
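
To make the two metrics in the last finding concrete, here is a minimal sketch of computing sensitivity and PPV against a laboratory reference standard. The function name, the record layout, and the toy counts are illustrative assumptions; they are not the paper's code or data, and the paper may define its evaluation set differently.

```python
def sensitivity_and_ppv(records):
    """Compute (sensitivity, PPV) from a list of boolean pairs.

    Each record is (phenotype_positive, lab_positive). The lab result is
    treated as the reference standard (assumption: every record has one).
    """
    tp = sum(1 for pheno, lab in records if pheno and lab)      # true positives
    fp = sum(1 for pheno, lab in records if pheno and not lab)  # false positives
    fn = sum(1 for pheno, lab in records if not pheno and lab)  # false negatives
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Made-up toy counts, chosen only to illustrate low sensitivity with high PPV.
records = (
    [(True, True)] * 6      # flagged by ICD/medication and lab-confirmed
    + [(True, False)] * 1   # flagged but lab-negative
    + [(False, True)] * 14  # lab-confirmed but not flagged
    + [(False, False)] * 79 # neither
)
sensitivity, ppv = sensitivity_and_ppv(records)
print(f"sensitivity={sensitivity:.2f}, PPV={ppv:.2f}")
```

A phenotype can miss most lab-confirmed infections (low sensitivity) while still being right about nearly every case it does flag (high PPV), which is the trade-off the finding describes.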

## Abstract

Electronic health records (EHRs) contain rich temporal data about respiratory viral infections, but methods to identify these infections from EHR data vary widely and lack robust validation. We developed computable phenotypes by integrating virus-specific International Classification of Diseases (ICD) billing codes, prescriptions, and laboratory results within 90-day episodes. Analysis of 265,222 participants with EHR data from the All of Us Research Program yielded national cohorts of varied size: large cohorts for SARS-CoV-2 (n = 28,729) and influenza (n = 19,784); medium cohorts for rhinovirus, human coronavirus, and respiratory syncytial virus (n = 1,161–1,620); and smaller cohorts for the other viruses (n = 238–486). Using laboratory results as a reference standard, phenotypes using virus-specific ICD codes and medications had variable sensitivity (8–67%) but high positive predictive value (PPV, 90–97%) for most viruses, while influenza virus and SARS-CoV-2 phenotypes had lower PPV (69–70%) that improved with the inclusion of additional ICD codes. Identified infections exhibited expected seasonal patterns matching CDC data. This integrated approach identified infections more effectively than individual components alone and demonstrated utility for severe infections in hospital settings. This method enables large-scale studies of host genetics, health disparities, and clinical outcomes across episodic diseases, with flexibility to optimize sensitivity or PPV depending on the specific research question.
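
The 90-day episode integration described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Episode` class, the `group_into_episodes` function, the source labels, and the rule that an episode runs 90 days from its first event are all hypothetical, since the abstract does not specify how the window is anchored.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

EPISODE_WINDOW = timedelta(days=90)  # window length stated in the abstract


@dataclass
class Episode:
    start: date
    sources: set = field(default_factory=set)  # e.g. {"icd", "rx", "lab"}


def group_into_episodes(events):
    """Group (event_date, source) pairs for one participant and one virus.

    Assumption: an episode covers the 90 days after its first event, and
    any later event opens a new episode. The paper may anchor the window
    differently.
    """
    episodes = []
    for event_date, source in sorted(events):
        if episodes and event_date - episodes[-1].start < EPISODE_WINDOW:
            episodes[-1].sources.add(source)  # same episode, new evidence
        else:
            episodes.append(Episode(start=event_date, sources={source}))
    return episodes


# Hypothetical events for a single participant.
events = [
    (date(2022, 1, 10), "icd"),
    (date(2022, 1, 12), "rx"),
    (date(2022, 2, 1), "lab"),
    (date(2022, 11, 20), "icd"),
]
episodes = group_into_episodes(events)
for ep in episodes:
    print(ep.start, sorted(ep.sources))
```

Keeping the contributing sources on each episode makes it possible to compare the integrated phenotype with any single component, for example by filtering to episodes whose `sources` contain `"lab"`.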

The online version contains supplementary material available at 10.1038/s41598-025-02183-9.

## Linked entities

- **Diseases:** SARS-CoV-2 (MONDO:0100096), influenza (MONDO:0005812)

## Full-text entities

- **Diseases:** infections (MESH:D007239), episodic diseases (MESH:C580065)
- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12120013/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12120013/full.md

## References

6 references — full list in the complete paper: https://tomesphere.com/paper/PMC12120013/full.md

---
Source: https://tomesphere.com/paper/PMC12120013