# Estimation of unconfirmed COVID-19 cases from a cross-sectional survey of >10 000 households and a symptom-based machine learning model in Gilgit-Baltistan, Pakistan

**Authors:** Daniel S Farrar, Lisa G Pell, Yasin Muhammad, Sher Hafiz Khan, Lauren Erdman, Diego G Bassani, Zachary Tanner, Imran Ahmed Chauhadry, Muhammad Karim, Falak Madhani, Shariq Paracha, Masood Ali Khan, Sajid Soofi, Monica Taljaard, Rachel F Spitzer, Sarah M Abu Fadaleh, Zulfiqar A Bhutta, Shaun K Morris

PMC · DOI: 10.1136/bmjph-2024-001255 · BMJ Public Health · 2025-04-28

## TL;DR

This study combines a cross-sectional survey of 10 257 households with a symptom-based machine learning model to estimate how many SARS-CoV-2 infections in Gilgit-Baltistan, Pakistan, went unconfirmed between March 2020 and August 2021.

## Contribution

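The study develops a symptom-based machine learning model, trained to separate confirmed/probable infections from negative tests, and applies it to untested survey respondents to estimate undiagnosed SARS-CoV-2 infections in a low-testing setting.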

## Key findings

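- The study estimates 8–17 total SARS-CoV-2 infections for each positive test in Gilgit-Baltistan.
- The ratio of estimated to confirmed infections was highest among children aged 1–4 years (211–480:1) and 5–9 years (80–185:1), and was higher among females (13–25:1) than overall.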
- The symptom-based model discriminated well between confirmed/probable infections and negative tests (AUC 0.92, 95% CI 0.90 to 0.93).
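For readers unfamiliar with the AUC, it can be read as the probability that a randomly chosen confirmed/probable case receives a higher model score than a randomly chosen test-negative respondent. The following is a minimal sketch of that pairwise definition; the function name `pairwise_auc` and the scores are made up for illustration and are not taken from the paper.

```python
def pairwise_auc(case_scores, negative_scores):
    """AUC as the share of (case, negative) pairs ranked correctly.

    Ties count as half a correct pair. Inputs are hypothetical model
    scores, one list per group.
    """
    if not case_scores or not negative_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for c in case_scores:
        for n in negative_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(negative_scores))


# Illustrative scores only: 11 of the 12 pairs are ranked correctly.
example_auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.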

## Abstract

Robust estimates of COVID-19 prevalence in settings with limited capacity for SARS-CoV-2 molecular and serologic testing are scarce. We aimed to describe the epidemiology of confirmed and probable COVID-19 in Gilgit-Baltistan, and to develop a symptom-based predictive model to identify individuals who were infected with SARS-CoV-2 but undiagnosed.

We conducted a cross-sectional survey in 10 257 randomly selected households in Gilgit-Baltistan from June to August 2021. Data regarding SARS-CoV-2 testing, healthcare worker (HCW) diagnoses, symptoms and outcomes since March 2020 were self-reported by households. ‘Confirmed/probable’ infection was defined as a positive test, HCW COVID-19 diagnosis or HCW pneumonia diagnosis with COVID-19-positive contact. Robust Poisson regression was conducted to assess differences in symptoms, outcomes and SARS-CoV-2 testing rates. We developed a symptom-based machine learning model to differentiate confirmed/probable infections from those with negative tests. We applied this model to untested respondents to estimate the total prevalence of SARS-CoV-2 infection.
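The case definition above can be sketched as a small classification rule. This is an illustration only: the `Respondent` fields and the `classify` function are hypothetical names, not the study's survey variables, and the sketch assumes that a confirmed/probable criterion takes precedence over a negative test, which the abstract does not state.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    # Hypothetical field names; the actual survey variables are not given here.
    tested: bool = False
    positive_test: bool = False
    hcw_covid_diagnosis: bool = False
    hcw_pneumonia_diagnosis: bool = False
    covid_positive_contact: bool = False


def classify(r: Respondent) -> str:
    """Assign one of the three groups described in the abstract."""
    confirmed_or_probable = (
        r.positive_test
        or r.hcw_covid_diagnosis
        # Pneumonia counts only together with a COVID-19-positive contact.
        or (r.hcw_pneumonia_diagnosis and r.covid_positive_contact)
    )
    if confirmed_or_probable:
        return "confirmed/probable"  # assumed to override a negative test
    if r.tested:
        return "test-negative"
    return "untested"
```

Under this reading, the model is trained on the first two groups and then applied to the third.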

Data were collected for 77 924 people. Overall, 314 (0.5%) had confirmed/probable infections, 3263 (4.4%) had negative tests and 74 347 (95.1%) were untested. Children were tested less often than adults (adjusted prevalence ratio (aPR) 0.08, 95% CI 0.06 to 0.12 for ages 1–4 years vs 30–39 years), while males were tested more often than females (aPR 1.51, 95% CI 1.40 to 1.63). In the predictive model, area under the receiver operating characteristic curve was 0.92 (95% CI 0.90 to 0.93). We estimate there were 8–17 total SARS-CoV-2 infections for each positive test (8–17:1). The ratio of estimated to confirmed cases was higher for ages 1–4 years (211–480:1), 5–9 years (80–185:1) and for females (13–25:1).
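The arithmetic behind a ratio such as 8–17:1 can be sketched as follows. This summary does not say how the authors derived the lower and upper bounds, so the sketch assumes, purely for illustration, that untested respondents are flagged as infected when their predicted probability meets a cut-off, and that two cut-offs give the two ends of the range. The function name, probabilities, thresholds and counts are all hypothetical.

```python
def infections_per_confirmed(confirmed, untested_probs, threshold):
    """Estimated total infections per confirmed infection.

    confirmed: number of confirmed infections (must be positive).
    untested_probs: hypothetical model probabilities for untested people.
    threshold: assumed cut-off above which an untested person is counted.
    """
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    predicted = sum(1 for p in untested_probs if p >= threshold)
    return (confirmed + predicted) / confirmed


# Made-up probabilities for eight untested respondents.
probs = [0.95, 0.85, 0.65, 0.55, 0.30, 0.20, 0.10, 0.05]
upper = infections_per_confirmed(2, probs, threshold=0.5)  # lenient cut-off
lower = infections_per_confirmed(2, probs, threshold=0.8)  # strict cut-off
```

A stricter cut-off flags fewer untested respondents and so yields the lower end of the range.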

From March 2020 to August 2021, the majority of SARS-CoV-2 infections in Gilgit-Baltistan went unconfirmed, particularly among women and children. Predictive models which incorporate self-reported symptoms may improve understanding of the burden of disease in settings lacking diagnostic capacity.

## Linked entities

- **Diseases:** COVID-19 (MONDO:0100096), pneumonia (MONDO:0005249)

## Full-text entities

- **Diseases:** COVID-19 (MESH:D000086382), pneumonia (MESH:D011014), infected (MESH:D007239)
- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12039044/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12039044/full.md

## References

40 references — full list in the complete paper: https://tomesphere.com/paper/PMC12039044/full.md

---
Source: https://tomesphere.com/paper/PMC12039044