# COVID-19 outbreaks surveillance through text mining applied to electronic health records

**Authors:** Hermano Alexandre Lima Rocha, Erik Zarko Macêdo Solha, Vasco Furtado, Francion Linhares Justino, Lucas Arêa Leão Barreto, Ronaldo Guedes da Silva, Ítalo Martins de Oliveira, David Westfall Bates, Luciano Pamplona de Góes Cavalcanti, Antônio Silva Lima Neto, Erneson Alves de Oliveira

PMC · DOI: 10.1186/s12879-024-09250-y · BMC Infectious Diseases · 2024-03-28

## TL;DR

This study uses electronic health records and text mining to detect early signs of new COVID-19 outbreaks, weeks before traditional systems.

## Contribution

A novel data science model for real-time outbreak detection using emergency care records and text mining in low-resource settings.

## Key findings

- The model detected potential outbreaks with a time-lag of up to 72 days before confirmation.
- Cross-correlation values of up to 0.93 indicate strong alignment with confirmed cases during the second wave.
- The model's performance varied across pandemic waves, showing adaptability to changing conditions.

## Abstract

The COVID-19 pandemic has caused significant disruptions to everyday life and has had social, political, and financial consequences that will persist for years. Several initiatives with intensive use of technology were quickly developed in this scenario. However, technologies that enhance epidemiological surveillance in contexts with low testing capacity and healthcare resources are scarce. Therefore, this study aims to address this gap by developing a data science model that uses routinely generated healthcare encounter records to detect possible new outbreaks early in real-time.

We defined an epidemiological indicator that is a proxy for suspected cases of COVID-19 using the health records of Emergency Care Unit (ECU) patients and text mining techniques. The open-field dataset comprises 2,760,862 medical records from nine ECUs, where each record has information about the patient’s age, reported symptoms, and the time and date of admission. We also used a dataset where 1,026,804 cases of COVID-19 were officially confirmed. The records range from January 2020 to May 2022. Sample cross-correlation between two finite stochastic time series was used to evaluate the models.
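As a rough illustration of how a free-text symptom field could be turned into a daily count of suspected cases, the sketch below matches a small keyword list against each record. The term list, the record field names (`age`, `symptoms`, `admitted_at`), and the function `daily_suspected_counts` are all assumptions made for this example; this summary does not specify the terms or the pipeline the authors used.

```python
import re
import unicodedata
from collections import Counter
from datetime import datetime

# Hypothetical symptom terms; the summary does not list the terms the authors used.
SUSPECT_TERMS = ("fever", "cough", "dyspnea", "shortness of breath", "anosmia")

# One regex that matches any term as a whole word or phrase.
PATTERN = re.compile("|".join(r"\b" + re.escape(t) + r"\b" for t in SUSPECT_TERMS))


def normalize(text):
    """Lowercase the text and strip accents so spelling variants still match."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def daily_suspected_counts(records, min_age=18):
    """Count, per admission date, the records whose symptom text matches a term."""
    counts = Counter()
    for rec in records:
        if rec["age"] >= min_age and PATTERN.search(normalize(rec["symptoms"])):
            counts[rec["admitted_at"].date()] += 1
    return dict(counts)


# Toy records with assumed field names, for illustration only.
records = [
    {"age": 34, "symptoms": "Fever and dry cough for 3 days", "admitted_at": datetime(2021, 1, 4, 9, 30)},
    {"age": 52, "symptoms": "Ankle sprain", "admitted_at": datetime(2021, 1, 4, 11, 0)},
    {"age": 7, "symptoms": "Cough", "admitted_at": datetime(2021, 1, 4, 12, 15)},
    {"age": 61, "symptoms": "Shortness of breath", "admitted_at": datetime(2021, 1, 5, 8, 5)},
]
counts = daily_suspected_counts(records)
```

The resulting per-day counts form a time series that can then be compared with the series of officially confirmed cases.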

For patients aged ≥18 years, we find a time-lag ($\tau_c$) of 72 days and a cross-correlation ($\widehat{p}_{ij}$) of ~0.82, $\tau_c$ = 25 days and $\widehat{p}_{ij}$ ~0.93, and $\tau_c$ = 17 days and $\widehat{p}_{ij}$ ~0.88 for the first, second, and third waves, respectively.
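The time-lag and cross-correlation values reported here can be read as the lag at which the sample cross-correlation between the indicator series and the confirmed-case series peaks, and the value at that peak. The sketch below uses the textbook sample cross-correlation estimator on synthetic data; the function names, the synthetic curves, and the choice of searching only non-negative lags are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def cross_correlation(x, y, lag):
    """Sample cross-correlation between x[t] and y[t + lag], for lag >= 0.

    Uses the usual biased estimator: lagged covariance divided by n,
    normalized by the population standard deviations of x and y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean()
    cov = np.sum(xc[: n - lag] * yc[lag:]) / n
    return cov / (x.std() * y.std())


def best_lag(x, y, max_lag):
    """Return (lag, correlation) for the lag in 0..max_lag with the highest correlation."""
    values = [cross_correlation(x, y, k) for k in range(max_lag + 1)]
    k = int(np.argmax(values))
    return k, float(values[k])


# Synthetic example: y is the same bell-shaped wave as x, delayed by 20 days.
t = np.arange(200)
indicator = np.exp(-(((t - 80) / 15.0) ** 2))   # stands in for suspected cases
confirmed = np.exp(-(((t - 100) / 15.0) ** 2))  # stands in for confirmed cases
lag, corr = best_lag(indicator, confirmed, max_lag=60)
```

In this toy setup the peak falls at the 20-day delay that was built into the data. With real series, each pandemic wave would be analyzed over its own time window, which is presumably why the reported values differ by wave.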

In conclusion, the developed model can aid in the early detection of signs of possible new COVID-19 outbreaks, weeks before traditional surveillance systems, so that preventive and control actions in public health can be initiated earlier and with a higher likelihood of success.

## Linked entities

- **Diseases:** COVID-19 (MONDO:0100096)

## Full-text entities

- **Diseases:** COVID-19 (MESH:D000086382)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC10976796/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC10976796/full.md

## References

27 references — full list in the complete paper: https://tomesphere.com/paper/PMC10976796/full.md

---
Source: https://tomesphere.com/paper/PMC10976796