# Research on New Methods of Topic Mining and Topic Prediction for Medical Preprints on Emerging Infectious Diseases

**Authors:** Zongjing Liang, Yun Kuang, Gongcheng Liang, Zhijie Li, Mingfeng Jiang

PMC · DOI: 10.7759/cureus.85773 · Cureus · 2025-06-11

## TL;DR

This paper introduces a new method combining public attention data and topic analysis to predict research trends on emerging infectious diseases using medical preprints.

## Contribution

A prediction framework that links Google Trends search data to LDA-derived topic intensity through an ARDL model, for near-real-time monitoring of medical preprint topics.

## Key findings

- Seven major research topics were identified from 18,060 COVID-19-related preprint abstracts using LDA.
- ARDL analysis confirmed a significant dynamic relationship between public search trends and topic intensity.
- The proposed method demonstrated good predictive performance for tracking topic evolution in medical preprints.

## Abstract

Background and purpose

To cope with the continuing risk of sudden infectious disease outbreaks and to monitor research trends in real time, this paper proposes a prediction framework that combines public attention indicators with topic analysis of medical preprints. Because traditional topic prediction methods lag behind events, the paper introduces Google Trends data to improve the timeliness of prediction.

Methods

In this study, 18,060 COVID-19-related preprint abstracts were collected from the medRxiv platform with a web crawler. Latent Dirichlet Allocation (LDA), an unsupervised probabilistic model, was used to extract the latent topic structure of the text. To analyze the dynamic relationship between research topic intensity and public attention, the study used the Autoregressive Distributed Lag (ARDL) model, which can handle a mix of I(0) and I(1) time series. Text preprocessing included word segmentation, stop word removal, lemmatization, and synonym standardization. Time series data were aggregated by week and log-transformed; the Augmented Dickey-Fuller (ADF) unit root test was used to check stationarity, and non-stationary variables were differenced. The LDA and ARDL models were implemented in Python and EViews 10, respectively.
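The time-series preparation described above (weekly aggregation, log transform, differencing) can be illustrated with a minimal standard-library sketch. The function names and the sample posting dates are hypothetical and not taken from the paper, and the ADF test itself is left out here.

```python
import math
from collections import Counter
from datetime import date


def weekly_counts(dates):
    """Count items per ISO (year, week), sorted chronologically.

    Weeks with no items are simply absent, so a real pipeline would
    need to decide how to fill gaps before taking logarithms.
    """
    counts = Counter(tuple(d.isocalendar())[:2] for d in dates)
    return sorted(counts.items())


def log_transform(values):
    """Natural log of each value (values must be strictly positive)."""
    return [math.log(v) for v in values]


def first_difference(values):
    """First difference: value[t] - value[t-1]."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical posting dates, for illustration only.
posting_dates = [
    date(2020, 3, 2), date(2020, 3, 3),            # ISO week 10
    date(2020, 3, 9), date(2020, 3, 10),           # ISO week 11
    date(2020, 3, 11), date(2020, 3, 12),
]

weekly = weekly_counts(posting_dates)
log_series = log_transform([n for _, n in weekly])
diff_series = first_difference(log_series)
```

In practice the decision to difference a series would follow a unit root test, as in the paper; in Python, `statsmodels.tsa.stattools.adfuller` is one commonly used implementation.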

Results

LDA modeling identified seven major research topics. ARDL analysis showed a significant dynamic relationship between public search trends and topic intensity, and the fitted model showed good predictive performance.
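To make the ARDL step concrete, here is a minimal sketch of an ARDL(1,1) model fitted by ordinary least squares with NumPy. The lag orders, coefficient values, and synthetic series are assumptions chosen for illustration; this summary does not state the lag structure the authors selected, and their estimation was done in EViews 10, not with this code.

```python
import numpy as np


def fit_ardl_1_1(y, x):
    """Fit y_t = c + phi*y_{t-1} + b0*x_t + b1*x_{t-1} + e_t by OLS.

    Returns the coefficient vector [c, phi, b0, b1].
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # One row per t = 1..T-1: intercept, lagged y, current x, lagged x.
    design = np.column_stack([np.ones(len(y) - 1), y[:-1], x[1:], x[:-1]])
    coef, *_ = np.linalg.lstsq(design, y[1:], rcond=None)
    return coef


def forecast_next(coef, y_last, x_next, x_last):
    """One-step-ahead forecast; needs the next-period value of x."""
    c, phi, b0, b1 = coef
    return c + phi * y_last + b0 * x_next + b1 * x_last


# Synthetic, noise-free data (hypothetical): x plays the role of a
# search-interest series, y the role of a topic-intensity series.
rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = np.zeros(60)
for t in range(1, 60):
    y[t] = 0.5 + 0.6 * y[t - 1] + 0.3 * x[t] + 0.2 * x[t - 1]

coef = fit_ardl_1_1(y, x)
next_y = forecast_next(coef, y_last=y[-1], x_next=0.1, x_last=x[-1])
```

Because the synthetic data contain no noise, the fit recovers the generating coefficients up to floating-point error. A full analysis would also select lag orders and test for a long-run relationship; `statsmodels.tsa.ardl.ARDL` provides a more complete Python implementation.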

Conclusion

This study combined LDA with an ARDL model to construct a real-time prediction method for tracking the evolution of medical preprint topics. The method has theoretical and practical significance for public health informatics and offers feasible predictive support for monitoring and preventing future infectious disease outbreaks.

## Linked entities

- **Diseases:** COVID-19 (MONDO:0100096)

## Full-text entities

- **Diseases:** COVID-19 (MESH:D000086382), Infectious Diseases (MESH:D003141)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12248262/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12248262/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC12248262/full.md

---
Source: https://tomesphere.com/paper/PMC12248262