# Leveraging knowledge graphs and large language models for integrating molecular variants and clinical insights in COVID-19 research

**Authors:** Jiaxin Yang, Fushuai Zhang, Ruifang Cao, Yingying Chen, Yiping Chen, Yuxin Chen, Yixue Li, Guoping Zhao, Ying Wang, Yunchao Ling, Guoqing Zhang

PMC · DOI: 10.1016/j.bsheal.2025.12.003 · Biosafety and Health · 2025-12-20

## TL;DR

This study creates a knowledge graph and AI tool to link SARS-CoV-2 mutations with clinical outcomes, aiding in real-time variant risk assessment and public health decisions.

## Contribution

The novel contribution is the development of CoVAR-KG, a literature-derived knowledge graph, and the COVID-19 Variant Risk Watcher (CVRW), which together integrate molecular and clinical data and use graph-based retrieval with GPT-4o for near-real-time variant risk forecasting.

## Key findings

- A knowledge graph (CoVAR-KG) with 1 million nodes was built from 439,724 studies, linking nine biomedical domains.
- The CVRW system uses graph-based retrieval and GPT-4o to predict WHO variant classifications in near real time.
- The framework enables interpretable evaluation of mutation effects on antigenicity, transmissibility, and immune escape.
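
The graph-based retrieval step described above can be illustrated with a minimal sketch. It assumes the graph is stored as (head, relation, tail) triples tagged with a domain; the `Triple` type, the toy triples, the helper names, and the prompt wording are all hypothetical and are not taken from CoVAR-KG or CVRW. The sketch stops at prompt assembly and does not call any language model.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str
    domain: str  # e.g. one of the nine domains listed in the abstract


# Toy triples for illustration only; they are NOT extracted from CoVAR-KG.
TOY_TRIPLES = [
    Triple("N501Y", "affects", "ACE2 binding", "biochemistry"),
    Triple("D614G", "associated_with", "transmissibility", "clinical presentation"),
    Triple("P681H", "studied_in", "neutralization assay", "serology"),
    Triple("N501Y", "studied_in", "neutralization assay", "antibodies"),
]


def build_index(triples):
    """Group triples by head entity so a mutation can be looked up directly."""
    index = defaultdict(list)
    for t in triples:
        index[t.head].append(t)
    return index


def retrieve(index, mutations):
    """Collect every triple whose head is one of the given mutations."""
    hits = []
    for m in mutations:
        hits.extend(index.get(m, []))  # .get avoids creating empty entries
    return hits


def build_prompt(variant, mutations, evidence):
    """Format retrieved triples as context for a language model prompt."""
    lines = [f"- {t.head} {t.relation} {t.tail} [{t.domain}]" for t in evidence]
    context = "\n".join(lines) if lines else "- (no evidence retrieved)"
    return (
        f"Variant: {variant}\n"
        f"Spike substitutions: {', '.join(mutations)}\n"
        f"Evidence from the graph:\n{context}\n"
        "Question: which risk category best fits this variant, and why?"
    )


index = build_index(TOY_TRIPLES)
evidence = retrieve(index, ["N501Y", "F456L"])
prompt = build_prompt("example-variant", ["N501Y", "F456L"], evidence)
print(prompt)
```

Grounding the prompt in retrieved triples is what makes the model's answer traceable to specific literature statements; a mutation with no triples (here `F456L` in the toy data) simply contributes no evidence.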

## Abstract

- **Scientific question:** This study addresses the challenge of systematically linking severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike-protein substitutions with clinical, immunological, and therapeutic outcomes for functional risk assessment of emerging variants.
- **Evidence before this study:** Existing coronavirus disease 2019 (COVID-19) databases and analytical frameworks are fragmented across molecular, immunological, and clinical domains, limiting integrative interpretation of mutation-driven functional effects.
- **New findings:** We curated and mined 439,724 COVID-19 studies to construct a 1-million-node knowledge graph (CoVAR-KG) spanning nine biomedical domains. Building on this resource, we developed the COVID-19 Variant Risk Watcher (CVRW), which combines graph-based retrieval with GPT-4o to forecast the World Health Organization (WHO) variant classifications in near real time.
- **Significance of the study:** This integrative framework enables interpretable, literature-grounded evaluation of mutation-induced changes in antigenicity, transmissibility, and immune escape, providing a scalable foundation for genomic surveillance and public health decision-making.

The relentless emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants continues to challenge global health, as high mutation rates and complex pathogenicity obscure molecular mechanisms and impede clinical progress. Despite extensive research across viral evolution, structural biology, immunology, diagnostics, and therapeutics, the resulting vast and rapidly outdated literature has widened the gap between fundamental discovery and medical application. Here, we systematically mined 439,724 coronavirus disease 2019 (COVID-19) publications using fine-tuned large language models to extract and distill knowledge across nine domains: antibodies, vaccines, serology, biochemistry, therapeutics, clinical presentation, risk factors, biomarkers, and diagnostics. These insights were integrated into a unified graph of 1,427,596 triples (CoVAR-KG). Covering 90% of known spike-protein variant sites, our knowledge graph forges molecular-to-clinical links that reveal how specific mutations influence antigenicity, transmissibility, and treatment response. By resolving data fragmentation, this resource accelerates target identification and streamlines hypothesis generation. Building on CoVAR-KG, we developed the COVID-19 Variant Risk Watcher (CVRW), an early-warning framework that quantifies the threat of emerging variants for real-time surveillance. Coupling the graph with retrieval-augmented GPT-4o enables rapid and in-depth comparisons of variant functionality and immune escape potential. These integrative tools furnish timely insights for vaccine design, therapeutic optimization, and pandemic preparedness, establishing a versatile platform for combating current and future viral threats.
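
One plausible way to read the site-coverage figure in the abstract is as the fraction of known variant sites that appear at least once in the graph. The sketch below computes that ratio under this assumption; the summary does not state how the authors measured coverage, and the function names and the two toy mutation lists are hypothetical.

```python
import re

# A substitution such as "N501Y": reference residue, position, new residue.
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def site_of(mutation):
    """Return the position of a substitution, or None if the label does not parse."""
    match = SUBSTITUTION.match(mutation)
    return int(match.group(2)) if match else None


def site_coverage(known_mutations, graph_mutations):
    """Fraction of known sites that occur in at least one graph mutation."""
    known_sites = {s for s in map(site_of, known_mutations) if s is not None}
    graph_sites = {s for s in map(site_of, graph_mutations) if s is not None}
    if not known_sites:
        return 0.0
    return len(known_sites & graph_sites) / len(known_sites)


# Toy inputs for illustration only; they do not reproduce the paper's figure.
known = ["G142D", "N501Y", "D614G", "P681H", "F456L"]
in_graph = ["N501Y", "D614G", "P681H", "N501"]  # "N501" has no new residue and is skipped

print(f"{site_coverage(known, in_graph):.0%}")
```

Comparing positions rather than full substitution labels means that two different substitutions at the same site count as covering that site, which matches the abstract's wording of "variant sites" rather than individual mutations.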

## Linked entities

- **Diseases:** COVID-19 / coronavirus disease 2019 (MONDO:0100096)

## Full-text entities

- **Genes:** SPECC1 (sperm antigen with calponin homology and coiled-coil domains 1) [NCBI Gene 92521] {aka CYTSB, HCMOGT-1, HCMOGT1, NSP, NSP5}, Mpro [NCBI Gene 8673700], S (surface glycoprotein) [NCBI Gene 43740568] {aka spike glycoprotein}, ACE2 (angiotensin converting enzyme 2) [NCBI Gene 59272] {aka ACEH}
- **Diseases:** coronavirus (MESH:D018352), hallucinations (MESH:D006212), COVID-19 (MESH:D000086382), LLM (MESH:D007806), long COVID (MESH:D000094024), BA.1 (MESH:C538557)
- **Chemicals:** BE (MESH:D001608), GPT-4o (-), nirmatrelvir (MESH:C000718217)
- **Species:** Homo sapiens (human, species) [taxon 9606], Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Gammacoronavirus (genus) [taxon 694013]
- **Mutations:** G142D, N679K, T572I, N501Y, K356T, S50L, N501, R346T, P132H, D614G, P681H, F456L, V1104L
- **Cell lines:** JN.1 — Homo sapiens (Human), Lung small cell carcinoma, Cancer cell line (CVCL_0C15)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12931379/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12931379/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/PMC12931379/full.md

---
Source: https://tomesphere.com/paper/PMC12931379