# Forecasting framework for dominant SARS-CoV-2 strains before clade replacement using phylogeny-informed genetic distances

**Authors:** Kyuyoung Lee, Atanas V. Demirev, Sangyi Lee, Seunghye Cho, Hyunbeen Kim, Junhyung Cho, Jeong-Sun Yang, Kyung-Chang Kim, Joo-Yeon Lee, Woojin Shin, Soyoung Lee, Sejik Park, Philippe Lemey, Man-Seong Park, Jin Il Kim

PMC · DOI: 10.3389/fmicb.2025.1619546 · Frontiers in Microbiology · 2025-06-20

## TL;DR

This paper introduces a forecasting framework that predicts which SARS-CoV-2 variants will become dominant by analyzing non-synonymous and synonymous genetic distances from clade roots, with the aim of informing timely vaccine antigen updates.

## Contribution

A novel forecasting framework using phylogeny-informed genetic distances to predict SARS-CoV-2 clade replacements before they occur.

## Key findings

- The framework accurately predicted clade replacements with an AUROC > 0.90 using both complete genomes and spike gene data.
- Dominant and extinct variants showed distinct patterns in non-synonymous and synonymous genetic distances from clade roots.
- The approach provides quantifiable molecular criteria for vaccine updates based on genetic novelty.

## Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent of the global coronavirus disease 2019 (COVID-19) pandemic and continues to drive successive waves of infection through the emergence of novel variants. Consequently, accurately predicting the next clade roots through global surveillance is crucial for effective prevention, control, and timely vaccine antigen updates. This study evaluated the evolutionary dynamics of SARS-CoV-2 using phylogeny-informed genetic distances based on 394 complete genomes and spike (S) gene sequences. Furthermore, we introduced a forecasting framework to estimate the potential of emerging variants to drive clade replacement by analyzing non-synonymous and synonymous genetic distances from clade roots, which reflect global herd immune pressure.

Non-synonymous and synonymous genetic distances from both Wuhan and clade root strains were assessed to predict whether a clade would become dominant or extinct within 3 months before the clade replacement.
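To make the two distance types concrete, the sketch below counts codon-level differences between a root sequence and a variant and labels each as non-synonymous (the amino acid changes) or synonymous (it does not). This is a simplified pairwise illustration, not the paper's phylogeny-informed measure, which the abstract does not specify in detail. The function name `codon_distances` and the toy sequences are hypothetical, and the inputs are assumed to be aligned, in-frame coding sequences of equal length.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_distances(root: str, variant: str) -> tuple[int, int]:
    """Return (non_synonymous, synonymous) codon differences.

    Assumes both sequences are aligned, in-frame, and of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(root) != len(variant) or len(root) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    non_syn = syn = 0
    for i in range(0, len(root), 3):
        a = root[i:i + 3].upper()
        b = variant[i:i + 3].upper()
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1       # same amino acid, different codon
        else:
            non_syn += 1   # amino acid replaced
    return non_syn, syn


# Toy example (made-up sequences): TTT->TTC keeps Phe, AAA->AAT turns Lys into Asn.
print(codon_distances("ATGTTTAAA", "ATGTTCAAT"))
```

Counting whole codons is the coarsest possible choice; a fuller treatment would normalize by the number of synonymous and non-synonymous sites and accumulate changes along the branches of the phylogeny.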

Across five observed clade replacements up to January 2024, we captured quantifiable heterogeneity between dominant and extinct variants in the non-synonymous and synonymous genetic distances of the S gene from clade roots, reflecting the extent of novelty, whether through gradual or drastic change.

Our framework demonstrated high predictability for identifying the next clade root before replacement in both training and test datasets (area under the receiver operating characteristic curve [AUROC] > 0.90) by incorporating differential weighting of non-synonymous and synonymous genetic distances. Additionally, the framework using spike gene data alone demonstrated accuracy similar to that of the framework using the complete genome. Overall, our approach establishes quantifiable molecular criteria for identifying potential updates to the SARS-CoV-2 vaccine, contributing to proactive pandemic preparedness.
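As an illustration of how a differentially weighted distance score can be evaluated with AUROC, the following minimal sketch combines the two distances linearly and computes AUROC as the probability that a dominant variant outscores an extinct one. The linear form, the weights `w_nonsyn` and `w_syn`, the function names, and the toy values are all assumptions for illustration; the abstract does not state the paper's actual weighting scheme or model.

```python
def weighted_score(non_syn: float, syn: float,
                   w_nonsyn: float = 1.0, w_syn: float = 0.5) -> float:
    """Linear combination of the two distances (weights are placeholders)."""
    return w_nonsyn * non_syn + w_syn * syn


def auroc(scores: list[float], labels: list[int]) -> float:
    """AUROC via pairwise comparison of positives and negatives.

    labels: 1 = variant became dominant, 0 = variant went extinct.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up (non_syn, syn) distances from a clade root, purely illustrative.
toy_distances = [(12, 3), (10, 4), (3, 5), (2, 1)]
toy_labels = [1, 1, 0, 0]
toy_scores = [weighted_score(n, s) for n, s in toy_distances]
print(auroc(toy_scores, toy_labels))
```

The pairwise form is quadratic in the number of variants, which is fine for a few hundred sequences; for larger sets a rank-based computation gives the same value more cheaply.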

## Linked entities

- **Genes:** S (surface glycoprotein) [NCBI Gene 43740568]
- **Diseases:** coronavirus disease 2019 (MONDO:0100096), COVID-19 (MONDO:0100096)
- **Species:** Severe acute respiratory syndrome coronavirus 2 (taxon 2697049)

## Full-text entities

- **Genes:** S (surface glycoprotein) [NCBI Gene 43740568] {aka spike glycoprotein}
- **Diseases:** infection (MESH:D007239), COVID-19 (MESH:D000086382)
- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12226564/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12226564/full.md

## References

63 references — full list in the complete paper: https://tomesphere.com/paper/PMC12226564/full.md

---
Source: https://tomesphere.com/paper/PMC12226564