UMAP-assisted $K$-means clustering of large-scale SARS-CoV-2 mutation datasets
Yuta Hozumi, Rui Wang, Changchuan Yin, and Guo-Wei Wei

TL;DR
This paper presents a UMAP-assisted $k$-means clustering method to efficiently analyze large-scale SARS-CoV-2 mutation datasets, improving clustering accuracy and visualization for understanding virus evolution.
Contribution
The study introduces a novel combination of UMAP with $k$-means clustering for large genomic datasets, demonstrating its superiority over PCA and t-SNE in accuracy and efficiency.
Findings
UMAP outperforms PCA and t-SNE in clustering large SARS-CoV-2 datasets.
UMAP-assisted clustering improves visualization and accuracy of mutation analysis.
The method enables effective analysis of increasingly large genomic datasets.
Abstract
Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has a worldwide devastating effect. The understanding of evolution and transmission of SARS-CoV-2 is of paramount importance for the COVID-19 control, combating, and prevention. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). By using four benchmark datasets, we found that UMAP is the best-suited technique due…
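The two-stage idea in the abstract (reduce the dimension first, then run $k$-means on the embedding) can be illustrated with a minimal, dependency-free sketch. This is not the authors' pipeline. PCA (one of the three reducers the abstract compares) stands in for UMAP so the example runs with the standard library alone. The binary mutation profiles, the helper names `pca_project`, `sqdist`, and `kmeans`, and the farthest-first initialization are all assumptions made for illustration.

```python
import random


def pca_project(rows, n_components=2, n_iter=100, seed=0):
    """Project rows onto their top principal components (power iteration + deflation)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]  # centered data
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(n_iter):
            # w = X^T (X v): the scatter matrix applied to v
            Xv = [sum(a * b for a, b in zip(x, v)) for x in X]
            w = [sum(X[i][j] * Xv[i] for i in range(n)) for j in range(d)]
            # Deflation: remove directions that were already extracted
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = sum(a * a for a in w) ** 0.5
            if norm == 0.0:
                break
            v = [a / norm for a in w]
        components.append(v)
    return [[sum(a * b for a, b in zip(x, c)) for c in components] for x in X]


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: 12 samples x 12 mutation sites (1.0 = mutation present).
# Group A carries mutations at sites 0-5, group B at sites 6-11; each sample lacks one.
n_sites = 12
group_a = [[1.0 if j < 6 and j != i else 0.0 for j in range(n_sites)] for i in range(6)]
group_b = [[1.0 if j >= 6 and j != 6 + i else 0.0 for j in range(n_sites)] for i in range(6)]
profiles = group_a + group_b

embedding = pca_project(profiles, n_components=2)  # stage 1: dimension reduction
labels = kmeans(embedding, k=2)                    # stage 2: k-means on the embedding
print(labels)
```

On this toy input the two mutation groups end up in separate clusters. To follow the paper's recommendation more closely, the first stage could be replaced by a UMAP embedding, for example `umap.UMAP(n_components=2).fit_transform(X)` from the `umap-learn` package, with the second stage left unchanged.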
Taxonomy
Topics: SARS-CoV-2 and COVID-19 Research · COVID-19 diagnosis using AI · SARS-CoV-2 detection and testing
