Effective and scalable clustering of SARS-CoV-2 sequences
Sarwan Ali, Tamkanat-E-Ali, Muhammad Asad Khan, Imdadullah Khan, Murray Patterson

TL;DR
This paper introduces a scalable clustering method for millions of SARS-CoV-2 sequences, enabling efficient identification of variants and their spread over time, which is crucial for vaccine development and epidemiological studies.
Contribution
The authors develop a novel k-mer based clustering approach that is both effective and scalable for analyzing vast SARS-CoV-2 sequence datasets, surpassing traditional phylogenetic methods.
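To make the idea of k-mer based clustering concrete, here is a minimal sketch, not the authors' implementation. It turns each sequence into a vector of k-mer counts and groups the vectors with plain k-means. The summary above does not say which clustering algorithm, value of k, or alphabet the paper uses, so k-means, k=3, the nucleotide alphabet, the function names (kmer_vector, sq_dist, kmeans), and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_vector(seq, k=3, alphabet="ACGT"):
    """Count every length-k substring of seq, reported in a fixed k-mer order.

    k=3 and the nucleotide alphabet are illustrative assumptions; k-mers that
    contain characters outside the alphabet are simply not reported.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=20):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering step).

    Centroids start from the first n_clusters vectors, so runs are deterministic.
    """
    centroids = [list(v) for v in vectors[:n_clusters]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy sequences, made up for illustration (not real SARS-CoV-2 data).
seqs = ["AAAAAAAAAA", "CCCCCCCCCC", "AAAAAAAAAC", "CCCCCCCCCA"]
labels = kmeans([kmer_vector(s) for s in seqs], n_clusters=2)
print(labels)
```

Because a k-mer vector has a fixed length regardless of sequence length, and counting is linear in the sequence, this kind of representation avoids the pairwise alignment and tree construction that limit phylogenetic methods on millions of sequences. For protein sequences, the same sketch would take a 20-letter amino acid alphabet through the alphabet parameter.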
Findings
Successfully identified major SARS-CoV-2 variants from large datasets
Estimated variant proportions and spread rates over time and locations
Highlighted key amino acid positions relevant for variant discrimination
Abstract
SARS-CoV-2, like any other virus, continues to mutate as it spreads, according to an evolutionary process. Unlike any other virus, the number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. This amount of data has the potential to uncover the evolutionary dynamics of a virus like never before. However, a million is already several orders of magnitude beyond what can be processed by the traditional methods designed to reconstruct a virus's evolutionary history, such as those that build a phylogenetic tree. Hence, new and scalable methods will need to be devised in order to make use of the ever increasing number of viral sequences being collected. Since identifying variants is an important part of understanding the evolution of a virus, in this paper, we propose an approach based on clustering sequences to identify the…
Taxonomy
Topics: Vaccines and immunoinformatics approaches · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
Methods: Feature Selection
