Characterizing SARS-CoV-2 Spike Sequences Based on Geographical Location
Sarwan Ali, Babatunde Bello, Zahra Tayebi, Murray Patterson

TL;DR
This paper presents a scalable machine learning approach to classify SARS-CoV-2 spike protein sequences by geographical location using k-mer based numerical representations, outperforming baseline methods.
Contribution
Introduces a novel scalable method combining k-mer features and machine learning for geographic classification of viral sequences, addressing data size challenges.
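As a rough illustration of what a k-mer based numerical representation can look like, the sketch below turns a protein sequence into a fixed-length vector of k-mer counts. It is a minimal sketch, not the authors' implementation: the choice of k = 3, the 20-letter amino acid alphabet, the function names, and the toy sequence are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids. Real spike data may contain
# extra symbols (for example X for unknown residues) that are ignored here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Map a sequence to a count vector of length len(alphabet) ** k."""
    counts = kmer_counts(sequence, k)
    # Enumerate all possible k-mers in a fixed order so that every
    # sequence is described along the same axes.
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


# Hypothetical toy fragment, not a real GISAID record.
toy = "MFVFLVLLPLVSSQ"
vec = kmer_vector(toy, k=3)
print(len(vec), sum(vec))
```

With 20 amino acids and k = 3 the vector has 20^3 = 8000 entries regardless of sequence length, which is what lets sequences of different lengths be fed to a standard classifier. The vector is mostly zeros for any single sequence, so a sparse representation would likely be preferable at the scale of millions of sequences.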
Findings
The proposed model significantly outperforms baseline classifiers on geographic classification of spike sequences.
Identifies key amino acids influencing classification.
Demonstrates scalability to large genomic datasets.
Abstract
With the rapid spread of COVID-19 worldwide, viral genomic data is available on the order of millions of sequences in public databases such as GISAID. This Big Data creates a unique opportunity for analysis that supports the development of effective vaccines for current pandemics and helps avoid or mitigate future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected; the patterns found between viral variants and geographical location are surely an important part of this analysis. One major challenge that researchers face is processing such huge, high-dimensional data to obtain useful insights as quickly as possible. Most existing methods face scalability issues when dealing with data of this magnitude. In this paper, we propose an approach that first computes a numerical representation of the…
Taxonomy
Topics: Machine Learning in Bioinformatics · Fractal and DNA sequence analysis · Vaccines and immunoinformatics approaches
