Characterizing SARS-CoV-2 Spike Sequences Based on Geographical Location

Sarwan Ali; Babatunde Bello; Zahra Tayebi; Murray Patterson

arXiv:2110.00809·cs.LG·October 14, 2022

TL;DR

This paper presents a scalable machine learning approach that classifies SARS-CoV-2 spike protein sequences by geographical location using k-mer-based numerical representations, and reports that it outperforms baseline methods.

Contribution

Introduces a novel scalable method combining k-mer features and machine learning for geographic classification of viral sequences, addressing data size challenges.
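
To make the idea of a k-mer feature concrete, here is a minimal sketch of turning an amino acid sequence into a fixed-length vector of k-mer counts. It is an illustration only, not the authors' implementation: the function names `kmer_counts` and `kmer_vector`, the choice k = 3, the 20-letter alphabet, and the toy input string are all assumptions, since this summary does not state the paper's exact settings.

```python
from collections import Counter
from itertools import product

# Assumed alphabet: the 20 standard amino acids (not taken from the paper).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k=3):
    """Slide a window of length k over the sequence and count each k-mer."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Map a sequence to a fixed-length vector with one slot per possible k-mer."""
    # Enumerate all |alphabet|^k k-mers in a fixed order so every sequence
    # is embedded in the same coordinate system.
    index = {"".join(p): j for j, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)
    for kmer, count in kmer_counts(sequence, k).items():
        j = index.get(kmer)
        if j is not None:  # skip k-mers containing characters outside the alphabet
            vector[j] += count
    return vector


# Toy input for illustration only (hypothetical, not real data).
toy_sequence = "MFVFLVLLPLVSSQ"
toy_vector = kmer_vector(toy_sequence, k=3)
```

Vectors built this way have the same length for every sequence, so they can be passed to any standard classifier with the collection location as the label. The vector length grows as 20^k, which is one reason the choice of k matters for scalability.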

Findings

01. Model significantly outperforms baseline classifiers.

02. Identifies key amino acids influencing classification.

03. Demonstrates scalability to large genomic datasets.

Abstract

With the rapid spread of COVID-19 worldwide, viral genomic data is available on the order of millions of sequences in public databases such as GISAID. This Big Data creates a unique opportunity for analysis that supports research on effective vaccine development for the current pandemic and on avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected, and the patterns found between viral variants and geographical location are surely an important part of this analysis. One major challenge that researchers face is processing such huge, high-dimensional data to obtain useful insights as quickly as possible. Most existing methods face scalability issues when dealing with data of this magnitude. In this paper, we propose an approach that first computes a numerical representation of the…

Code & Models

Repositories

sarwanpasha/covid-19-country-classification (tf · Official)

Taxonomy

Topics: Machine Learning in Bioinformatics · Fractal and DNA sequence analysis · Vaccines and immunoinformatics approaches