Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model
Sarwan Ali

TL;DR
This paper introduces an alignment-free, graph-based machine learning approach for analyzing SARS-CoV-2 spike protein sequences, improving clustering and classification accuracy over existing methods.
Contribution
The study presents a novel method that converts protein sequences into sequence similarity networks, so that graph-based ML methods can be applied without first aligning the sequences.
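As a rough illustration of what turning sequences into a nearest-neighbor similarity network can look like, the sketch below maps each sequence to a k-mer count vector and links every sequence to its closest neighbors. This is not the paper's pipeline: the k-mer featurization, the Euclidean distance, the function names `kmer_vector` and `knn_graph`, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def kmer_vector(seq, k=2):
    """Count overlapping k-mers and return a fixed-length vector (20**k entries)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get("".join(p), 0) for p in product(AMINO_ACIDS, repeat=k)]


def knn_graph(vectors, n_neighbors=2):
    """Return an undirected edge set linking each node to its nearest neighbors.

    Distance is plain Euclidean here (an assumption of this sketch).
    """
    edges = set()
    for i, v in enumerate(vectors):
        dists = sorted(
            (math.dist(v, w), j) for j, w in enumerate(vectors) if j != i
        )
        for _, j in dists[:n_neighbors]:
            edges.add((min(i, j), max(i, j)))  # store each edge once
    return edges


# Hypothetical toy sequences of different lengths; no alignment is performed.
sequences = [
    "MFVFLVLLPLVSSQ",
    "MFVFLVLLPLVSSQCV",
    "MKTIIALSYIFCLVFA",
    "MKTIIALSYIFCLVFADY",
]
vectors = [kmer_vector(s) for s in sequences]
edges = knn_graph(vectors, n_neighbors=1)
print(sorted(edges))
```

The resulting edge set could then be handed to a graph library for clustering or node embedding; which embedding method and parameters to use is outside what this sketch covers.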
Findings
Outperforms state-of-the-art clustering methods
Achieves higher classification accuracy with Node2Vec embeddings
Demonstrates effectiveness on unaligned and unassembled sequence data
Abstract
The SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans. Like many coronaviruses, it can adapt to different hosts and evolve into different lineages. It is well known that the major SARS-CoV-2 lineages are characterized by mutations that occur predominantly in the spike protein. Understanding the spike protein structure and how it can be perturbed is vital for determining whether a lineage is of concern. This understanding is crucial to identifying and controlling current outbreaks and preventing future pandemics. Machine learning (ML) methods are a viable way to support this effort, given the volume of available sequencing data, much of which is unaligned or even unassembled. However, such ML methods require fixed-length numerical feature vectors in Euclidean space to be applicable. Similarly, Euclidean space is not considered the best choice when working with the…
Taxonomy
Topics: Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks · SARS-CoV-2 and COVID-19 Research
