Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model

Sarwan Ali

arXiv:2211.10546·cs.LG·November 23, 2022·1 cites

Open Access

TL;DR

This paper introduces an alignment-free, graph-based machine learning approach for analyzing SARS-CoV-2 spike protein sequences, improving clustering and classification accuracy over existing methods.

Contribution

The study presents a novel method that converts protein sequences into sequence similarity networks, enabling graph-based ML analysis without sequence alignment.
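To make the idea of a sequence similarity network concrete, here is a minimal sketch of one way such a network could be built without alignment. It is not the paper's implementation: the k-mer count representation, the cosine similarity measure, the helper names (`kmer_vector`, `cosine`, `knn_graph`), and the parameter values are all assumptions chosen for illustration, and the input strings are made-up toy sequences rather than real spike data.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k=2):
    """Count overlapping k-mers in a sequence (assumed representation)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_graph(seqs, k_neighbors=1, kmer_size=2):
    """Link each sequence to its k most similar sequences.

    Returns a sorted list of undirected edges (i, j) with i < j.
    Ties in similarity are broken by the lower sequence index.
    """
    vecs = [kmer_vector(s, kmer_size) for s in seqs]
    edges = set()
    for i, vi in enumerate(vecs):
        sims = [(cosine(vi, vj), j) for j, vj in enumerate(vecs) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))
        for _, j in sims[:k_neighbors]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy input: two near-identical strings and one unrelated string.
toy_sequences = ["MFVFLVLLPL", "MFVFLVLLPV", "QQQQQQQQQQ"]
print(knn_graph(toy_sequences, k_neighbors=1, kmer_size=2))
```

The resulting edge list can be handed to any graph library, at which point node embedding methods such as Node2Vec become applicable. Because no alignment step is involved, the sequences do not need to have equal length.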

Findings

01 Outperforms state-of-the-art clustering methods

02 Achieves higher classification accuracy with Node2Vec embeddings

03 Demonstrates effectiveness on unaligned and unassembled sequence data

Abstract

The SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans. Like many coronaviruses, it can adapt to different hosts and evolve into different lineages. It is well known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein. Understanding the spike protein structure and how it can be perturbed is vital for understanding and determining if a lineage is of concern. These are crucial to identifying and controlling current outbreaks and preventing future pandemics. Machine learning (ML) methods are a viable solution to this effort, given the volume of available sequencing data, much of which is unaligned or even unassembled. However, such ML methods require fixed-length numerical feature vectors in Euclidean space to be applicable. Similarly, Euclidean space is not considered the best choice when working with the…

Taxonomy

Topics: Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks · SARS-CoV-2 and COVID-19 Research