TL;DR
This paper introduces a robust feature selection method combined with efficient representation techniques to improve clustering accuracy of SARS-CoV-2 spike protein sequences, aiding in variant analysis and pandemic response.
Contribution
The study presents a novel feature selection approach that enhances clustering of SARS-CoV-2 spike sequences, improving variant differentiation accuracy.
Findings
Higher F1 scores achieved with proposed feature selection.
Effective clustering of spike sequences using k-mers and feature selection.
Improved differentiation of SARS-CoV-2 variants.
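As a rough illustration of the k-mer representation mentioned above, the sketch below turns a protein sequence into a fixed-length k-mer count vector and then drops uninformative columns. It is not the paper's method: the function names, the 20-letter alphabet, `k=3`, the toy sequences, and the simple variance filter (a stand-in for the proposed feature selection, which this summary does not specify) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino-acid letters; real data may contain others.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return a count vector with one entry per possible k-mer (len(alphabet)**k)."""
    counts = kmer_counts(sequence, k)
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts.get(kmer, 0) for kmer in vocabulary]


def select_features(vectors, min_variance=0.0):
    """Hypothetical stand-in filter: keep columns whose variance exceeds a threshold."""
    n = len(vectors)
    keep = []
    for j in range(len(vectors[0])):
        column = [v[j] for v in vectors]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        if variance > min_variance:
            keep.append(j)
    return keep


# Toy sequences (not real spike data) that differ only in the last residue.
toy_sequences = ["MFVFL", "MFVFI"]
toy_vectors = [kmer_vector(s) for s in toy_sequences]
informative_columns = select_features(toy_vectors)
reduced_vectors = [[v[j] for j in informative_columns] for v in toy_vectors]
```

The reduced vectors could then be passed to any standard clustering routine; which algorithm and which selection criterion the authors actually use is not stated in this summary.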
Abstract
The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, as a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail not possible for any previous virus. On the one hand, this will help biologists, policy makers, and other authorities to make timely and appropriate decisions to control the spread of the coronavirus. On the other hand, such studies will help in dealing more effectively with any possible future pandemic. Since the SARS-CoV-2 virus has many variants, each with different mutations, performing any analysis on such data becomes a difficult task. It is well known that variation in the SARS-CoV-2 genome occurs disproportionately in the spike region of the genome sequence, the relatively short region that codes for the spike protein(s). Hence, in this paper, we…
Taxonomy
Methods: Feature Selection
