Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses
Troy Hernandez, Jie Yang

TL;DR
This paper introduces a new alignment-free genome vectorization method, the generalized vector, which improves virus classification accuracy while keeping the speed of existing alignment-free methods, enabling automated phylogenetic classification of viruses.
Contribution
The paper proposes the generalized vector, an alignment-free genome representation that outperforms existing methods in virus classification tasks.
Findings
The generalized vector outperforms other methods in virus classification accuracy.
The method maintains the speed of existing alignment-free techniques.
It effectively classifies viruses across different phylogenetic levels.
Abstract
The typical process for classifying and submitting a newly sequenced virus to the NCBI database involves two steps. First, a BLAST search is performed to determine likely family candidates. That is followed by checking the candidate families with the Pairwise Sequence Alignment tool for similar species. The submitter's judgement is then used to determine the most likely species classification. The aim of this paper is to show that this process can be automated into a fast, accurate, one-step process using the proposed alignment-free method and properly implemented machine learning techniques. We present a new family of alignment-free vectorizations of the genome, the generalized vector, that maintains the speed of existing alignment-free methods while outperforming all available methods. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in…
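The abstract mentions the frequency of genomic words (k-mers) as the starting point for the vectorization. As a rough illustration of that ingredient only, the sketch below counts overlapping k-mers in a nucleotide string and returns their relative frequencies in a fixed order. It is plain k-mer counting, not the generalized vector itself, whose definition is not given in this excerpt. The function name `kmer_frequencies`, the default `k=3`, and the choice to skip words containing characters outside `ACGT` are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return relative frequencies of all len(alphabet)**k words.

    Words are listed in the lexicographic order induced by `alphabet`.
    Hypothetical helper for illustration; not the paper's generalized vector.
    """
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed feature order so that vectors from different genomes are comparable.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    # Windows containing ambiguous bases (e.g. N) are ignored (an assumption).
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


vector = kmer_frequencies("ACGTACGT", k=2)
print(len(vector))  # 16 features for k=2 over a 4-letter alphabet
```

A fixed-length vector of this kind can be passed to a standard classifier without any sequence alignment, which is what makes alignment-free approaches fast.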
Taxonomy
Topics: Genomics and Phylogenetic Studies · Genetic diversity and population structure · Chromosomal and Genetic Variations
