Virus genome sequence classification using features based on nucleotides, words and compression
T. Wang, M. Herbster, I.S. Mian

TL;DR
This study evaluates various genome sequence features and classifiers for virus taxonomy, achieving high accuracy in classifying viruses into ICTV Orders using sequence-derived features and machine learning.
Contribution
It introduces a comprehensive comparison of nucleotide, word, and compression-based features with classifiers for virus genome classification, highlighting the effectiveness of 4-mer counts and SVM.
Findings
Best feature-classifier combination: 4-mer counts with SVM (error rate 0.006)
Worst combination: genome length with k-NN, which still performs reasonably well
Predicted orders for unclassified viruses with high accuracy
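To make the "4-mer counts" feature concrete, the following is a minimal sketch of how a genome sequence could be turned into a fixed-length vector of overlapping word counts. It is an illustration under stated assumptions, not the authors' implementation: the function name `kmer_counts`, the restriction to the four-letter alphabet A, C, G, T, and the toy sequence are all hypothetical, and the summary does not say how the paper handles ambiguity codes or normalisation.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_counts(sequence, k=4):
    """Return a vector of overlapping k-mer counts of length 4**k."""
    sequence = sequence.upper()
    # Slide a window of width k over the sequence, one position at a time.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate every possible word in lexicographic order so that each
    # genome maps to a vector with the same layout. Windows containing
    # other symbols (for example N) are not in this list and are ignored.
    vocabulary = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [counts[word] for word in vocabulary]


# Toy sequence (hypothetical, far shorter than any real viral genome).
toy_genome = "ACGTACGTAC"
features = kmer_counts(toy_genome, k=4)
```

A vector like `features`, computed for each reference genome, could then be passed to any standard classifier such as an SVM or k-NN; dividing the counts by their sum would be one way to reduce the influence of genome length, though whether the paper does this is not stated here.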
Abstract
The ICTV develops, refines and maintains a universal virus taxonomy; Order is the highest taxon in the branching hierarchy of recognised viral taxa. Historically, ICTV (sub)committees have classified viruses on the basis of morphological characteristics and various other attributes. Today, virtually all new viral genomes are assembled from metagenomic datasets and are not linked directly to biological agents. Thus, placing a virus into a taxonomic scheme solely from primary genome structure is an increasingly important problem. Various simple descriptive statistics of a viral genome sequence have been used successfully for virus classification. Here, we use the NCBI's viral and viroid reference sequence collection (RefSeq) and a common experimental framework to compare the performance of different genome sequence-derived features and classifiers in the task of assigning a virus to one…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Bacteriophages and microbial interactions · RNA and protein synthesis mechanisms
