On the verge of life: Distribution of nucleotide sequences in viral RNAs
Mykola Husev, Andrij Rovenchak

TL;DR
This study analyzes rank-frequency distributions of nucleotide sequences in single-stranded RNA viruses and uses the resulting parameters to classify and compare viruses. The parameters correlate with species and differ among coronavirus sequences, which may aid disease classification.
Contribution
Introduces a novel approach using distribution parameters of nucleotide sequences for virus classification and comparison, expanding existing methods.
Findings
For fixed-length chunks of four nucleotides, the Pólya and negative hypergeometric distributions give the best fit.
Parameters like entropy and mean sequence length effectively classify viruses.
Distinct nucleotide patterns found in coronavirus sequences related to virus type.
Abstract
The aim of the study is to analyze viruses using parameters obtained from distributions of nucleotide sequences in the viral RNA. To keep the input data homogeneous, we analyze single-stranded RNA viruses only. Two approaches are used to obtain the nucleotide sequences. In the first one, chunks of equal length (four nucleotides) are considered. In the second approach, the whole RNA genome is divided into parts, with adenine or the most frequent nucleotide serving as a "space". Rank-frequency distributions are studied in both cases. Within the first approach, the Pólya and the negative hypergeometric distributions yield the best fit. For the distributions obtained within the second approach, we have calculated a set of parameters, including entropy, mean sequence length, and its dispersion. The calculated parameters became the basis for the classification of viruses. We observed that…
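The two ways of obtaining nucleotide sequences described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, and several details are assumptions that the abstract does not specify: chunks are taken without overlap, an incomplete trailing chunk is dropped, empty pieces between adjacent "space" nucleotides are discarded, entropy is the Shannon entropy (natural logarithm) of the relative frequencies of distinct sequences, and dispersion is the population variance of sequence lengths.

```python
from collections import Counter
from math import log


def rank_frequency(items):
    """Frequencies of distinct items, sorted so that rank 1 comes first."""
    return sorted(Counter(items).values(), reverse=True)


def fixed_chunks(genome, size=4):
    """First approach: cut the genome into chunks of equal length.

    Assumption: chunks do not overlap and an incomplete tail is dropped.
    """
    usable = len(genome) - len(genome) % size
    return [genome[i:i + size] for i in range(0, usable, size)]


def split_by_space(genome, space=None):
    """Second approach: split the genome on a 'space' nucleotide.

    If no space is given, the most frequent nucleotide is used.
    Assumption: empty pieces between adjacent spaces are discarded.
    """
    if space is None:
        space = Counter(genome).most_common(1)[0][0]
    return [piece for piece in genome.split(space) if piece]


def parameters(sequences):
    """Entropy, mean sequence length, and dispersion of sequence lengths.

    Assumptions: Shannon entropy with natural logarithm; dispersion is
    the population variance of the lengths.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    lengths = [len(s) for s in sequences]
    mean = sum(lengths) / len(lengths)
    variance = sum((x - mean) ** 2 for x in lengths) / len(lengths)
    return {"entropy": entropy, "mean_length": mean, "dispersion": variance}


# Toy input for illustration only, not a real viral genome.
toy = "AUGCAUGCGGAU"
chunk_ranks = rank_frequency(fixed_chunks(toy))
adenine_params = parameters(split_by_space(toy, space="A"))
```

Fitting the Pólya or negative hypergeometric distribution to the resulting rank-frequency data is a separate step that this sketch does not cover.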
