A novel approach to analyzing the evolution of SARS-CoV-2 based on visualization and clustering of large genetic data compactly represented in operative memory
A.Yu. Palyanov, N.V. Palyanova

TL;DR
This paper introduces a compact data representation method for SARS-CoV-2 genomes, enabling efficient analysis of millions of variants in a PC's memory.
Contribution
A lossless compression and visualization approach for SARS-CoV-2 genomes that achieves a 1:1500 compression ratio and yields detailed evolutionary insights.
Findings
The method losslessly compresses ≈465 Gb of raw FASTA data (16.8 million SARS-CoV-2 genomes) to ≈330 Mb in RAM.
Evolutionary patterns were visualized using 3D surfaces over 4.5 years with weekly resolution.
The approach shows viral evolution in more detail than traditional phylogenetic trees.
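The sizes in the first finding can be sanity-checked from the genome count and length given in the abstract. The short calculation below is only a back-of-the-envelope sketch: it assumes one byte per nucleotide in the raw FASTA data and binary units (1 Gb = 2^30 bytes, 1 Mb = 2^20 bytes), and all variable names are illustrative rather than taken from the paper.

```python
# Back-of-the-envelope check of the quoted sizes.
# Assumptions (not stated in the source): 1 byte per nucleotide,
# binary units (1 Gb = 2**30 bytes, 1 Mb = 2**20 bytes).
N_GENOMES = 16_800_000      # genomes in the raw FASTA data (from the abstract)
GENOME_LEN_NT = 29_900      # approximate genome length in nucleotides
COMPRESSED_MB = 330         # in-memory size reported by the authors

total_nt = N_GENOMES * GENOME_LEN_NT              # total nucleotides
raw_gb = total_nt / 2**30                         # raw size in binary gigabytes
ratio = raw_gb * 1024 / COMPRESSED_MB             # raw size / compressed size
bytes_per_genome = COMPRESSED_MB * 2**20 / N_GENOMES

print(f"total nucleotides: {total_nt:.3e}")
print(f"raw size:          {raw_gb:.0f} Gb")
print(f"compression ratio: about 1:{ratio:.0f}")
print(f"average footprint: {bytes_per_genome:.1f} bytes per genome")
```

Under these assumptions the raw size comes out in the mid-400s of Gb and the ratio lands slightly below the quoted 1:1500, which is consistent with the rounded figures in the text. The average footprint of a few tens of bytes per genome suggests that genomes must share stored data rather than each carrying a full independent mutation list.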
Abstract
SARS-CoV-2 is a virus for which an outstanding number of genome variants were collected, sequenced and stored from sources all around the world. Raw data in FASTA format include 16.8 million genomes, each ≈29,900 nt (nucleotides), with a total size of ≈500 × 10⁹ nt, or 465 Gb. We suggest an approach to data representation and organization, with which all this can be stored losslessly in the operative memory (RAM) of a common PC. Moreover, just ≈330 Mb will be enough. Aligning all genomes versus the initial Wuhan-Hu-1 reference sequence allows each to be represented as a data structure containing lists of point mutations, deletions and insertions. Our implementation of such data representation resulted in a 1:1500 compression ratio (for comparison, compression of the same data with the popular WinRAR archiver gives only 1:62) and fast access to genomes (and their metadata) and…
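The reference-based representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `GenomeDiff` class, its field layout, and the `encode`/`decode` helpers are hypothetical names introduced here, and the sketch assumes each genome is already available as a gapped pairwise alignment against the reference (two equal-length strings with `-` for gaps). It does not attempt the memory layout or data sharing needed to reach the reported ratio.

```python
from dataclasses import dataclass, field


@dataclass
class GenomeDiff:
    """Differences of one genome from the reference (hypothetical layout)."""
    # 0-based reference position -> substituted base
    substitutions: dict[int, str] = field(default_factory=dict)
    # (start, length) runs of deleted reference positions
    deletions: list[tuple[int, int]] = field(default_factory=list)
    # reference position -> bases inserted immediately before it
    insertions: dict[int, str] = field(default_factory=dict)


def encode(ref_aln: str, qry_aln: str) -> GenomeDiff:
    """Turn a gapped pairwise alignment into a list-of-differences record."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    diff = GenomeDiff()
    pos = 0  # current position in the ungapped reference
    for r, q in zip(ref_aln, qry_aln):
        if r == "-":                      # insertion relative to the reference
            if q != "-":
                diff.insertions[pos] = diff.insertions.get(pos, "") + q
            continue
        if q == "-":                      # deletion: extend the last run or start one
            if diff.deletions and sum(diff.deletions[-1]) == pos:
                start, length = diff.deletions[-1]
                diff.deletions[-1] = (start, length + 1)
            else:
                diff.deletions.append((pos, 1))
        elif q != r:                      # point mutation
            diff.substitutions[pos] = q
        pos += 1
    return diff


def decode(reference: str, diff: GenomeDiff) -> str:
    """Rebuild the full genome from the reference and its difference record."""
    deleted = set()
    for start, length in diff.deletions:
        deleted.update(range(start, start + length))
    out = []
    for pos, base in enumerate(reference):
        if pos in diff.insertions:
            out.append(diff.insertions[pos])
        if pos not in deleted:
            out.append(diff.substitutions.get(pos, base))
    # insertions appended after the last reference base
    out.append(diff.insertions.get(len(reference), ""))
    return "".join(out)


# Toy example with a made-up 8-nt "reference"
reference = "ACGTACGT"
diff = encode("ACG-TACGT", "ATGGT--GT")
restored = decode(reference, diff)
```

In this toy example the record holds one substitution, one insertion and one two-base deletion, and `decode` returns the original ungapped query, so the round trip is lossless for plain A/C/G/T input. Because most genomes differ from the reference at only a small number of positions, such records are far smaller than the full sequence, which is the intuition behind the compression described above.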
Taxonomy
Topics: Genomics and Phylogenetic Studies · Bacteriophages and microbial interactions · SARS-CoV-2 and COVID-19 Research
