Comparing full variation profile analysis with the conventional consensus method in SARS-CoV-2 phylogeny

Regina Nóra Fiam; Csabai István; Solymosi Norbert

PMC · DOI:10.1093/bib/bbae296·June 26, 2024

Comparing full variation profile analysis with the conventional consensus method in SARS-CoV-2 phylogeny

Regina Nóra Fiam, Csabai István, Solymosi Norbert

PDF

Open Access

TL;DR

This paper introduces a new method for analyzing SARS-CoV-2 mutations that improves accuracy by considering all nucleotides, not just the most common ones.

Contribution

The novel genomic matrix approach captures low-frequency variants, improving mutation detection accuracy compared to consensus methods.

Findings

01

The genomic matrix method showed a 20% average accuracy improvement over consensus methods in simulated datasets.

02

Real-world data analysis revealed a 15% reduction in error margin using the genomic matrix approach.

03

The method provides a more accurate representation of SARS-CoV-2 genomic diversity and evolution.

Abstract

This study proposes a novel approach to studying severe acute respiratory syndrome coronavirus 2 virus mutations through sequencing data comparison. Traditional consensus-based methods, which focus on the most common nucleotide at each position, might overlook or obscure the presence of low-frequency variants. Our method, in contrast, retains all sequenced nucleotides at each position, forming a genomic matrix. Utilizing simulated short reads from genomes with specified mutations, we contrasted our genomic matrix approach with the consensus sequence method. Our matrix methodology, across multiple simulated datasets, accurately reflected the known mutations with an average accuracy improvement of 20% over the consensus method. In real-world tests using data from GISAID and NCBI-SRA, our approach demonstrated an increase in reliability by reducing the error margin by approximately 15%.…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Severe acute respiratory syndrome coronavirus 2(no rank)

Diseases1

severe acute respiratory syndrome

Figures9

Click any figure to enlarge with its caption.

Schematic diagram of the data processing and the simulation construction.

Visualization of the three approaches. The numbers on the axes refer to the order of the simulated sequencing samples listed in Table 1.

Deletions around the genome marked with ‘-’ in purple

Visualization of the dendrograms of the genome matrix, consensus-based approach, and the true composition’s distance matrices. It can be seen that compared with the true composition, the consensus-based approach amplifies or smoothens the differences, whereas based on genome matrices, we do not obtain such outliers in the relationships.

Visualization of the similarity of real data using heatmaps.

Deletions along the EPI_ISL_665636 genome marked with ‘-’ in purple.

Zooming into the deletion sites within the ERR4892461 genome. In the left column, it can be seen that ‘holes’ appear in the number of nucleotides at the deletion sites. On the right, zooming into these positions, we can see that there are also alternative nucleotides here, which however, stand out from the noise of the non-deletion sites.

The 16th sample greatly differs from the others, but according to the genome matrix, it shows similarity with several samples. The consensus-based approach organizes the samples into fewer clusters. Accordingly, we concluded here, similar to what was observed in simulations where the true composition was known, that the representation of variability decreases when using consensus.

Entropy sequences along the entire genome: visualization of the information content of genomes.

Tables5

Table 1. The table represents the true composition (coverage) of the simulated sample mixtures

sample	cheesy	glorious	mushroom	perfume	slinky
1	50	100	150	200	250
2	250	50	50	10	10
3	30	20	100	150	70
4	10	200	40	30	10
5	100	100	100	100	100
6	10	20	15	500	5
7	40	120	0	10	200
8	30	500	70	0	20

Table 2. Real data in different databases

	GISAID ID	SRA ID
1.	EPI_ISL_584662	ERR4692877
2.	EPI_ISL_584663	ERR4693014
3.	EPI_ISL_584664	ERR4693156
4.	EPI_ISL_664742	ERR4892128
5.	EPI_ISL_665153	ERR4891061
6.	EPI_ISL_665157	ERR4890337
7.	EPI_ISL_665158	ERR4891869
8.	EPI_ISL_665166	ERR4890532
9.	EPI_ISL_665169	ERR4891959
10.	EPI_ISL_665198	ERR4890285
11.	EPI_ISL_665199	ERR4892877
12.	EPI_ISL_665203	ERR4891675
13.	EPI_ISL_665231	ERR4891211
14.	EPI_ISL_665232	ERR4893062
15.	EPI_ISL_665233	ERR4892295
16.	EPI_ISL_665636	ERR4892461
17.	EPI_ISL_665848	ERR4891212

Table 3. Comparison of the genome matrix and consensus-based approach

Comparison	Distance	Pearson correlation	P-value
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $d(D_{GM}, D_{cons})$\end{document}	0.468	0.780	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $<10^{-15}$\end{document}
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $d(D_{GM}, D_{true})$\end{document}	0.232	0.888	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $<10^{-15}$\end{document}
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $d(D_{cons}, D_{true})$\end{document}	0.454	0.791	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $<10^{-15}$\end{document}

Table 4. Comparison of distance and Pearson correlation for genome matrices and consensus

Comparison	Distance	Pearson correlation	P-value
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $d(D_{realGM}, D_{GIs})$\end{document}	0.335	0.759	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $<10^{-15}$\end{document}

Table 5. Comparison of genome matrix method and conventional consensus method

Feature	Genome matrix method	Conventional consensus method
Data retention	Retains all nucleotide sequences at each position	Focuses on the most common nucleotide per position
Accuracy in mutation detection	High, by preserving the detail of low-frequency variants	Lower, due to potential omission of low-frequency variants
Method validation	Tested with both simulated and real-world datasets from GISAID and NCBI-SRA	Validated with less diverse datasets
Measurement of diversity	Uses distance matrices and Pearson correlation to closely reflect true genetic compositions	Relies on simpler genetic comparisons, potentially overlooking within-host diversity
Handling of low-frequency variants	Explicit focus on including and analyzing variants with low frequency	Often ignores or under-represents low-frequency variants
Epidemiological insights	Provides superior insights into virus evolution and epidemiology by accurately capturing genomic diversity	Less effective in capturing comprehensive genomic diversity
Applications recommended	Future applications in broader viral genomic studies and potential adaptation to other pathogens	Limited to standard epidemiological tracking and mutation analysis

Equations4

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& D_{i,j} = \sqrt{\sum_{k=1}^{4} \sum_{g=1}^{l} (G_{k,g}^{(i)}-G_{k,g}^{(j)})^{2}}\end{align*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& \left\| A \right\|_{F} = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^{2}},\end{align*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& d(A,B) = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m}(A_{ij}-B_{ij})^{2}},\end{align*}\end{document}

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& H = - \sum_{i} p_{i} \log_{2} p_{i},\end{align*}\end{document}

Funding1

—EU Horizon 2020

Keywords

NGS sequencingviral variantsfull variation profile analysisSARS-CoV-2

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenomics and Phylogenetic Studies · Evolution and Genetic Dynamics · Bacteriophages and microbial interactions

Full text

Introduction

In conventional approaches to sequencing the genome of viruses, a consensus sequence is determined [1, 2]. The consensus sequence represents the most frequent nucleotide at each position following short-read alignment. There have been attempts to align a large number of consensus sequences with each other, thus obtaining frequencies assignable to each position [3]. However, one sequence alone does not necessarily accurately represent the composition of the sample, as it represents only a single sequence. Meanwhile, a sample comprises a population of many viruses, with different variants potentially carrying various mutations. This form of reduction results in a loss of information, potentially distorting sample similarity when comparing consensus sequences, and neglecting the ‘within-host diversity’ that characterizes the viral population within a host.

Numerous strategies have been developed to maintain this variety [4]. Notably, most efforts are documented in the HIV literature, referring predominantly to an approach known as profile sampling. The common factor in these methods is that they assign a frequency to each nucleotide (or codon) following read alignment, creating what constitutes the patient’s HIV profile. The effectiveness of this approach can be enhanced by creating multiple samples from a single frequency table and determining the within-host distribution using these developed profiles through repeated sampling [5]. The thinking here goes beyond consensus, shifting away from focusing on specific strains. However, we utilize one sample per patient, with our primary interest lying in the relations between these samples, rather than the distribution of strains within a sample.

To address this, we aim to explore how a more accurate depiction of the relationship between individual genomic sequences can be achieved. In this method, the comparison of genome matrices forms the basis for determining the degree of similarity, as opposed to consensus sequences that represent only a single sequence. Each row of the genome matrix represents a genomic position, while the columns indicate the frequency of individual nucleotides. As we have increased the number of dimensions in this approach (since each base now has its own), we expect to lose less information in the comparison process.

We will examine the efficacy of the two methods using simulated sequencing results, then compare the findings and examine which scenario provides a more accurate picture of the known relationships. Finally, we will conduct similarity studies according to the two methods using actual coronavirus sequencing data.

Materials and Methods

Figure 1 shows the workflow that was performed, and later described in details.

Schematic diagram of the data processing and the simulation construction.

Bioinformatic analysis of the simulated reads

First, we generated synthetic samples based on the Table 1.

In each sample, we precisely defined the mutations that define the variants present in the simulation [6], as well as the coverage of each variant. For this, we used the randomreads.sh tool from BBMap [7]. We used FASTA files prepared based on the individual variants as reference genomes, generating paired-end reads with a length of 150 base pairs and the desired coverage value. We combined the paired-end reads into a single file. The $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ sequence.fasta $\end{document}$ file is the reference genome of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which we indexed with BWA (Version: 0.7.17-r1188), and aligned the generated FASTQ files [8]. We transformed the resulting .sam file into a .bam file using the samtools view -Sb command [9]. Then we sorted the .bam file using the samtools sort command and indexed it using the samtools index command, creating the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ .bai $\end{document}$ files. The consensus sequence was created using bcftools consensus. Firstly, we generated the mpileup file using the bcftools mpileup - Ou -f tool with the reference genome and the .bam file. Afterwards, we utilized the command ‘bcftools call –ploidy 1 -mv -Oz -o’ to generate the ‘vcf’ file. This file was then employed to create the consensus using the command ‘bcftools consensus -f’ [10].

The idea for the genome matrix was given by the Position Probability Matrix used for characterizing sequential motifs [11]. We used the Biostrings, GenomicAlignments, tidyverse packages in R for the .bam files [12–14]. Essentially, the genome matrix made a .csv file from the .bam files, where each row contained the counted number of nucleotides aligned at the given positions.

Sources of the real data

In addition to the simulations, we also carried out the comparison on real data. For this, we had to find samples for which the raw FASTQ file can be found in the NCBI-SRA database [15] and the consensus in the GISAID database [16] as shown in Table 2. We selected the reads in such a way that there should be at least 20-fold coverage in at least 0.95 of the genome. Additionally, we worked with paired-end sequencing WGS data.

From the SRA, we aligned the reads downloaded with the SRA Toolkit prefetch to the reference COVID genome using the bowtie2 tool [17]. We use the bowtie2-build command to index the reference genome, then we align the reads (paired-end) and thus obtain .sam files. The samtools view -Sb command converts the .sam file into binary .bam format, which we similarly sorted and then indexed as in the case of synthetic reads. The genome matrix was also obtained here based on the .bam files by counting the nucleotides per position.

Comparing the effectiveness of the two methods

The question was to determine whether the relationships or similarities between the samples could be better determined by comparing consensus sequences or by comparing genome matrices. For this, we created distance matrices between the consensus sequences for the eight samples, similarly for the genome matrices. We first aligned the consensus sequences with each other in R using AlignSeqs(), and to visualize this we used BrowseSeqs(). Then, we determined the distance matrix for this with the DECIPHER DistanceMatrix() function [18]. We used the default parameters, that is, we calculated with Euclidean distance. We determined the distance matrix from the genome matrices in Python using the scipy.spatial distance.euclidean() function [19]. The $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ D_{i,j} $\end{document}$ matrix element contains the distance between the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ i $\end{document}$ -th and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ j $\end{document}$ -th genome matrices. We use Euclidean distance as a metric. Accordingly, the distance between two matrices, i.e. an element of the distance matrix, is obtained with the following formula:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& D_{i,j} = \sqrt{\sum_{k=1}^{4} \sum_{g=1}^{l} (G_{k,g}^{(i)}-G_{k,g}^{(j)})^{2}}\end{align*}\end{document}

In the above formula, the notations $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ G_{k,g}^{(i)} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ G_{k,g}^{(j)} $\end{document}$ represent the values of the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ i $\end{document}$ -th and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ j $\end{document}$ -th genome matrices at the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k $\end{document}$ -th base and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ g $\end{document}$ -th position.

We also created a distance matrix for the true compositions, the result of which also became an 8x8 matrix. This way, we obtained three 8x8 distance matrices. However, before comparing these, we normalized them. We calculated the Frobenius norm [20] as follows:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& \left\| A \right\|_{F} = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^{2}},\end{align*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ A $\end{document}$ is a matrix of size $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ m \times n $\end{document}$ , and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ a_{ij} $\end{document}$ are the elements of the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ i $\end{document}$ -th row and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ j $\end{document}$ -th column of matrix $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ A $\end{document}$ .

Comparison of the distance matrices

For the comparison of the different methods, we used the distance matrices as a basis. Firstly, we determined the distance (Euclidean) between the distance matrices pairwise, as shown below.

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& d(A,B) = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m}(A_{ij}-B_{ij})^{2}},\end{align*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ A $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ B $\end{document}$ are two $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ n \times m $\end{document}$ dimensional distance matrices, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ A_{ij} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ B_{ij} $\end{document}$ are elements in the same position.

Secondly, we determined the Pearson correlation between the distance matrices pairwise, which measured the linear relationship between the data, but not necessarily the causal relationship. In Python, we imported the necessary packages using the command from scipy.stats import pearsonr, and calculated the degree of correlation between each pair of distance matrices [19]. However, before that, we ‘flattened’ the matrices for analysis using the flatten() function. The pearsonr function not only provides the correlation coefficient but also a number called a ‘P-value’, which indicates the probability of whether the correlation between the two samples is real or just a result of chance. In addition, we visualized the relationships expressed by the distance matrices with heat maps and dendrograms. For the heat maps, we used the Python seaborn and matplotlib packages. We used the dendrogram and linkage functions from the scipy.cluster.hierarchy module to create dendrograms, based on the already prepared distance matrices.

The role of entropy in the search for polymorphisms

In a large portion of the genome, there is a nucleotide that occurs significantly more frequently at a given position than the others [21]. In these cases, the frequency of the other nucleotides is typically nonzero, possibly due to statistical or sequencing errors. It is crucial to note the distortion when normalizing the genome matrix, assigning relative frequencies to each base. Therefore, a variant with high AF at a highly covered position is characterized identically to potential noise at a less covered spot. Hence, there is a need for deep sequencing data to distinguish noise from low AF variants. A valuable adjunctive method involves generating an entropy sequence for the genome matrix, providing the ‘information’ content in different parts of the sequence. The entropy of the genome sequence at certain positions indicates how varied the bases are, i.e. how likely we are to encounter a given base at that position. Calculating entropy helps identify conservative and diverse areas of the genome sequence. Entropy can be calculated using the following formula:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& H = - \sum_{i} p_{i} \log_{2} p_{i},\end{align*}\end{document}

where H is the entropy, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ p_{i} $\end{document}$ is the probability (frequency) of the i-th base at a certain position.

In the present case, the entropy can be calculated as follows:

First, calculate the probability of occurrence (normalized frequency) of each base along the rows, i.e. at every position:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{align*}& p_{i} = \frac{n_{i}}{\sum_{j} n_{j}},\end{align*}\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ n_{i} $\end{document}$ is the number of occurrences of the i-th base, and summation is carried out over all possible bases ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ j $\end{document}$ ). We already had these values during the normalization of the genome matrix.Calculate the entropy based on normalized frequencies using (4).

Results and discussion

Simulation

We denoted $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ d(D_{GM}, D_{cons}) $\end{document}$ as the distance between the distance matrix of the genome matrices and the distance matrix of the consensuses. $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ d(D_{GM}, D_{true}) $\end{document}$ represents the distance between the distance matrix of the genome matrices and the distance matrix of the true compositions per sample. Finally, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ d(D_{true}, D_{cons}) $\end{document}$ denotes the distance between the distance matrix of the true compositions and the distance matrix of the consensus sequences. Using these notations, the results is summarized in Table 3.

It can be seen that in terms of distances, the reality is better captured by the genome matrix than the consensus-based approach.

As seen in Figure 2, compared with the true compositions, the consensus approach oversimplifies the results. Seeing the composition of the outlier sample based on Table 1, we can see that in this sample, the variant with the fantasy name perfume is dominant. The variant definitions show that there are two deletions among the mutations defining the variant, in addition to the SNPs. Their locations are $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ 11287 $\end{document}$ (G after deletion of GTCTGGTTTT), and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ 22298 $\end{document}$ (A after deletion of AGAAGTTATTTGACTCCTGGTG). Variants that did not contain deletions were cheesy, glorious. Hence, it logically follows that sample 6 manifests the most significant deviation from those samples (2,4,7,8) in which these deletion-free variants were predominantly observed. Only in the case of sample 6 did the deleted section make it into the consensus sequences. This is clearly visible at the locations of deletion in Figure 3.

Visualization of the three approaches. The numbers on the axes refer to the order of the simulated sequencing samples listed in Table 1.

Deletions around the genome marked with ‘-’ in purple

Despite every sample containing a variant that had one of the deletions alongside the SNPs, this only appeared in the consensus sequence of sample 6. The reason for this is that this variant was present with a sufficiently high AF in that sample. In the case of the genome matrix approach, as we could see on the heatmaps, sample 6 was less of an outlier compared with the others. From this, we concluded that the genome matrix is able to take into account deletions with lower AF during comparison.

A dendrogram is a tree-structured diagram that applies hierarchical clustering as seen in Figure 4. In this case, the synthetically created mixed samples are placed on the horizontal axis (x-axis), while the distances can be seen on the vertical axis (y-axis). From Figure 4 we can make the following observations.

According to the dendrogram of the true compositions, two main clusters can be observed. This is understandable if we look at the known composition ratios (Table 1), as in the case of samples 6 and 8, one variant accounts for more than 0.8 parts of the sample.According to the true compositions diagram, despite their differences, the samples are at most 0.4 distance from each other. The genome matrix approach slightly amplified this, while the consensus-based procedure nearly doubled it.The genome matrix approach performs much better than the consensus-based approach, according to the dendrograms, if a dominant variant contains a deletion segment. In the case of samples 7 and 3, such a variant is dominant in the sample. The consensus-based approach ‘leveled out’ the differences in these samples, resulting in a falsely decreased distance.The genome matrix and the consensus-based approach provide similar results in the case where the dominant variants in the samples do not contain a deletion and the other variants are relatively evenly represented. Sample 2 was also like this.When a variant containing a deletion segment is dominant, and the other variants (with or without deletions) occur in the sample with a low AF, the consensus method distorts the dendrogram. This is particularly noticeable in the case of sample 6, where the genome matrix does not show such early branching, which is in line with our true composition observations.

Visualization of the dendrograms of the genome matrix, consensus-based approach, and the true composition’s distance matrices. It can be seen that compared with the true composition, the consensus-based approach amplifies or smoothens the differences, whereas based on genome matrices, we do not obtain such outliers in the relationships.

Real data

We downloaded prepared FASTA files from the GISAID database, so it was not necessary to create a consensus sequence from the raw reads as in the case of simulation data. As the sequence lengths in GISAID do not necessarily match, we first performed the MSA on the 17 sequences and the reference genome in R before determining the distance matrix. Both distance matrices were normalized in a similar way for comparability, using Frobenius normalization. Subsequently, we determined the distance and Pearson correlation coefficient of the two distance matrices, similar to the simulations, with the difference that now we did not have the ‘ground truth’ composition distance matrix available in Table 4.

Comparing this with the simulation results, we see that the distance, in this case, is a tenth smaller, but the Pearson correlation between the two matrices is similar. To get a more precise picture of the relation between the two methods, we also compared the two distance matrices using heatmaps (Figure 5).

Visualization of the similarity of real data using heatmaps.

We saw a similar pattern in the simulation results (Figure 2) as in Figure 5. There was also a tendency that distances were smaller in the case of genome matrices, thus the differences between individual samples. On the other hand, in the consensus-based approach, some samples were very separated from others. According to the lesson of the simulation, this was due to variants containing dominant deletions. In Figure 5, a standout sequence (16th sample: EPI_ISL_665636, ERR4892461) can be seen based on the distances from the consensus-based approach.

Figure 6 contains several deletion sections (5), only one of which is present in other samples. Furthermore, the other samples do not contain a deletion section that can only be detected in it.

Deletions along the EPI_ISL_665636 genome marked with ‘-’ in purple.

We also looked at these regions based on the genome matrix results. Two trends seemed to be moving on the chart, so we made 2 columns in Figure 7. It is observable that the genome matrix methodology also detects comparable deletion sections, like the consensus-based approach. However, it is important to note that the frequency of nucleotide hits at these locations does not reduce to a complete zero. This is well visible in the right column as well. In the positions of deletion, an opposite trend appears in this column. In these positions, the number of individual nucleotide hits increases among the lower number of hits. This can also mean that in addition to mis-sequencing, a significant increase can be found in these positions in terms of ‘secondary’ nucleotides. There will not be much difference between the most common and the second most common nucleotide, as each is relatively low compared with the genomic environment.

Zooming into the deletion sites within the ERR4892461 genome. In the left column, it can be seen that ‘holes’ appear in the number of nucleotides at the deletion sites. On the right, zooming into these positions, we can see that there are also alternative nucleotides here, which however, stand out from the noise of the non-deletion sites.

This may mean that in addition to the variant containing a large AF deletion, there are several variants with smaller AFs and these genomic positions are more variable in the sample. However, before we search for these high entropy genomic regions from an information theoretical point of view, we examine the relationship between the samples with dendrograms on Figure 8.

The 16th sample greatly differs from the others, but according to the genome matrix, it shows similarity with several samples. The consensus-based approach organizes the samples into fewer clusters. Accordingly, we concluded here, similar to what was observed in simulations where the true composition was known, that the representation of variability decreases when using consensus.

Entropy sequences

When detecting deletions, and plotting the total number of nucleotides, we get similar figures as in Figure 7. In such instances, it is easier to notice low AF variants because they suddenly become dominant as the variant with the highest AF contains some mutation. In the case of normalized genome matrices, these low AF variants also appear, but in a distorted form, since these are relative ratios. Therefore, we used the concept of entropy for a more effective exploration of polymorphisms along the genome, as this method was applicable regardless of alignment. Entropy allows for the measurement of variability and distribution of nucleotides within certain sections, making it easier to identify low AF variants. These could potentially be linked to key features of the examined genome.

For the real data, a low entropy value indicates that there is little variability at a given position, and the bases are likely to be the same with a high probability. High entropy values indicate that the bases are diverse and their probabilities are more evenly distributed. Examining entropy values allows for the mapping of the conservative and diverse areas of the sequence. Our data clearly show that for sample 16, around 11 300, a higher entropy (higher information content) section is indeed noticeable, which is present in this sample to this extent only. This defining mutation might also be present in others, though this is not clear from the consensus. However, according to the information sequences formed from genome matrices, it is there with low AF. The increased entropy along the genome section in question indicates this.

Figure 9 shows that the entropy of the ERR4693014 sample is outstandingly high throughout the genome compared with the other genomes. The reason for this could be that it likely contained a mixture of several variants in almost equal measure at the time of sequencing. In this case, even if several strains are present, the consensus sequence will be a single one, making it understandable why it is located so far from the ERR4892461 sample on dendrograms (Figure 8). In the latter case, the distribution of information content along the genome is less uniform. This could indicate that a dominant strain was present in the sample, whose consensus sequence differed from that of the ERR4693014 sample, which uniformly incorporated a mixture of many strains.

Entropy sequences along the entire genome: visualization of the information content of genomes.

In the simulation, we utilized complete knowledge of sample compositions and mutations. We compared consensus sequences and genome matrices, both reliant on Euclidean distance matrices. After Frobenius normalization, we found the genome matrix method superior to the consensus-based approach.

The key takeaways are the following (Table 5):

The genome matrix’s distance matrix was more alike the ‘ground truth’, hence more realistic.The Pearson correlation between the genome matrix and the true composition distance matrix was higher than that with the consensus, indicating a stronger connection between initial and post-processing data.Visualization via heatmaps (Figure 2) showed that the consensus approach overemphasizes inhomogeneities in mixing ratios, potentially skewing phylogenetic tree generation. No similar distortion was observed in genome matrices.

High entropy denotes high variability, indicating a surprising or high-information content base type. Visualizing entropy values unveils a base entropic background due to sequencing errors or statistical noise. Polymorphisms, with their peaks rising in high entropy, differentiate from this background. The method’s advantage lies in its applicability to various mutation types, be it SNP or deletion/insertion.

Conclusion and future perspectives

This method enables efficient identification of genome anomalies, with the genome matrix approach playing a pivotal role. This allows for more realistic sample comparisons even without precisely determining specific variants and their ratios. While our findings underscore significant advances in capturing the extensive diversity of viral genomes, there are notable limitations to consider. The genome matrix method involves processing all nucleotides at each genomic position, which significantly escalates the complexity and computational demands. This complexity not only requires more robust computing infrastructure but also increases the sensitivity to noise from low-frequency variants, which might be artifacts rather than true mutations. Such challenges necessitate advanced filtering techniques and can complicate the analysis further.

Despite these challenges, the genome matrix method offers substantial prospects for advancing our understanding of viral genetics. Its ability to provide a detailed view of the viral genome is crucial for accurately tracking viral evolution and mutation trends. This method could be adapted beyond SARS-CoV-2 to other viruses and even broader applications in microbial and complex organism genomics.

Looking ahead, one of the most promising extensions of this research is the potential to predict virulence variants. Virulence variants significantly impact the pathogenicity of viruses; thus, their early detection is crucial for managing epidemics and informing therapeutic strategies. The genome matrix method, with its detailed capture of genomic information, is ideally positioned to identify mutations associated with changes in virulence. Moreover, this is particularly important as it allows for the matrix generated from sequencing data of a clinical sample to potentially play a prognostic role. However, realizing this potential requires integrating the genomic data with predictive modeling techniques, such as machine learning, which can analyze mutation patterns and correlate these with phenotypic outcomes. This approach, while promising, hinges on the availability of comprehensive datasets that link genomic data with detailed clinical outcome.

Supplementary Material

jabbrv_bbae296

naturemag-doi_bbae296

Bibliography21

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Mardis ER . Next-generation dna sequencing methods. Annu Rev Genomics Hum Genet 2008;9:387–402. 10.1146/annurev.genom.9.081307.164359.18576944 · doi ↗ · pubmed ↗
2Miller JR , Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics 2010;95:315–27. 10.1016/j.ygeno.2010.03.001.20211242 PMC 2874646 · doi ↗ · pubmed ↗
3Caraballo-Ortiz MA , Miura S, Sanderford M. et al. Tophap: rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity. Bioinformatics 2022;38:2719–26. 10.1093/bioinformatics/btac 186.35561179 PMC 9113349 · doi ↗ · pubmed ↗
4Gribskov M , Veretnik S. Identification of sequence patterns with profile analysis. In: Methods in Enzymology. Academic Press, 1996, Vol. 266, pp. 198–212.8743686 10.1016/s 0076-6879(96)66015-7 · doi ↗ · pubmed ↗
5Guang A , Howison M, Ledingham L. et al. Incorporating within-host diversity in phylogenetic analyses for detecting clusters of new hiv diagnoses. Front Microbiol 2022;12:803190. 10.3389/fmicb.2021.803190.35250908 PMC 8891961 · doi ↗ · pubmed ↗
6UK Health Security Agency . Standardised variant definitions. Git Hub. https://github.com/ukhsa-collaboration/variant_definitions (24 May 2023, date last accessed).
7Bushnell B . BB Map: a fast, accurate, splice-aware aligner. Tech Rep. Berkeley, CA (United States): Lawrence Berkeley National Lab (LBNL), 2014. Source Forge. https://sourceforge.net/projects/bbmap/.
8Li H . Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.ar Xiv preprint ar Xiv:1303.3997. 2013. 10.48550/ar Xiv.1303.3997. · doi ↗

sample	cheesy	glorious	mushroom	perfume	slinky
1	50	100	150	200	250
2	250	50	50	10	10
3	30	20	100	150	70
4	10	200	40	30	10
5	100	100	100	100	100
6	10	20	15	500	5
7	40	120	0	10	200
8	30	500	70	0	20

sample	cheesy	glorious	mushroom	perfume	slinky
1	50	100	150	200	250
2	250	50	50	10	10
3	30	20	100	150	70
4	10	200	40	30	10
5	100	100	100	100	100
6	10	20	15	500	5
7	40	120	0	10	200
8	30	500	70	0	20

sample	cheesy	glorious	mushroom	perfume	slinky
1	50	100	150	200	250
2	250	50	50	10	10
3	30	20	100	150	70
4	10	200	40	30	10
5	100	100	100	100	100
6	10	20	15	500	5
7	40	120	0	10	200
8	30	500	70	0	20