A novel and efficient feature extraction algorithm using kmer-derived mutation signal

JingJing Zhang; XinGong Zhang; Jianwen Huang; RunBin Tang

PMC · DOI:10.7717/peerj.20940·March 11, 2026

A novel and efficient feature extraction algorithm using kmer-derived mutation signal

JingJing Zhang, XinGong Zhang, Jianwen Huang, RunBin Tang

PDF

Open Access

TL;DR

This paper introduces a new genomic feature extraction algorithm using kmer-derived mutation signals to identify biologically significant genomic regions and improve classification and phylogenetic analysis.

Contribution

A novel feature extraction algorithm using kmer-derived mutation signals to identify biologically significant genomic anchors and weighted interval entropy.

Findings

01

The proposed algorithm outperforms position and frequency-based kmer methods in phylogenetic tree construction and machine learning classification.

02

Anchor analysis in EBOV and HCV genomes reveals mutation types and directions, useful for diagnosing viral evolution.

03

Weighted interval entropy effectively characterizes cumulative information content of genomic anchors.

Abstract

Effective algorithms for extracting genomic features are crucial for downstream bioinformatics research. Although kmer-based descriptive statistical features (frequency and position) offer a unique perspective for genomic feature extraction, their biological significance warrants further investigation. Therefore, uncovering the biological significance of kmers in genomes remains a significant challenge in current kmer-based genomic feature mining. In this work, based on reverse, complementary and reverse-complementary of kmers to simulate genomic accumulation mutation behavior, we proposed a novel feature extraction algorithm. By examining the elastic length region following each kmer for the presence of any of these three scenarios, kmers of potential biological significance are identified and defined them as anchors of sequence. To characterize cumulative information content of anchor…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species2

Ebola virus(no rank)HCV

Diseases1

Ebola

Figures5

Click any figure to enlarge with its caption.

The flowchart of identifying anchor and the weighted interval entropy.Anchor selection flowchart. (A) Analyzing three types of mutations. (B) Three types of searching anchor points. (C) Adopting a flexible setting in the process of anchor recognition. (D) The weighted entropy \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $b{w_\alpha }$\end{document}bwα for an anchor \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $\alpha$\end{document}α was defined as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ${w_\alpha } \times {b_\alpha }$\end{document}wα×bα, where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ${b_\alpha }$\end{document}bα is the interval entropy and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ${w_\alpha }$\end{document}wα is a weight based on the interval of the anchor.

(A–D) Trend chart of the control variables for kmer length and elastic region length.The relationship between parameters k, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $\varepsilon$\end{document}ε, and the total entropy value.

The result of phylogenetic analysis at HCV.Phylogenetic tree of HCV virus based on weighted interval entropy of anchor with parameters set to (4, 13).

The result of phylogenetic analysis at Ebola.Phylogenetic tree of Ebola virus based on weighted interval entropy of anchor with parameters set to (4, 13).

Machine learning (SVM) flowchart.The sequence feature vectors extracted by various alignment-free algorithms were used as input for machine learning, and supervised learning was performed.

Funding4

—National Natural Science Foundation of China
—Science and Technology Research Program of Chongqing Municipal Education Commission
—Chongqing Normal University Foundation Project
—Nature Science Foundation Project of Chongqing Science and Technology Bureau

Keywords

Phylogenetic analysisMutation signal simulationKmerWeighted interval entropy

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsFractal and DNA sequence analysis · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies

Full text

Introduction

Genome feature extraction is an important part of downstream bioinformatics research. When studying bioinformatics problems, efficient and concise genome features will make the constructed model interpretable and scalable. Extracting genome features based on continuous base fragments of k characters (kmer) can avoid the memory crisis and resource waste caused. The descriptive statistical characteristics of kmer in genomes have been widely used to extract genomic features, such as position (Ma et al., 2020) and frequency (Manekar & Sathe, 2018). Furthermore, the spacing signal of kmer in genome has also been used to explore uniqueness of genome. For example, by integrating spaced seeds into k-mers, a new and effective method for calculating genotypes was proposed (Häntze & Horton, 2023). PEER (Mitra, Bhattacharyya & Mukhopadhyay, 2020) mined feature of sequence based on waiting entropy of kmer, and conducted sequence classification research. KINN (Tang, Yu & Li, 2023) extracted features of sequence using internal distances between consecutive combinations of short characters to perform similarity comparisons between sequences.

Differences of molecular sequences are often directly due to internal and external environmental influences (Koonin, Dolja & Krupovic, 2022), such as the evolution of viruses. Virus is an important class of non-cellular organisms. Currently, viral species exhibit extremely high richness and diversity, driven by natural selection and directed mutation. In fact, once host is infected with virus, its immune system is activated to combat the invasion and suppress viral mutations. However, viruses employ targeted mutations to evade host immune responses, which in turn alter their genetic material (Koonin et al., 2020; Sanjuán & Domingo-Calap, 2016). Due to genomic mutations, structure of sequence is forced to change, leading to functional alterations in the mutated region (Domingo, 2016; Jia, Xu & Li, 2021), such as transmissibility, lethality and so on. Within the genome, mutations directly result in base duplications, insertions or deletions at the mutation site (Gordon & Halliday, 1995). Consequently, cumulative duplications of sequence segments are often found in genomes, and this tandem structure of sequence is used as a feature of sequence similarity, such as TRIM (Bhattacharyya, Mitra & Bhattacharyya, 2021). Furthermore, mutations break existing chemical bonds between bases, new chemical bonds are created to repair these bonds by indels of bases, which are also used to indicate tumor mutation behavior. For example, Ambigram (Li et al., 2023) simulated the progress of mutation and evolution of tumor, based on breakage-fusion-bridge events in the mutation process of tumor genomes.

In this work, we develop anchor, a novel genomic feature extraction algorithm, based on molecular cumulative mutation behavior and kmers, utilizing reverse complementary, reverse and self-copying of strings, to model different patterns of genomic mutational accumulation. In order to more effectively represent the cumulative mutation behaviors, we add the behaviors of insertion and deletion in the above several viral variants, i.e., an elastic region is added to the recognition process, which facilitates the recognition of more anchors. A kmer is identified as an anchor, if it possesses any of the three aforementioned properties within its subsequent elastic region. In fact, the number of anchors is much smaller than the number of kmers. To mitigate the problem of insufficient feature extraction due to a reduced number of anchors, we integrated the anchor interval signal entropy and defined penalty weight, based on its position in genome. In performance evaluation phase, we directly and indirectly tested the superior performance of the features on different viral dataset, by utilizing the based-anchor feature vectors to construct a phylogenetic tree and as input for machine learning, respectively. In our case study analysis, we selected the highly virulent Ebola pathogen and the common hepatitis C virus (HCV) virus to diagnostically profile their accumulated mutation patterns. By analyzing anchors in Ebola and HCV viruses of different types and time periods, we found that the continuous appearance and positional changes of anchors can be used to indicate the content and direction of accumulation of viral mutations.

Materials and Methods

Data sets

In this article, all viral sequences were obtained from NCBI (https://www.ncbi.nlm.nih.gov/). The size of the analysed dataset, the number of sequence types, and the average sequence length are presented in Table 1. The accession numbers for the sequences are detailed in the Supplemental Materials.

Table 1: Description of virus datasets used in the experiments.

Dataset 1: For phylogenetic tree reconstruction, we investigated the relationship between sequence anchors and mutations in the hepatitis C virus (HCV) and the Ebola virus. HCV, classified into genotypes 1–6, can lead to severe liver disease, including cirrhosis and hepatocellular carcinoma. Ebola virus comprises five species: Ebola-Zaire (EBOV), Sudan (SUDV), Bundibugyo (BDBV), Ta Forest (TAFV), and Reston (RESTV), with RESTV not pathogenic in humans. The data set has also been used in previous studies (Wang, Yu & Li, 2024; He et al., 2020). The detailed data can be found in Tables S1 through S4 within the Supplemental Materials.

Dataset 2: For machine learning analyses, we used a separate dataset as specified below. The detailed data can be found in Tables S9 through S1.

Definition of anchor point

Suppose DNA sequence is represented by $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ S = {S_{1}}{S_{2}} \cdots {S_{L}} $\end{document}$ , and L is the length of the sequence. Then for any $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ i \in { 1,2, \cdots ,L} $\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {S_{i}} \in \sigma = { A,C,G,T} $\end{document}$ . The $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ kmers $\end{document}$ of a sequence can be obtained from retrieving the sequence, based on a fixed-length ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k $\end{document}$ ) sliding window with a sliding step of 1. For a fixed $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k $\end{document}$ , there are $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\left| \sigma \right|^k} $\end{document}$ possible forms of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ kmer $\end{document}$ , theoretically. But in fact, a sequence of length L has at most $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ L-k + 1 $\end{document}$ kinds of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ kmer $\end{document}$ . In the process of viral mutation, bases near the mutation site are duplicated and rearranged, and recorded in the genome (Li et al., 2023). Based on the copied and rearraged behavior, we categorize the mutation into three cases with kmer: reverse complement, reverse order and replication. Replication of kmer is occurrences of identical-length string (denoted as ks) after the kmer, when $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ ks = kmer $\end{document}$ . Reverse order of kmer is that ks is completely reversing the order of kmer, e.g., the reverse order of ATC is CTA. The reverse complement of kmer is generated by reversing and pairing, according to the principle of strict complement (A–T, C–G), e.g., the reverse complement of AAC is GTT. If a kmer belongs to one of the three mutation types above, the kmer is defined as an anchor. For example, a fragment of sequence is ACCGGT, kmer = ACC is an anchor, when k = 3.

In addition, insertion and deletion also occur frequently in virus mutation. So, we adopt a flexible setting in the process of anchor recognition. We increase the length of the region that identifies anchor. As shown in Fig. 1C, some bases can be inserted between kmer and ks. Anchors of sequence can be more recognized by increasing the elasticity setting, the parameter notation of the elasticity setting is denoted $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \varepsilon $\end{document}$ . So the length of region of recognizing anchor is less than or equal to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k + \varepsilon $\end{document}$ after kmer. Thus, elasticity setting is realistic in discovering small changes of sequence and conducting analysis of cumulative mutation in sequence.

The flowchart of identifying anchor and the weighted interval entropy.Anchor selection flowchart. (A) Analyzing three types of mutations. (B) Three types of searching anchor points. (C) Adopting a flexible setting in the process of anchor recognition. (D) The weighted entropy \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $b{w_\alpha }$ \end{document}bwα for an anchor \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $\alpha$ \end{document}α was defined as \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ${w_\alpha } \times {b_\alpha }$ \end{document}wα×bα, where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ${b_\alpha }$ \end{document}bα is the interval entropy and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} ${w_\alpha }$ \end{document}wα is a weight based on the interval of the anchor.

Sequence feature based on weighted interval entropy of anchor

Let $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ represent an anchor in sequence S. We denote the ordered occurrences of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ in S as $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\alpha {1}},{\alpha {2}}, \cdots $\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\alpha {N(\alpha ,s)}} $\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ N(\alpha ,S) $\end{document}$ stands for the total number of occurrences of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ in S, and we require $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ N(\alpha ,S) > 1 $\end{document}$ . Then $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {\alpha {i}} $\end{document}$ denotes the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ i $\end{document}$ -th occurrence of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ in S. We denote $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {P\alpha }(S) $\end{document}$ as the position vector of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ in S, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {P\alpha }\left( S \right) = \left( {{P{1}},{P{2}}, \cdots ,{P_{N(\alpha ,s)}}} \right) $\end{document}$ . Let $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {I_\alpha }(S) $\end{document}$ be the interval vector between consecutive anchors, i.e.,

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $${I_\alpha }\left( S \right) = \left( {{I_{1}},{I_{2}}, \cdots ,{I_{N(\alpha ,S) - 1}}} \right) = \left( {{P_{2}} - {P_{1}}, \cdots ,{P_{N(\alpha ,s)}} - {P_{N(\alpha ,s) - 1}}} \right),$$\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {I_{i}} $\end{document}$ is the i-th interval of the anchor point. Further normalize each interval, i.e.,

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $${p_{i,\alpha }} = {{{I_{i}}} \over {\sum\nolimits_{i = 1}^{N(\alpha ,s) - 1} {{I_{i}}} }},$$\end{document}

Thus, the normalized vector of the anchor points $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ is

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$p(\alpha ) = \left( {{p_{1,\alpha }},{p_{2,\alpha }}, \cdots ,{p_{N(\alpha ,s) - 1,\alpha }}} \right),$$\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \sum\nolimits_{i = 1}^{N(\alpha ,s) - 1} {{p_{i,\alpha }}} = 1 $\end{document}$ . Finally, we use the Shannon entropy (Shannon, 1948) to define interval entropy of the anchor point $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ in the sequence S:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $${b_\alpha } = - \sum\limits_{i = 1}^{N(\alpha ,s) - 1} {{p_{i,\alpha }}} \times {\log _{2}}{p_{i,\alpha }} = - \sum\limits_{i = 1}^{N(\alpha ,s) - 1} {{{{I_{i}}} \over {\sum\nolimits_{i = 1}^{N(\alpha ,s) - 1} {{I_{i}}} }}} \times {\log _{2}}{{{I_{i}}} \over {\sum\nolimits_{i = 1}^{N(\alpha ,s) - 1} {{I_{i}}} }}.$$\end{document}

On the other hand, if the last position of anchor point is far from the end of the sequence, interval entropy of the anchor should be constrained. Therefore, we assign weight to the anchor interval entropy. The weight is defined as follows:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $${w_\alpha } = {{L - {P_{N(\alpha ,s)}}} \over L} \times {\log _{2}}({{L - {P_{N(\alpha ,s)}}} \over L}),$$\end{document}

If the anchor point is exactly at the end of the sequence, then $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ L - {P_{N(\alpha ,s)}} = 2k + \varepsilon $\end{document}$ , for the general situation $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ L - {P_{N(\alpha ,s)}} \ge 2k + \varepsilon $\end{document}$ , and so $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ L - {P_{N(\alpha ,s)}} \ne 0 $\end{document}$ . On the other hand, since $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ N(\alpha ,S) > 1 $\end{document}$ , then $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {P_{N(\alpha ,s)}} \ne 0 $\end{document}$ , and then

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $${{L - {P_{N(\alpha ,s)}}} \over L}\lt1.$$\end{document}

Anchor points located near the start or end of the sequence have smaller $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {w_\alpha } $\end{document}$ values, indicating their relatively lower contribution. In other words, if the number of anchor points is abundant or scarce, its weighted interval entropy is low, which is consistent with the specificity of mutation. The weighted interval entropy $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ b{w_\alpha } $\end{document}$ of the anchor $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ is defined as follows:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$b{w_\alpha } = {w_\alpha } \times {b_\alpha },$$\end{document}

$\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ b{w_\alpha } $\end{document}$ is a quantitative indicator of the importance of the anchor point $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ within the entire sequence.

Sequence numericalization based on anchor

For a sequence S, the weighted interval entropy of all anchors in sequence can be calculated based on weighted interval entropy bw described above. However, not all anchor in S are identified as anchors in other sequences, so the bw value of kmer, which is not considered anchors in other sequence, is defined as 0. Suppose there are n anchor types in the dataset. The kmer are sorted alphabetically and mapped to integers from 1 to n, and the sequence S is transformed into a numerical sequence V:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$V = \left( {b{w_{1}},b{w_{2}}, \cdots ,b{w_{n}}} \right).$$\end{document}

Parameter selection

When extracting anchors, the length of kmer and the elasticity length $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \varepsilon $\end{document}$ need to be set in advance. If the parameter k is too large (or too small), the number of occurrences of kmer will be relatively small (or large). Moreover, according to the definition of anchor, k should not be set too large. So, we define the total contribution value of the anchor $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ H(k,\varepsilon ) $\end{document}$ , to fine-tune two parameters.

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$H(k,\varepsilon ) = \sum\limits_{i}^{{4^k}} b {w_{i}}.$$\end{document}

Then, the relationship between $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ H(k,\varepsilon ) $\end{document}$ and the two parameters is discussed separately, based on the idea of control variate. We construct the relationship diagram (Fig. 2) between variation of the total entropy value and parameter, in all sequences of Ebola and HCV viruses, respectively. In Figs. 2A and 2C, by controlling the value of k, it is found that $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ H(k,\varepsilon ) $\end{document}$ is maximum when $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k = 4 $\end{document}$ . In Figs. 2B and 2D, by controlling the size of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \varepsilon $\end{document}$ , the trend of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ H(k,\varepsilon ) $\end{document}$ when $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k \le 3 $\end{document}$ is opposite to that of H when $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k > 3 $\end{document}$ . It is caused with the number of anchor points: when k is small, the kmers appear more frequently; So when $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \varepsilon $\end{document}$ increase, the probability of kmer becomes anchor will greatly increased. Meanwhile, when the length of anchor $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \alpha $\end{document}$ interval vector is larger, the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {w_\alpha } $\end{document}$ becomes small. So $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ H(k,\varepsilon ) $\end{document}$ shows a decreasing trend when $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k \le 3 $\end{document}$ . From Figs. 2B and 2D, we can observe that when $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k \ge 4 $\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ H(k,\varepsilon ) $\end{document}$ increases with the value of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \varepsilon $\end{document}$ . When $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k = 4 $\end{document}$ in Fig. 2B, the slope of the curve is gradually flatter at $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \varepsilon = 13 $\end{document}$ . Hence, we set the parameters k and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \varepsilon $\end{document}$ to 4 and 13, respectively.

(A–D) Trend chart of the control variables for kmer length and elastic region length.The relationship between parameters k, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $\varepsilon$ \end{document}ε, and the total entropy value.

Results

Phylogenetic analysis based on anchors

To evaluate the performance of our algorithm, we conducted phylogenetic tree analysis. Any two sequences, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {S_{1}},{S_{2}} $\end{document}$ , can be converted into numerical vectors $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {V_{1}},{V_{2}} $\end{document}$ of equal dimension proposed method. The cosine distance of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {V_{1}},{V_{2}} $\end{document}$ is then computed as a measure of dissimilarity:

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $$\cos \left( {{V_{1}},{V_{2}}} \right) = {{{V_{1}} \cdot {V_{2}}} \over {\Vert{V_{1}}\Vert \cdot \Vert{V_{2}}\Vert}},$$\end{document}

where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {V_{1}} \cdot {V_{2}} $\end{document}$ denotes the dot product of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {V_{1}} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {V_{2}} $\end{document}$ , $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \Vert{V_{1}}\Vert $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \Vert{V_{2}}\Vert $\end{document}$ are mode of $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {V_{1}} $\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ {V_{2}} $\end{document}$ , respectively.

In this work, phylogenetic analysis of HCV and Ebola in Dataset 1 were performed based on the sequence distance matrices respectively. The sequence distance matrix was used as input to Mega X (Kumar et al., 2018), and the phylogenetic tree was generated using the neighbor-joining method (NJ) (Saitou & Nei, 1987). An online tool, iTol (Letunic & Bork, 2021), was used to beautify the phylogenetic tree for a more friendly visual presentation.

A total of 82 HCV sequences were analyzed and the parameters $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k,\varepsilon $\end{document}$ were set to 4, 13 respectively. The results are shown in Fig. 3, and all six subtype groups of HCV genome sequences are clustered distinctly and in different branches, and the six subtypes are also correctly classified.

The result of phylogenetic analysis at HCV.Phylogenetic tree of HCV virus based on weighted interval entropy of anchor with parameters set to (4, 13).

When analyzing the classification problem for 59 Ebola sequences, the parameters $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ k,\varepsilon $\end{document}$ were also set to 4 and 13, respectively. The results, as shown in Fig. 4. The five types of Ebola virus can be clearly and efficiently distinguished from each other in the figure. In addition, the sequences in type EBOV, with three different collected time, can be clustered based on time. The above results suggest that the characterization of anchors can carve out some of the mutational order of the viral sequences.

The result of phylogenetic analysis at Ebola.Phylogenetic tree of Ebola virus based on weighted interval entropy of anchor with parameters set to (4, 13).

Performance evaluation based on machine learning

To demonstrate the ability of feature vectors to express the uniqueness of sequences, we conducted performance evaluation of the feature extraction methods, based on the SVM algorithm in classification machine learning. In the experiments, we directly imported the SVM function from Scikit-learn library in Python 3.8. The model used an RBF kernel with parameters: C = 6, gamma = “scale”, class weight = “balanced”. We set decision function shape to “ovr”, random state to 42 and max_iter to 2,000 for reproducibility and convergence. All datasets were split into training and test sets in an 8:2 ratio, and all experiments were conducted on Intel Core i5-1135G7 Quad-Core Processor and 8 GB of RAM. During feature extraction, the parameter k was set to 4. The frequency feature of kmers were normalized before inputting. We performed five-fold cross-validation to minimize the variability in evaluation due to the randomness of dataset splitting. The experimental procedure is illustrated in Fig. 5.

Machine learning (SVM) flowchart.The sequence feature vectors extracted by various alignment-free algorithms were used as input for machine learning, and supervised learning was performed.

Comparative experiment with partial feature of anchors

To evaluate the feature extraction capabilities of three types of patterns in kmer, we conducted separate comparative experiments on the three types of anchors. Such as each pattern independently, any combination of two patterns, and the fusion of all three patterns. As shown in Table 2, prediction accuracy of various combinations achieves high accuracy across all experimental settings. Furthermore, the performance differences among these approaches are marginal. More detailed comparison results are provided in Table S8 of Supplemental Material.

Table 2: Comparative experiment with partial feature anchors (Accuracy on Test Set Only).

Performance comparison with existing method

From methodological perspective, the frequency feature of kmer considers only the occurrence counts of kmers, while position-weighted k-mer (Pwkmer) method emphasizes their positional information. Our proposed method incorporates information about frequency, position and spacing to ensure a more informative representation of the sequence. Therefore, we evaluated our feature extraction method by comparing it with FCGR (Hatje & Kollmar, 2012), BBC (Liu, Meng & Sun, 2008), RTD (Kolekar, Kale & Kulkarni-Kale, 2012), Pwkmer and KINN, based on frequency, position and correlation of kmer, respectively. Given that each genomic sequence is assigned to a distinct category, every dataset in Dataset 2 of this manuscript represents a multi-class classification problem. Accordingly, we use each sequence’s feature vector as the model input and its category as the prediction target, constituting a multi-class machine learning task.

In experiments, the feature vector extracted by Pwkmer, were multiplied by 1,000 to mitigate bias from genomic scale differences. The final results are presented in Table 3. Due to hardware configuration limitations, feature extraction from the Human Immunodeficiency Virus (HIV) dataset based on KINN method was restricted. Consequently, the corresponding prediction results are not provided.

Table 3: Performance comparison of different methods on virus sequence validation and test datasets.

As shown in Table 3, the key performance metrics—Accuracy, F1 and MCC were evaluated on independent test sets for HIV, HCV, Dengue, and Influenza A. Through performance comparison on different datasets, our method demonstrated superior performance on among the HIV, HCV and Dengue datasets. On the Influenza A dataset, the performance of anchor is ranked third. In summary, anchor shows superior performance on datasets, particularly establishing a significant advantage in HIV virus recognition tasks.

Relationship between sequence variation and anchors

To explore the association between anchor and viral mutations, we analyzed the differences in anchor points contained in the five virus types in Ebola virus dataset and contained in the six virus types in HCV dataset, respectively. By examining whether an anchor appears within a particular type, we found anchor point CGCG appeared only in BDBV type and all the sequences under BDBV type contained this anchor point (see Table 4). Further examination revealed that CGCG only occurs at positions 5,893 and 12,040 in the sequence, so it can be considered as a mutation unique, suggesting potential biological implications. Since the phylogenetic tree shows that RESTV and SUDV are relatively close to each other, we analyzed the anchors that are common and absent from the five types and their locations. As shown in the Table 4, in which the anchor points based on AACG, ACGG, ACGT, CGGT, CGTA, etc., we can easily observe that RESTV and SUDV contain anchor points, and the number and position of anchor points (see in Table S3) are relatively close to each other. We observe that RESTV and SUDV do not contain CCGC, CGCG, GCGT, while the other types contain such anchors. It could explain why these two type strains are relatively close.

Table 4: Information on the location of different types of anchors.

Furthermore, we analyzed the anchor type and location of EBOV genome sequences from different sampling years (1994–1996, 2007); the result was shown in the Table 5. The results show that anchor CCCG only appears at the position 4,069 in 2007, and anchor GCGA only appears at the position 13,855 ( $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document}$ \pm 8 $\end{document}$ ) in the sequence collected in 2014.

Table 5: Anchor position information at different times in EBOV.

Next, the differences of anchor in sequences at different time were analyzed in more detail, results are shown in Table 6 . In the table, ACGC appears at all times except 2007, so it is considered as the base anchor. We find the position of ACGC is unchanged from 1976 to 2002, but it is absent in 2007 and appears again in 2014. Therefore, it may be considered that it has not been naturally selected after mutation. In addition, CCTA is present in position 13,077 from 1976 in 2007, and is absent in 2014. Between 1994 and 1996, the number of anchors in Table S4 increased and the mutation occurred at roughly the same location. However, no anchor appears in 2014, one reason may be that drug inhibition changed the mutation direction of virus. We also find that GCGA is not occurring in 1976–1996, but appear firstly once in 2002. Finally, it is stably preserved at the position near 13,800 in 2014. We can easily observe the position of GTAC is 13,114 in 1976–2014. However, the number of GTAC again increased and stable existence in 2014.

Table 6: Anchor locations for different time portions of the sequence in EBOV.

In addition, to compare positions of anchors in six different types of HCV dataset, we selected one sequence from each type and analyzed its anchor sites. Ultimately, we discovered that, similar to Ebola virus, partial anchor sites differed across types. The anchor site is absent in type 5 but present in most other types, indicating this characteristic is not unique to Ebola virus. The detail results are available in Table S7.

Anchor points, with a sudden increase and stable positions, should be considered as accumulation mutation points with significance in the process of virus evolution. The sudden absence of anchor points may mean that the direction of virus evolution has changed. Thus, the anchor points may be important for determining the direction and type of mutations.

Conclusions

In this study, we defined anchors based on molecular accumulation mutations behaviors, and proposed weighted interval entropy to quantify the information content of anchors in the genome. We used species phylogenetic trees and machine learning classification algorithms to directly and indirectly test the effectiveness of the anchor-based algorithm. Our analysis of the presence, position and number of genomic anchors revealed that they can serve as effective indicators for capturing common sequence features after mutations have accumulated. Our future research will focus on the relationship between anchors and the direction and path of viral accumulation mutations.

Supplemental Information

10.7717/peerj.20940/supp-1Supplemental Information 1The information of sequences of HCV and Ebola.The position and type of anchor in sequence of HCV and Ebola.

Bibliography22

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Bhattacharyya B Mitra U Bhattacharyya R Tandem repeat interval pattern identifies animal taxa Bioinformatics 202137162250225810.1093/bioinformatics/btab 12433677492 · doi ↗ · pubmed ↗
2Domingo E Chapter 2 - Molecular basis of genetic variation of viruses: error-prone replication 2016 Virus as Populations Boston Academic Press 3571
3Gordon AJ Halliday JA Inversions with deletions and duplications Genetics 1995140141141410.1093/genetics/140.1.4117635304 PMC 1206566 · doi ↗ · pubmed ↗
4Hatje K Kollmar M A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method Frontiers in Plant Science 2012319210.3389/fpls.2012.0019222952468 PMC 3429886 · doi ↗ · pubmed ↗
5He L Dong R He RL Yau SS-T Positional correlation natural vector: a novel method for genome comparison International Journal of Molecular Sciences 20202111385910.3390/ijms 2111385932485813 PMC 7312176 · doi ↗ · pubmed ↗
6Häntze H Horton P Effects of spaced k-mers on alignment-free genotyping Bioinformatics 202339 i 213i 22110.1093/bioinformatics/btad 20237387138 PMC 10311327 · doi ↗ · pubmed ↗
7Jia W Xu C Li SC Resolving complex structures at oncovirus integration loci with conjugate graph Briefings in Bioinformatics 20212211310.1093/bib/bbab 35934463709 · doi ↗ · pubmed ↗
8Kolekar P Kale M Kulkarni-Kale U Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping Molecular Phylogenetics and Evolution 201265251052210.1016/j.ympev.2012.07.00322820020 · doi ↗ · pubmed ↗