Taenia Biomolecular Phylogeny and the Impact of Mitochondrial Genes on this Latter
Huda Al-Nayyef, Christophe Guyeux, Jacques M. Bahi

TL;DR
This study investigates how mitochondrial gene variations influence the phylogenetic analysis of the genus Taenia, identifying the most robust gene combinations for accurate evolutionary inference.
Contribution
It analyzes the impact of different mitochondrial gene combinations on phylogenetic tree robustness in Taenia species, highlighting the importance of specific genes for reliable phylogeny.
Findings
Only four gene combinations yielded relevant topologies.
A particular topology was identified with enhanced robustness.
Mitochondrial gene variability affects phylogenetic inference accuracy.
Abstract
Variations in mitochondrial genes are commonly used to infer phylogenies. However, some of these genes are less constrained than others, and may thus blur the phylogenetic signal shared by the majority of the mitochondrial DNA sequences. To investigate such effects, this work studies the molecular phylogeny of the genus Taenia using 14 coding sequences extracted from the mitochondrial genomes of 17 species. We constructed 16,384 trees, using combinations of 1 up to 14 genes. We obtained 131 topologies, and we showed that only four of them were relevant. Using further statistical investigations, we then extracted one particular topology, which proves more robust than the others.
| Species | Accession |
|---|---|
| Taenia asiatica | NC_004826 |
| Taenia crassiceps | NC_002547 |
| Taenia hydatigena | NC_012896 |
| Taenia krepkogorski | NC_021142 |
| Taenia laticollis | NC_021140 |
| Taenia madoquae | NC_021139 |
| Taenia martis | NC_020153 |
| Taenia multiceps | NC_012894 |
| Taenia mustelae | NC_021143 |
| Taenia ovis | NC_021138 |
| Taenia parva | NC_021141 |
| Taenia pisiformis | NC_013844 |
| Taenia saginata | NC_009938 |
| Taenia serialis | NC_021457 |
| Taenia solium | NC_004022 |
| Taenia taeniaeformis | NC_014768 |
| Taenia twitchelli | NC_021093 |
| Echinococcus vogeli | NC_009462 |

| Species | Hoberg et al. [12] | Hoberg et al. [13] | Hoberg [11] | Nakao et al. [18] | Lavikainen et al. [16] | Knapp et al. [14] | Nakao et al. [17] | | Total of studies |
|---|---|---|---|---|---|---|---|---|---|
| Taenia acinonyxi | X | X | X | 3 | |||||
| Taenia asiatica | X | X | X | X | X | X | X | X | 8 |
| Taenia brachyacantha | X | X | 2 | ||||||
| Taenia crassiceps | X | X | X | X | X | X | X | X | 8 |
| Taenia crocutae | X | X | X | X | 4 | ||||
| Taenia dinniki | X | X | 2 | ||||||
| Taenia endothoracicus | X | X | X | 3 | |||||
| Taenia gonyamai | X | X | X | 3 | |||||
| Taenia hyaenae | X | X | X | X | 4 | ||||
| Taenia hydatigena | X | X | X | X | X | X | 6 | ||
| Taenia ingwei | X | X | 2 | ||||||
| Taenia intermedia | X | 1 | |||||||
| Taenia krabbei | X | 1 | |||||||
| Taenia krepkogorski | X | X | 2 | ||||||
| Taenia laticollis | X | X | X | X | X | 5 | |||
| Taenia macrocystis | X | X | X | 3 | |||||
| Taenia madoquae | X | X | X | X | X | X | X | 7 | |
| Taenia martis | X | X | X | X | X | X | X | 7 | |
| Taenia multiceps | X | X | X | X | X | X | X | X | 8 |
| Taenia mustelae | X | X | X | X | X | X | X | 7 | |
| Taenia olngojinei | X | X | X | 3 | |||||
| Taenia omissa | X | X | X | 3 | |||||
| Taenia ovis | X | X | X | X | X | X | 7 | ||
| Taenia parenchymatosa | X | X | 3 | ||||||
| Taenia parva | X | X | X | X | X | X | X | 7 | |
| Taenia pencei | X | 1 | |||||||
| Taenia pisiformis | X | X | X | X | X | 5 | |||
| Taenia polyachantha | X | X | 3 | ||||||
| Taenia pseudolaticollis | X | X | 2 | ||||||
| Taenia regis | X | X | X | X | 4 | ||||
| Taenia rileyi | X | X | X | 3 | |||||
| Taenia saginata | X | X | X | X | X | X | X | X | 8 |
| Taenia selousi | X | X | X | 3 | |||||
| Taenia serialis | X | X | X | X | X | X | 8 | ||
| Taenia simbae | X | X | X | X | 4 | ||||
| Taenia solium | X | X | X | X | X | X | X | X | 8 |
| Taenia taeniaeformis | X | X | X | X | X | X | X | 8 | |
| Taenia taxidiensis | X | X | X | 3 | |||||
| Taenia twitchelli | X | X | X | X | X | X | X | 7 | |
| Echinococcus vogeli | X | X | + | * | 3 | ||||
| Total 40 | 34 | 35 | 31 | 11 | 18 | 16 | 16 | 17 |

| Topology | Lowest bootstrap | Number of occurrences | Average bootstrap | Discarded genes |
|---|---|---|---|---|
| 0 | 82 | 2049 | 44 | Atp6, Cob, Cox2, Nad1, Nad2, Nad3, Nad5 |
| 1 | 84 | 6442 | 51 | Nad1, Nad3, Nad5, Nad6, Rrns |
| 2 | 92 | 3276 | 52 | Cox2, Cox3, Nad4, Nad4l, Nad5, Rrnl, Rrns |
| 3 | 76 | 931 | 48 | Atp6, Cox1, Nad1, Nad3, Nad4, Rrnl |
| 4 | 74 | 452 | 52 | Atp6, Cob, Cox1, Cox3, Nad4, Nad5, Rrnl |
| 5 | 56 | 317 | 28 | Cob, Cox1, Cox2, Cox3, Nad1, Nad2, Nad3, Nad4l, Rrnl, Rrns |
| 6 | 68 | 614 | 39 | Atp6, Cox1, Cox2, Cox3, Nad2, Nad3, Nad5 |
| 7 | 68 | 321 | 43 | Atp6, Cox2, Cox3, Nad1, Nad2, Nad3, Nad4, Nad4l, Nad6, Rrns |
| 8 | 70 | 226 | 46 | Cob, Cox1, Cox2, Cox3, Nad4, Nad4l |
| 9 | 58 | 69 | 39 | Cox1, Cox2, Cox3, Nad1, Nad3, Nad4, Rrns |
| 10 | 74 | 230 | 45 | Atp6, Cob, Cox1, Nad1, Nad2, Nad4, Nad4l, Nad6, Rrnl |
| 11 | 76 | 172 | 53 | Cob, Cox1, Cox2, Cox3, Nad1, Nad3, Nad4, Nad5, Rrnl |
| 12 | 60 | 212 | 30 | Atp6, Cox2, Cox3, Nad1, Nad2, Nad4l, Nad6, Rrns |
| 13 | 56 | 92 | 42 | Atp6, Cob, Cox1, Cox2, Cox3, Nad1, Nad3, Nad4 |
| 14 | 64 | 39 | 44 | Atp6, Cob, Cox1, Cox2, Nad3, Nad4, Nad5, Nad6, Rrns |
| Differences | Top. 0 | Top. 1 | Top. 2 | Top. 3 |
|---|---|---|---|---|
| Top. 0 | _ | T_laticollis T_pisiformis (RF = 2) | T_laticollis T_pisiformis (RF = 4) | T_laticollis T_pisiformis T_solium (RF = 4) |
| Top. 1 | _ | _ | T_hydatigena (RF = 2) | T_solium (RF = 2) |
| Top. 2 | _ | _ | _ | T_hydatigena T_solium (RF = 4) |
| Top. 3 | _ | _ | _ | _ |

| Gene | Top. 0 number | Top. 0 rank | Top. 1 number | Top. 1 rank | Top. 2 number | Top. 2 rank | Top. 3 number | Top. 3 rank |
|---|---|---|---|---|---|---|---|---|
| atp6 | 924 | 9 | 3431 | 8 | 1924 | 4 | 423 | 9 |
| cob | 787 | 11 | 4179 | 2 | 1691 | 6 | 350 | 11 |
| cox1 | 1209 | 4 | 4324 | 1 | 1326 | 11 | 542 | 5 |
| cox2 | 1152 | 6 | 3740 | 6 | 1472 | 8 | 686 | 2 |
| cox3 | 1469 | 3 | 3966 | 4 | 1260 | 13 | 449 | 7 |
| nad1 | 584 | 13 | 3379 | 12 | 2549 | 2 | 257 | 12 |
| nad2 | 84 | 14 | 3391 | 11 | 3004 | 1 | 527 | 6 |
| nad3 | 1142 | 7 | 2708 | 13 | 2069 | 3 | 448 | 8 |
| nad4 | 1699 | 1 | 3677 | 7 | 1339 | 10 | 84 | 13 |
| nad4l | 1153 | 5 | 3421 | 10 | 1291 | 12 | 624 | 3 |
| nad5 | 613 | 12 | 4139 | 3 | 694 | 14 | 883 | 1 |
| nad6 | 937 | 8 | 3421 | 9 | 1887 | 5 | 390 | 10 |
| rrnL | 858 | 10 | 3855 | 5 | 1583 | 7 | 63 | 14 |
| rrnS | 1638 | 2 | 2598 | 14 | 1392 | 9 | 603 | 4 |

| Gene | coef | std err | z | p-value | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| atp6 | -0.2412 | 0.034 | -7.06 | 0.000 | [-0.308, -0.174] |
| cob | 0.6861 | 0.035 | 19.871 | 0.000 | [0.618, 0.754] |
| cox1 | 0.8592 | 0.035 | 24.733 | 0.000 | [0.791, 0.927] |
| cox2 | 0.1444 | 0.034 | 4.231 | 0.000 | [0.078, 0.211] |
| cox3 | 0.4261 | 0.034 | 12.431 | 0.000 | [0.359, 0.493] |
| nad1 | -0.3059 | 0.034 | -8.944 | 0.000 | [-0.373, -0.239] |
| nad2 | -0.2915 | 0.034 | -8.526 | 0.000 | [-0.359, -0.224] |
| nad3 | -1.1113 | 0.035 | -31.673 | 0.000 | [-1.18, -1.042] |
| nad4 | 0.0658 | 0.034 | 1.928 | 0.054 | [-0.001, 0.133] |
| nad4l | -0.2532 | 0.034 | -7.409 | 0.000 | [-0.32, -0.186] |
| nad5 | 0.6381 | 0.034 | 18.512 | 0.000 | [0.571, 0.706] |
| nad6 | -0.2537 | 0.034 | -7.423 | 0.000 | [-0.321, -0.187] |
| rrnL | 0.2873 | 0.034 | 8.403 | 0.000 | [0.22, 0.354] |
| rrnS | -1.2345 | 0.035 | -35.003 | 0.000 | [-1.304, -1.165] |

| Gene | coef | std err | z | p-value | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| atp6 | -1.0959 | 0.069 | -15.901 | 0.000 | [-1.231, -0.961] |
| cob | -1.7306 | 0.073 | -23.593 | 0.000 | [-1.874, -1.587] |
| cox1 | 0.233 | 0.066 | 3.535 | 0.000 | [0.104, 0.362] |
| cox2 | -0.033 | 0.066 | -0.501 | 0.616 | [-0.162, 0.096] |
| cox3 | 1.4431 | 0.071 | 20.327 | 0.000 | [1.304, 1.582] |
| nad1 | -2.6491 | 0.082 | -32.159 | 0.000 | [-2.811, -2.488] |
| nad2 | -5.767 | 0.151 | -38.171 | 0.000 | [-6.063, -5.471] |
| nad3 | -0.0797 | 0.066 | -1.211 | 0.226 | [-0.209, 0.049] |
| nad4 | 2.4925 | 0.08 | 31.017 | 0.000 | [2.335, 2.650] |
| nad4l | -0.0296 | 0.066 | -0.449 | 0.653 | [-0.159, 0.099] |
| nad5 | -2.5196 | 0.081 | -31.123 | 0.000 | [-2.678, -2.361] |
| nad6 | -1.0355 | 0.069 | -15.097 | 0.000 | [-1.170, -0.901] |
| rrnL | -1.403 | 0.071 | -19.803 | 0.000 | [-1.542, -1.264] |
| rrnS | 2.2175 | 0.078 | 28.594 | 0.000 | [2.066, 2.370] |

| Gene | coef | std err | z | p-value | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| atp6 | 0.3534 | 0.055 | 6.479 | 0.000 | [0.247, 0.460] |
| cob | -0.3845 | 0.055 | -7.042 | 0.000 | [-0.492, -0.277] |
| cox1 | -1.5321 | 0.059 | -25.903 | 0.000 | [-1.648, -1.416] |
| cox2 | -1.0754 | 0.057 | -18.956 | 0.000 | [-1.187, -0.964] |
| cox3 | -1.7371 | 0.06 | -28.729 | 0.000 | [-1.856, -1.619] |
| nad1 | 2.305 | 0.065 | 35.601 | 0.000 | [2.178, 2.432] |
| nad2 | 3.6525 | 0.078 | 46.902 | 0.000 | [3.500, 3.805] |
| nad3 | 0.8116 | 0.056 | 14.577 | 0.000 | [0.702, 0.921] |
| nad4 | -1.4916 | 0.059 | -25.323 | 0.000 | [-1.607, -1.376] |
| nad4l | -1.641 | 0.06 | -27.427 | 0.000 | [-1.758, -1.524] |
| nad5 | -3.4317 | 0.075 | -45.481 | 0.000 | [-3.580, -3.284] |
| nad6 | 0.2363 | 0.054 | 4.343 | 0.000 | [0.130, 0.343] |
| rrnL | -0.7259 | 0.055 | -13.1 | 0.000 | [-0.834, -0.617] |
| rrnS | -1.3261 | 0.058 | -22.878 | 0.000 | [-1.440, -1.213] |

| Gene | coef | std err | z | p-value | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| atp6 | -0.9133 | 0.081 | -11.331 | 0.000 | [-1.071, -0.755] |
| cob | -1.3941 | 0.085 | -16.49 | 0.000 | [-1.560, -1.228] |
| cox1 | -0.1349 | 0.078 | -1.739 | 0.082 | [-0.287, 0.017] |
| cox2 | 0.8051 | 0.08 | 10.12 | 0.000 | [0.649, 0.961] |
| cox3 | -0.7384 | 0.08 | -9.278 | 0.000 | [-0.894, -0.582] |
| nad1 | -2.0258 | 0.092 | -22.074 | 0.000 | [-2.206, -1.846] |
| nad2 | -0.2326 | 0.078 | -2.992 | 0.003 | [-0.385, -0.080] |
| nad3 | -0.7489 | 0.08 | -9.406 | 0.000 | [-0.905, -0.593] |
| nad4 | -3.5617 | 0.131 | -27.208 | 0.000 | [-3.818, -3.305] |
| nad4l | 0.4031 | 0.078 | 5.168 | 0.000 | [0.250, 0.556] |
| nad5 | 2.1308 | 0.092 | 23.219 | 0.000 | [1.951, 2.311] |
| nad6 | -1.1314 | 0.082 | -13.766 | 0.000 | [-1.292, -0.970] |
| rrnL | -3.8926 | 0.146 | -26.68 | 0.000 | [-4.179, -3.607] |
| rrnS | 0.2623 | 0.078 | 3.376 | 0.001 | [0.110, 0.415] |
Huda Al-Nayyef1,2, Christophe Guyeux1, and Jacques M. Bahi1
1 FEMTO-ST Institute, UMR 6174 CNRS, DISC Computer Science Department
Université de Franche-Comté, 16, Rue de Gray, 25000 Besançon, France
2 Computer Science Department, University of Mustansiriyah, Iraq
{huda.al-nayyef, christophe.guyeux, jacques.bahi}@univ-fcomte.fr
Index Terms:
Taenia, Phylogeny, Statistical tests
I Introduction
Taenia (Cestoda: Taeniidae) is a genus of tapeworms (a type of helminth) that includes some important parasites of livestock. These parasites cause taeniasis and cysticercosis in humans, two forms of helminthiasis that belong to the group of neglected tropical diseases [22]. Despite intensive research based on morphology and life cycle data, the taxonomy of this genus remains unclear. An essential key to solving the remaining issues raised by Taenia is believed to lie in the study of the large amount of recently available DNA sequences, especially complete mitochondrial (mt) genomes. Genes from mt genomes are classical markers for phylogeny. This DNA presents interesting features for such analysis: its genes are shared by almost all eukaryotes and are present in a single copy, the molecule is maternally inherited and non-recombining in most cases, etc. [4].
Part of the problem is that, even though the amount of information should be sufficient to infer a correct phylogeny of this genus, the presence of homoplasy in individual genes clouds the general phylogenetic message, raising uncertainties at some locations of the tree. The question we discuss in the present work is thus to determine which genes are homoplastic, and which ones tell the story of the species. Our goal is to exhibit a well-supported phylogenetic tree of the genus Taenia. Our analysis relies on recent statistical tools and intensive computations on available biomolecular data [2, 3].
After presenting the major problems that remain to be solved regarding the phylogeny of Taenia, we describe our investigation protocol in detail. We then present how each phylogenetic tree inference has been conducted. Our approach is mainly based on annotating each genome from scratch, using an efficient alignment tool, and applying various mutation models for mitochondrial coding sequences and RNAs. We then explain how we obtained the phylogenetic trees of this study, and how we used them to solve the phylogenetic reconstruction problem for this genus by estimating the influence of each gene on the topology.
To date, 17 complete mitochondrial genomes of Taenia have been published; their list and accession numbers are provided in Table I. These genomes have recently been used to update the phylogeny of this genus using molecular data. As presented in Table II, many previous phylogenetic studies have worked with Taenia species, but none of them provides a well-supported tree for this genus. For this reason, the authors of this paper proposed new computational methods for constructing and finding a well-supported phylogenetic tree for Taenia [1].
The remainder of this article is organized as follows. Section II is devoted to the proposed methodology intended to improve the estimation of the phylogenetic tree. Finer statistical investigations of the homoplastic character of certain genes are detailed in Section III. This article ends with a discussion and a description of possible future work on this problem.
II Materials and methods
II-A Alignment and annotations of coding sequences
To answer the aforementioned questions, Bayesian and maximum likelihood analyses were first performed on either the whole mitogenomes or their twelve protein-coding genes. These analyses were performed using nucleotide sequences and translated amino acid sequences. Tools used during these first runs of analyses were:
- Muscle [6] for aligning complete mitogenomes and T-Coffee [19] for gene alignments;
- NCBI annotations for coding sequences in a first analysis, and then DOGMA [24] in a deeper stage;
- PhyloBayes [15] for Bayesian inference, and PhyML [9] and RAxML [23] for maximum likelihood.
Each time, a support problem was found in at least one location of the obtained tree: at least one bootstrap value was lower than 95, while a commonly accepted rule states that all supports must be larger than this threshold [7]. Partial conclusions of these preliminary studies were that: (1) using coding sequences is better than considering the whole mitogenome, (2) there are inconsistencies in NCBI annotations, (3) T-Coffee alignments seem better than Muscle ones, (4) many coding sequences narrate the story of the genus while others tell their own history, and (5) enlarging the amount of data leads to better-supported trees.
II-B Methodological approach
To solve both the phylogeny of Taenia and the determination of the genes that break it, our solution has been to consider all available or obtainable coding sequences shared by the 18 considered species (17 Taenia species and the outgroup), and to investigate how the inferred phylogenies evolve when using various subsets of these sequences. Doing so extends the first investigations of Hardman et al. [10], who studied the phylogeny of 5 Taeniidae according to each of the 12 mitochondrial genes taken alone. 14 sequences have been extracted from each of the considered species: 12 protein-coding sequences and 2 rRNAs from the mitochondrial genomes. They are listed below.
- Mitochondrial protein-coding sequences: atp6 (ATP synthase 6), cob (cytochrome b), cox1 (cytochrome c oxidase 1), cox2 (cytochrome c oxidase 2), cox3 (cytochrome c oxidase 3), nad1 (NADH dehydrogenase subunit 1), nad2 (NADH dehydrogenase subunit 2), nad3 (NADH dehydrogenase subunit 3), nad4 (NADH dehydrogenase subunit 4), nad4l (NADH dehydrogenase subunit 4L), nad5 (NADH dehydrogenase subunit 5), nad6 (NADH dehydrogenase subunit 6).
- Mitochondrial rRNAs: rrnL (large subunit rRNA), rrnS (small subunit rRNA).
DOGMA, for its part, has been used to annotate from scratch each up-to-date complete mitochondrial genome downloaded from NCBI [5]. Default parameters of DOGMA have been selected, namely an identity cutoff for protein equal to and for coding genes and rRNAs respectively for Taenia species, while these thresholds have been reduced to and for T. mustelae, due to a problem of detection of nad6 and rrnL respectively. The e-value was equal to , and the number of BLAST hits to return has been set to 5.
Each of these 14 coding sequences has been aligned separately using T-Coffee (M-Coffee mode, using 6 cores for multiprocessing). Then 16,384 trees were constructed, corresponding to all the possible combinations of 1, 2, 3, …, and 14 coding sequences among the 14 available ones ($2^{14} = 16{,}384$), as described in Algorithm 1. This computation took 3 months on the “Mésocentre de Calcul de Franche-Comté” supercomputer facilities. The underlying idea was to determine both the most supported phylogenetic trees and the effects of each gene on topologies and supports. RAxML version 8.0.20 was used for maximum likelihood inference, with 3 distinct models/data partitions with joint branch length optimization at each computation, corresponding to the mitochondrial rRNAs, and the mitochondrial protein coding sequences. All free model parameters were estimated by RAxML, for both the GAMMA model of rate heterogeneity and the ML estimate of the alpha parameter. Each time, a maximum of 1000 non-parametric bootstrap inferences was executed, with the MRE-based bootstopping criterion, and E. vogeli was used as outgroup.
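The enumeration of gene combinations described above can be sketched as follows. This is only an illustrative sketch, not the authors' Algorithm 1: the function name `gene_subsets` is hypothetical, and the tree inference step is left out entirely. Note that a strict enumeration of combinations of 1 up to 14 genes yields 16,383 non-empty subsets, one fewer than $2^{14} = 16{,}384$, a count that also includes the empty subset.

```python
from itertools import combinations

# The 14 mitochondrial sequences named in the text.
GENES = ["atp6", "cob", "cox1", "cox2", "cox3", "nad1", "nad2",
         "nad3", "nad4", "nad4l", "nad5", "nad6", "rrnL", "rrnS"]


def gene_subsets(genes):
    """Yield every non-empty subset of genes, smallest subsets first."""
    for size in range(1, len(genes) + 1):
        yield from combinations(genes, size)


# Each subset would then be concatenated and passed to the tree
# inference step (not shown here).
subsets = list(gene_subsets(GENES))
n_subsets = len(subsets)
print(n_subsets)
```

Each tuple in `subsets` lists the genes kept for one tree computation; the discarded genes are simply the complement in `GENES`.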
III Discussion and results
III-A Results
131 topologies were obtained during our computations with 17 species and 1 outgroup. Further information regarding these trees is provided in Table III, in which we investigate the 15 most frequent topologies, which contain 15,442 of the 16,384 trees (94.16%). For each topology, the table provides the lowest bootstrap of the best tree (that is, the tree that maximizes the minimum taken over all its bootstraps), the number of trees having this topology, the average minimal bootstrap value, and the list of genes that have been removed to obtain the best tree having this topology. Only 4 of these 131 topologies occur more than 700 times among the 16,384 obtained trees. They are depicted in Figure 1.
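The per-topology quantities reported in Table III can be computed along the following lines. This is a minimal sketch under stated assumptions: the record layout (topology identifier, list of bootstrap supports, discarded genes) and the toy values are hypothetical, and only illustrate the "best tree" criterion, i.e., the tree maximizing its minimal bootstrap.

```python
from collections import defaultdict

# Hypothetical records: (topology id, bootstrap supports, discarded genes).
trees = [
    (1, [100, 84, 97], ("nad1", "nad3")),
    (1, [100, 70, 90], ()),
    (2, [92, 99, 100], ("cox2", "rrnS")),
    (2, [60, 95, 100], ("cox3",)),
    (1, [88, 74, 100], ("nad5",)),
]


def summarize(trees):
    """Group trees by topology and compute the Table III style statistics."""
    by_topology = defaultdict(list)
    for topo, bootstraps, discarded in trees:
        # A tree is characterized by its weakest support.
        by_topology[topo].append((min(bootstraps), discarded))
    summary = {}
    for topo, entries in by_topology.items():
        # Best tree: the one whose minimal bootstrap is the largest.
        best_min, best_discarded = max(entries, key=lambda e: e[0])
        summary[topo] = {
            "lowest_bootstrap_best_tree": best_min,
            "occurrences": len(entries),
            "average_min_bootstrap": sum(m for m, _ in entries) / len(entries),
            "discarded_genes": best_discarded,
        }
    return summary


summary = summarize(trees)
print(summary[1])
```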
These 4 best topologies, which represent 77.07% of the obtained trees, share most of their structure. For instance, T. madoquae, T. serialis, and T. multiceps are within the same clade, which is sister to the clade consisting of T. asiatica and T. saginata. The differences between these most frequent topologies are depicted with dotted lines in Figure 1, while Table IV summarizes them using the CompPhy tool [8].
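The RF values of Table IV are Robinson-Foulds distances, which count the groupings present in one tree but not in the other. The sketch below illustrates the idea on rooted trees encoded as nested tuples; it is an assumption-laden illustration, since CompPhy may use a different (e.g., unrooted, bipartition-based) variant, and the taxa `A` to `E` are placeholders rather than Taenia species.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaves."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    found.discard(root)  # the full leaf set is shared by all trees
    return found


def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: size of the symmetric difference."""
    return len(clades(t1) ^ clades(t2))


# Two hypothetical trees differing only in the placement of B and C.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))
```

Here the two trees disagree on one clade each ({A, B} versus {A, C}), so the distance is 2, the smallest non-zero value this measure can take between two fully resolved trees.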
Various reasons have led us to consider Topology 1, depicted in Figure 1(a), as the most probable one. Firstly, this is the most frequent topology, representing 39.31% of the produced trees, while the second one (Topology 2) represents only 19.99% of the trees. Furthermore, this topology remains the most frequent one when we focus on trees generated by removing 0, 1, 2, 4, 5, 6, and 7 genes (notice that the largest number of trees is obtained when removing 6 genes for Topology 1, while we need to discard 7 genes to reach the largest populations of trees for Topologies 0, 2, and 3; see Figure 2(a)).
The number of trees whose lowest bootstrap is greater than 70 is nearly the same for Topologies 1 and 2 but, in Topology 1, the largest number of such trees is obtained when 5 genes are discarded, while we need to remove 7 genes to reach this maximum with Topology 2, as depicted in Figure 2(b). Additionally, in Topology 1, the lowest bootstrap of the best tree does not vary much when removing between 0 and 7 genes (it ranges between 70 and 84), while in Topology 2 the best lowest value (92) is obtained with the loss of 7 genes (see Figure 2(c)).
Almost all the results listed above suggest that Topology 1 is the best candidate for reflecting the Taenia phylogeny. However, the fact that the tree having the best lowest bootstrap (92) belongs to Topology 2 raises certain questions. It is true that this tree has been obtained by removing half of the genes, but there is no denying that the most frequent topology and the topology having the most supported tree are not the same. To go deeper in the analysis of these topologies, we used the SuperTriplets tool [21] in the following two experiments. The supertree of all trees belonging to the four topologies presented above was computed first, while in a second run of experiments, the supertree of all 16,384 phylogenetic trees was determined. The results are reproduced in Figures 3(a) and 3(b): each time, Topology 1 was obtained, thus reinforcing the view that this topology should reflect well the evolutionary history of Taenia.
To validate this choice, the next subsections investigate more deeply the relation between genes on the one hand, and both tree topologies and supports on the other hand.
III-B Gene occurrences
A first investigation consists in checking whether the presence of each gene is uniformly distributed in each of the 4 most frequent topologies, using Algorithm 2. Since each of the 16,384 produced trees is constructed using a subset of the 14 available sequences, it is relevant to count the number of occurrences of each of these sequences. Table V summarizes these results.
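The counting step behind Table V can be sketched as below. This is not the authors' Algorithm 2, only an illustration of the idea: the record layout (topology identifier, genes used) and the toy data are hypothetical, and ties in the ranking are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter


def gene_occurrences(trees, topology):
    """Count how often each gene was used among trees of a given topology."""
    counts = Counter()
    for topo, genes_used in trees:
        if topo == topology:
            counts.update(genes_used)
    return counts


def rank_genes(counts):
    """Rank genes from most (rank 1) to least frequent; ties alphabetical."""
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {gene: position for position, gene in enumerate(ordered, start=1)}


# Hypothetical records: (topology id, genes used to build the tree).
trees = [
    (1, ("cob", "cox1", "nad5")),
    (1, ("cox1", "nad5")),
    (1, ("cox1", "rrnS")),
    (2, ("nad1", "nad2")),
]

counts = gene_occurrences(trees, topology=1)
ranks = rank_genes(counts)
print(counts, ranks)
```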
A correlation seems to appear between some genes, either over- or under-represented, and some topologies. More precisely, the following information can be deduced by checking the effects of the three least frequent genes:
1. nad1 is ranked among the least frequent genes in Topologies 0, 1, and 3, while it is the second most frequent one for Topology 2 (i.e., this mitochondrial coding sequence plays an essential role in Topology 2).
2. It seems that taking nad5 into consideration leads to a move of T. hydatigena in the tree, as it is ranked 12 and 14 for Topologies 0 and 2 respectively, and 3 and 1 for Topologies 1 and 3 respectively.
3. Similarly, rrnS seems responsible for the change of position of T. laticollis and T. pisiformis: this gene is ranked in 2nd position for Topology 0, while it is the least frequent gene (last position) for Topology 1.
4. Gene nad5 is ranked first for Topology 3, so it may impact the sister relationship between T. solium and T. ovis.
However, these claims need to be further investigated by a more rigorous statistical approach, which is the aim of the following sections.
III-C Genes influence on topology using Dummy logit model
To investigate more deeply the effects of each coding sequence on the species topology, 4 dummy binary choice logit models have been fitted (one for each of the best topologies) using the scikit-learn [20] module of the Python language. The exogenous design is a $16{,}384 \times 14$ array, each row being a vector of 0's and 1's: a 0 in position $j$ of row $i$ means that, in the $i$-th tree computation, gene number $j$ (in alphabetic order) was discarded, while a 1 means that it was considered. Rows are thus the “observations” while columns correspond to regressors. The 1-d endogenous response variable, for its part, is a vector of size $16{,}384$, having a 1 in position $i$ if and only if Topology 1 has been produced with the choice of genes corresponding to row number $i$ of the exogenous design (resp. Topology 0, 2, or 3 in the three other binary choice logit models). The model has been fitted using maximum likelihood with a Newton-Raphson solver. Convergence was obtained after 8 iterations, and the logit regression results are summarized in Table VI.
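To make the model above concrete, here is a minimal, dependency-free sketch of a dummy logit regression fitted by Newton-Raphson. It is not the authors' code (the text cites scikit-learn): the helper names `solve` and `fit_logit` are hypothetical, the design has only two hypothetical gene columns, and the toy response was chosen so that the maximum likelihood estimates are known in closed form ($\ln 3$ and $-\ln 3$).

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_logit(X, y, max_iter=50, tol=1e-10):
    """Maximum likelihood logit fit (no intercept) via Newton-Raphson."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        p = [sigmoid(sum(b * x for b, x in zip(beta, row))) for row in X]
        # Gradient of the log-likelihood: X^T (y - p).
        grad = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, p))
                for j in range(k)]
        # Negative Hessian: X^T W X with W = diag(p (1 - p)).
        hess = [[sum(pi * (1 - pi) * row[j] * row[l] for row, pi in zip(X, p))
                 for l in range(k)] for j in range(k)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Toy design: columns are two hypothetical genes (1 = kept, 0 = discarded).
X = [[1, 0]] * 4 + [[0, 1]] * 4 + [[1, 1]] * 4
# 1 if the topology of interest was obtained for that row, else 0.
y = [1, 1, 1, 0] + [1, 0, 0, 0] + [1, 1, 0, 0]
beta = fit_logit(X, y)
print(beta)
```

A positive coefficient means that keeping the gene raises the odds of obtaining the topology, a negative one that it lowers them, which is how the coef column of Table VI is read.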
A first conclusion of the results obtained when investigating the impact of each gene on the most supported topology is that all considered coding sequences bring information, except perhaps nad4 (see the p-value column). Additionally, when the effect of a mitochondrial coding sequence on Topology 1 is negative, its impact is not very pronounced, while cob, cox1, and nad5 contribute the most to this topology (see the coef column: a large absolute value means a large effect, while negative coefficients tend to break the topology). All these findings are coherent with the frequency of occurrence of each gene in Topology 1: cob, cox1, and nad5 were present in 12,642 computations leading to this topology (77.07%), while only 8,685 computations with rrnS, nad3, and nad1 have led to this topology (53%), as described in Table V.
Further investigations of the role of each sequence and its effects on each topology are provided in Tables VII, VIII, and IX of the supplementary data, which contain the results of the dummy logit regression for Topologies 0, 2, and 3 respectively.
IV Conclusion
A deep investigation of the molecular phylogeny of the genus Taenia has been performed in this paper. 14 coding sequences, taken from mitochondrial genomes, have been considered for maximum likelihood phylogenetic reconstruction. As the obtained tree was not satisfactorily supported, each combination of 1 to 14 genes has been further investigated, leading to trees representing 131 topologies. Four close topologies were then isolated, whose differences lie in the position of T. hydatigena and the sister relationship between T. laticollis and T. pisiformis. Using the logit model, we have finally shown that Topology 1 is the most probable one and have emphasized the negative role of some genes for that phylogeny.
In future work, the authors intend to use a LASSO test for regressing the bootstrap on the genes. Furthermore, we will investigate the phylogeny of Echinococcus using a similar approach. Indeed, there is no general agreement regarding the phylogeny of this genus. In particular, some species were found to have contradictory positions in the available literature. All the possible combinations of the 12 mitochondrial genes, plus rrnL and rrnS and also 5 nuclear genes, will be considered, leading to the production of phylogenetic trees. Their topologies will be compared, and the influence of each gene on these topologies will be rigorously measured, in order to determine the most probable phylogenetic tree of this genus. Finally, the phylogeny of the class Eucestoda will be investigated using a similar approach.
All computations have been performed using the Mésocentre de Calcul de Franche-Comté facilities.
V Appendices
References
- [1] Bassam Alkindy, Huda Al-Nayyef, Christophe Guyeux, Jean-François Couchot, Michel Salomon, and Jacques Bahi. Improved core genes prediction for constructing well-supported phylogenetic trees in large sets of plant species. In Bioinformatics and Biomedical Engineering, volume 9043, pages 379–390, Granada, Spain, April 2015. Springer.
- [2] Bassam Al Kindy, Bashar Al-Nuaimi, Christophe Guyeux, Jean-François Couchot, Michel Salomon, Reem Alsrraj, and Laurent Philippe. Binary particle swarm optimization versus hybrid genetic algorithm for inferring well supported phylogenetic trees. Computational Intelligence Methods for Bioinformatics and Biostatistics, 9874:165–179, 2016. Revised and extended journal version of the CIBB 2015 conference.
- [3] Bassam Al Kindy, Jean-François Couchot, Christophe Guyeux, Arnaud Mouly, Michel Salomon, and Jacques M. Bahi. Finding the core-genes of chloroplasts. Journal of Bioscience, Biochemistry, and Bioinformatics, 4(5):357–364, 2014. Journal version of ICBBS 14 conference.
- [4] J. William O. Ballard and Michael C. Whitlock. The incomplete natural history of mitochondria. Molecular Ecology, 13(4):729–744, 2004.
- [5] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. GenBank. Nucleic Acids Res., 37(Database issue):26–31, Jan 2009.
- [6] R. C. Edgar. Muscle: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5(1), August 2004.
- [7] Joseph Felsenstein. Confidence limits on phylogenies: an approach using the bootstrap. Evolution, pages 783–791, 1985.
- [8] Nicolas Fiorini, Vincent Lefort, François Chevenet, Vincent Berry, and Anne-Muriel A Chifolleau. CompPhy: a web-based collaborative platform for comparing phylogenies. BMC Evolutionary Biology, 14(1):253, 2014.
