On the accuracy of ancestral sequence reconstruction for ultrametric trees with parsimony
Lina Herbst, Mareike Fischer

TL;DR
This paper proves that for ultrametric trees and the Jukes-Cantor model, Fitch's parsimony method using all terminal taxa is at least as accurate as using any subset, confirming a conjecture for four-state data.
Contribution
It confirms a conjecture that using all terminal taxa with Fitch's method yields optimal accuracy for ancestral sequence reconstruction under the Jukes-Cantor model on ultrametric trees.
Findings
Fitch's method with all taxa is at least as accurate as any subset.
The conjecture is confirmed for four-state models, relevant to DNA/RNA.
Results extend previous two-state data findings to more realistic biological models.
Lina Herbst
Mareike Fischer
Institute for Mathematics and Computer Science, Greifswald University, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany
Abstract
We examine a mathematical question concerning the reconstruction accuracy of the Fitch algorithm for reconstructing the ancestral sequence of the most recent common ancestor given a phylogenetic tree and sequence data for all taxa under consideration. In particular, for the symmetric 4-state substitution model which is also known as Jukes-Cantor model, we answer affirmatively a conjecture of Li, Steel and Zhang which states that for any ultrametric phylogenetic tree and a symmetric model, the Fitch parsimony method using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any particular terminal taxon or any particular pair of taxa. This conjecture had so far only been answered for two-state data by Fischer and Thatte. Here, we focus on answering the biologically more relevant case with four states, which corresponds to ancestral sequence reconstruction from DNA or RNA data.
keywords:
Maximum Parsimony, ancestral sequence reconstruction, reconstruction accuracy, symmetric 4-state model
MSC:
[2010] 00-01, 99-00
Journal: Journal of Theoretical Biology
1 Introduction
The reconstruction of ancestral sequences, e.g. DNA sequences of common ancestors of present-day species, is an important approach to understanding the evolution and origin of these species [1, 2, 3]. There exist various methods for such reconstructions, e.g. the Fitch algorithm [4, 5, 6], which is based on the Maximum Parsimony criterion. However, how reliable is such a reconstruction?
Several studies analyzed the reliability, the so-called reconstruction accuracy, of the Fitch algorithm for reconstructing ancestral sequence data of the most recent common ancestor given a phylogenetic tree and sequences for all taxa under consideration [1, 7, 8]. It seems intuitive that the root state is more likely to be conserved for taxa that are closer to the root, since over time more sequence changes can occur. Moreover, one might expect that the reconstruction accuracy is highest when all taxa are taken into account, which was also suggested by earlier simulation studies [9]. However, it can be shown that there are cases in which the reconstruction accuracy improves when only a subset of taxa is considered [1, 7]. In particular, the reconstruction accuracy can even improve when a taxon close to the root is ignored [7].
Despite these counterintuitive results, in 2008 Li et al. conjectured that for any rooted binary ultrametric phylogenetic tree (i.e. a tree in which all leaves have the same distance to the root) and a simple model of evolution, the Fitch algorithm using all taxa for ancestral state reconstruction is at least as accurate as using a single taxon [1]. Note that ultrametric trees are also often referred to as clocklike trees or molecular clocks. So the conjecture by Li et al. means that under a molecular clock, the reconstruction accuracy is at least as good as the conservation probability of any taxon. Note that under a molecular clock all taxa have the same conservation probability, and that this conjecture provides a lower bound on the reconstruction accuracy for any rooted binary ultrametric phylogenetic tree under a simple model of evolution. Ignoring all data besides the data of one species represents the extreme case of throwing information away. Thus, showing that the conjecture holds is good news for Maximum Parsimony as a criterion for ancestral state reconstruction.
In 2009, Fischer and Thatte [7] proved the conjecture for two-state characters, but it remained unclear if it also holds for 4-state data like DNA or RNA. Thus, the aim of this paper is to consider this biologically relevant case with four states. In particular, we answer the conjecture affirmatively. Additionally, we also prove that the conjecture holds for three-state characters. Along the way, we also prove that the Fitch parsimony method applied to all taxa is always at least as good as applied to any pair of taxa if the underlying tree is clocklike. However, we also show that this does not improve the lower bound induced by single leaves.
2 Preliminaries
Before we can present our results, we first have to introduce some basic concepts. Recall that a rooted binary phylogenetic tree $T$ on the leaf set $X$ is a connected, acyclic graph in which the vertices of degree 1 are called leaves, in which there is exactly one node of degree 2, which is referred to as the root $\rho$, and in which all other non-leaf nodes have degree 3. Moreover, in a rooted binary phylogenetic $X$-tree the leaves are bijectively labelled by the elements of $X$. Let each vertex of the tree be assigned a state, i.e. an element of a finite state set $\mathcal{R}$ with $|\mathcal{R}| = r$. In particular, we are interested in the biologically relevant case with four states, e.g. $\mathcal{R} = \{\alpha, \beta, \gamma, \delta\}$, which corresponds for instance to DNA or RNA data.
The states evolve from the root $\rho$ by the well-known symmetric $r$-state model $N_r$ with alphabet $\mathcal{R}$ [4]. In this model, a state of $\mathcal{R}$ is selected as the root state with probability $1/r$. Assume that $e = (u, v)$ is an edge of the tree, and node $u$ is closer to the root than $v$. Then in this model, $p_e$ is the substitution probability on edge $e$: it is the probability that $v$ is in some state $c_i$ under the condition that $u$ is in a distinct state, say, $c_j$. This is denoted by $P(v = c_i \mid u = c_j)$. The model is supposed to be symmetric, thus $P(v = c_i \mid u = c_j) = P(v = c_j \mid u = c_i)$. Furthermore, we assume that $0 \le p_e \le 1/r$, in particular for four states we have $0 \le p_e \le 1/4$. The biologically relevant case with four states, namely the $N_4$-model, is also often referred to as the Jukes-Cantor model [10].
Similarly to [7, 11], we consider ultrametric trees, often known as clocklike trees or molecular clocks by biologists. This means that the expected number of substitutions from the root to any leaf is the same [5].
In this manuscript we reconstruct ancestral states by the Maximum Parsimony criterion with the Fitch algorithm, which we briefly explain now. Assume that we have a rooted binary tree with leaf set . To introduce the Fitch algorithm, we first consider the kind of data we will map onto the leaves of the tree. The data is given by a character on a leaf set , which is a function . Thus, each leaf is assigned a character state. Note that as we consider , we often write instead of listing explicitly.
Then the Fitch algorithm [6] assigns a set of states to all interior vertices by minimizing the number of changes. The algorithm is based on Fitch's parsimony operation. To define it, let $\mathcal{R}$ be a non-empty finite alphabet and let $A, B \subseteq \mathcal{R}$. Then, Fitch's parsimony operation $*$ is defined by
\[ A * B := \begin{cases} A \cap B, & \text{if } A \cap B \neq \emptyset, \\ A \cup B, & \text{otherwise.} \end{cases} \]
Using this operation, the Fitch algorithm works as follows. Each leaf is first assigned the singleton set containing its character state. Consider all vertices $v$ whose two direct descendants have already been assigned a set, say $A$ and $B$. Then, $v$ is assigned $A * B$. This step is continued upwards along the tree until the root $\rho$ is assigned a set, which is denoted by $X_\rho$. An example can be seen in Figure 1.
Note that what we call the Fitch algorithm is in fact only one phase of the algorithm, but it is the only part we require to estimate potential root states. For more details we refer to [6].
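The bottom-up phase described above can be sketched in a few lines of Python. This is only an illustration: the nested-tuple encoding of the tree and the function names `parsimony_op` and `fitch_root_set` are assumptions made for this sketch and do not appear in [6].

```python
from typing import FrozenSet, Tuple, Union

# Hypothetical encoding for this sketch: a tree is either a leaf, given by
# its character state (e.g. "A"), or a pair (left_subtree, right_subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def parsimony_op(a: FrozenSet[str], b: FrozenSet[str]) -> FrozenSet[str]:
    """Fitch's parsimony operation: the intersection if non-empty, else the union."""
    common = a & b
    return common if common else a | b


def fitch_root_set(tree: Tree) -> FrozenSet[str]:
    """Bottom-up phase of the Fitch algorithm; returns the set assigned to the root."""
    if isinstance(tree, str):
        # A leaf is assigned the singleton set containing its state.
        return frozenset([tree])
    left, right = tree
    return parsimony_op(fitch_root_set(left), fitch_root_set(right))


# Four leaves with states A, C, C, G: the two cherries receive {A, C} and
# {C, G}, whose intersection is non-empty, so the root receives that intersection.
print(sorted(fitch_root_set((("A", "C"), ("C", "G")))))
```

Only the root set is returned, since this is the only output needed to estimate potential root states.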
For a 4-state character there are $2^4 - 1 = 15$ possible sets for each interior vertex: the power set of an alphabet with four elements has $2^4 = 16$ elements, and the empty set is excluded.
We say that the Fitch algorithm unambiguously reconstructs the root state if the set assigned to the root contains exactly one state. Otherwise the root state is reconstructed ambiguously, i.e. the method cannot decide between different states and the root set therefore contains more than one state.
Note that real data usually comes in the form of an alignment, i.e. a sequence of characters, rather than in the form of an individual character. In this case, the Fitch algorithm would consider each character, i.e. each column (“site”) of the alignment, separately. This is why we focus on the case of a single character and its reconstruction accuracy.
3 The accuracy of ancestral sequence reconstruction with 4-state characters
Similar to Li et al., we now define the reconstruction accuracy for all [1]. Therefore, let denote the set of character states chosen by the Fitch algorithm as possible root states when applied to character on tree .
Let and . The probability that the root state evolves on to a character for which the Fitch algorithm assigns as possible root state set is given by .
The reconstruction accuracy is then defined by
[TABLE]
To illustrate this definition, consider the case with . In this case, the reconstruction accuracy for the Fitch algorithm for ancestral state reconstruction is given by
[TABLE]
where we define
[TABLE]
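To make the definition concrete, the reconstruction accuracy of a small tree can be computed by brute force. The sketch below enumerates every character that can evolve from a fixed root state, weights it by its probability, and credits a root set $A$ containing the true state with $1/|A|$, which is our reading of the definition in [1]. The tree encoding, the names `characters`, `fitch_root_set` and `reconstruction_accuracy`, the letters used for the four states, and the example probabilities are all assumptions of this illustration.

```python
from collections import defaultdict
from itertools import product

STATES = "ACGT"  # stand-ins for the four states of the symmetric model
LEAF = "leaf"    # a tree is LEAF or ((left, p_left), (right, p_right)),
                 # where p is the probability of one specific change on that edge


def parsimony_op(a, b):
    common = a & b
    return common if common else a | b


def characters(node, state):
    """Yield (leaf_states, probability) for each way `state` can evolve below `node`."""
    if node == LEAF:
        yield (state,), 1.0
        return
    per_child = []
    for child, p in node:
        options = []
        for c in STATES:
            trans = 1 - 3 * p if c == state else p  # no change vs. one specific change
            options += [(f, trans * q) for f, q in characters(child, c)]
        per_child.append(options)
    for (f1, q1), (f2, q2) in product(*per_child):
        yield f1 + f2, q1 * q2


def fitch_root_set(node, leaf_states):
    """Root set of the bottom-up Fitch phase; reads leaf states from left to right."""
    if node == LEAF:
        return frozenset([next(leaf_states)])
    (left, _), (right, _) = node
    a = fitch_root_set(left, leaf_states)
    b = fitch_root_set(right, leaf_states)
    return parsimony_op(a, b)


def reconstruction_accuracy(tree, root_state="A"):
    prob_of_character = defaultdict(float)
    for f, q in characters(tree, root_state):
        prob_of_character[f] += q  # sum over all interior state assignments
    total = 0.0
    for f, q in prob_of_character.items():
        root_set = fitch_root_set(tree, iter(f))
        if root_state in root_set:
            total += q / len(root_set)  # ties are broken uniformly at random
    return total


two_leaf_tree = ((LEAF, 0.1), (LEAF, 0.1))
print(reconstruction_accuracy(two_leaf_tree))
```

Because the model is symmetric and the root state is drawn uniformly, conditioning on a single root state gives the same value as averaging over all four. The enumeration grows exponentially with the number of vertices, so it is only meant for checking small examples.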
The main aim of this manuscript is to show that, for a rooted binary ultrametric phylogenetic tree under the $N_4$-model, the Fitch algorithm using all terminal taxa is at least as accurate for ancestral state reconstruction as using any particular terminal taxon. This provides a lower bound on the reconstruction accuracy, and is stated in the following theorem.
Theorem 1.
For any rooted binary phylogenetic ultrametric tree $T$ and the $N_4$-model, the Fitch algorithm using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any particular terminal taxon, that is
[TABLE]
The proof of Theorem 1 requires some more general properties. Therefore, we first turn our attention to the following. If not stated otherwise, we always consider rooted binary ultrametric phylogenetic trees under the $N_4$-model. Due to the symmetry of the model, we can assume without loss of generality that the root is in state $\alpha$, so $\alpha$ evolves along the tree to a character $f$ on $X$. Let $p$ be the probability that from the root to one leaf the state changes from $\alpha$ to one specific state in $\mathcal{R} \setminus \{\alpha\}$, i.e. $3p$ is the probability that a given leaf is not in state $\alpha$.
Therefore, in the case of the $N_4$-model, $1 - 3p$ is the probability that the root is in the same state as one leaf, since three different changes ($\alpha \to \beta$, $\alpha \to \gamma$, $\alpha \to \delta$) can occur. This is at the same time the reconstruction accuracy when only one leaf is taken into account. The main aim of this paper is to show that $1 - 3p$ is a lower bound for the reconstruction accuracy; that is, considering all taxa under a molecular clock is always at least as good as considering just one taxon.
As shown in Figure 2, every binary tree can be decomposed into two maximal pending subtrees and with leaf sets and (). This is the so-called standard decomposition [5]. We denote the children of by and , and with probability one specific change occurs from to (). Analogously, one specific change occurs from to any leaf with probability (). Note that can then be calculated by all possibilities given for one specific change from to any leaf. Suppose that the root is in state and leaf in state (without loss of generality we have ). Then there are four different possibilities for a change from to :
[TABLE]
Thus,
[TABLE]
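The case distinction over the state of the intermediate vertex can be checked numerically. In the sketch below, `p_upper` and `p_lower` are hypothetical names for the probabilities of one specific change on the upper path (root to child) and on the lower path (child to leaf); the function sums the four possibilities for the state of the child. For four states the sum simplifies to `p_upper + p_lower - 4 * p_upper * p_lower`.

```python
def compose(p_upper: float, p_lower: float, r: int = 4) -> float:
    """Probability of one specific net change along two consecutive paths
    under a symmetric r-state model (a sketch, not the paper's notation)."""
    stay_upper = 1 - (r - 1) * p_upper
    stay_lower = 1 - (r - 1) * p_lower
    return (
        stay_upper * p_lower              # child keeps the root state, change happens below
        + p_upper * stay_lower            # child already shows the leaf state
        + (r - 2) * p_upper * p_lower     # child is in one of the remaining r - 2 states
    )


# Compare the case-by-case sum with the simplified closed form for r = 4.
a, b = 0.1, 0.05
print(compose(a, b), a + b - 4 * a * b)
```

The expression is symmetric in its two arguments, and a path with change probability $1/4$ stays at $1/4$ after composition, as expected for a fully randomized state.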
Furthermore, for we define , and similarly .
Under the model assumptions of the -model, due to the symmetry, we have that
[TABLE]
since e.g.
[TABLE]
Therefore by (2), (4) and (5), can be simplified and becomes
[TABLE]
Moreover, we define
[TABLE]
Again, by the symmetry of the -model, we obtain
[TABLE]
Biologically this means that under the assumption that is the true root state, the probability that evolves to a character for which the Fitch algorithm assigns to the root is the same as for and , since each specific change occurs with probability .
This brings us to our next result, where and are linked to each other.
Lemma 1.
For any rooted binary phylogenetic tree $T$ and the $N_4$-model we have that
[TABLE]
Note that Lemma 1 does not require the underlying tree to be ultrametric.
The proof of Lemma 1 is by induction on and is presented in the appendix. For this proof and also for the proof of Theorem 1 we state some recursions required for the induction. Therefore, we define as a restriction of to for : . For the probability to obtain a set as estimate state for with the Fitch algorithm under the assumption that is in state can be defined using the law of total probability:
[TABLE]
Then with (4),(5),(7) we have:
[TABLE]
With (8), (9), (10), (11), (12), (13) and (14) we therefore have
[TABLE]
As stated before, all these recursions are needed for the proof of Lemma 1 and Theorem 1. Now, we are in the position to prove Theorem 1, our main result, which states a lower bound on .
Proof.
The proof is by induction on . In order to show , we define , and show that is non-negative.
For the subtrees and both contain one leaf, and thus
[TABLE]
This shows that is non-negative and thus , which completes the base case of the induction.
Now, we show by induction that is non-negative. Suppose that has taxa and that is non-negative for all trees having fewer than taxa. We define for . Thus, and are non-negative since and contain both fewer than taxa.
By elementary term conversion we can show that
[TABLE]
The exact conversions can be found in the appendix.
Moreover, note that are all probabilities and therefore are all non-negative for . By Lemma 1 we have that (for ) and are non-negative, resulting in (19) being non-negative. This implies and thus . This completes the proof. ∎
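The statement of Theorem 1 can be checked numerically on small ultrametric trees. The sketch below follows the recursive idea used in the proof: for each vertex it computes the distribution of the Fitch set conditional on the state of that vertex, by combining the distributions of the two pending subtrees. The tree encoding, all function names, the letters used for the states and the example probabilities are assumptions of this illustration, and the single-leaf bound is taken to be one minus three times the probability of one specific change from the root to a leaf.

```python
from collections import defaultdict
from itertools import product

STATES = "ACGT"  # stand-ins for the four states
LEAF = "leaf"    # a tree is LEAF or ((left, p_left), (right, p_right))


def parsimony_op(a, b):
    common = a & b
    return common if common else a | b


def root_set_distribution(node, state):
    """Return {A: P(Fitch set at `node` equals A | `node` is in `state`)}."""
    if node == LEAF:
        return {frozenset([state]): 1.0}
    child_dists = []
    for child, p in node:
        dist = defaultdict(float)
        for child_state in STATES:
            # law of total probability over the state of the child
            trans = 1 - 3 * p if child_state == state else p
            for s, q in root_set_distribution(child, child_state).items():
                dist[s] += trans * q
        child_dists.append(dist)
    combined = defaultdict(float)
    # the two subtrees evolve independently given the state of `node`
    for (s1, q1), (s2, q2) in product(child_dists[0].items(), child_dists[1].items()):
        combined[parsimony_op(s1, s2)] += q1 * q2
    return combined


def reconstruction_accuracy(tree, root_state="A"):
    dist = root_set_distribution(tree, root_state)
    return sum(q / len(s) for s, q in dist.items() if root_state in s)


def compose(p_upper, p_lower):
    """One specific net change along two consecutive paths (four states)."""
    return p_upper + p_lower - 4 * p_upper * p_lower


# Ultrametric three-leaf tree: a cherry below the root on one side and a
# single leaf on the other, both sides with the same root-to-leaf probability.
p_top, p_bottom = 0.05, 0.05
p = compose(p_top, p_bottom)
cherry = ((LEAF, p_bottom), (LEAF, p_bottom))
tree = ((cherry, p_top), (LEAF, p))

print(reconstruction_accuracy(tree), 1 - 3 * p)
```

On this example the first printed value should not be smaller than the second, in line with the theorem. Such a check is of course no substitute for the proof.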
We have shown that the reconstruction accuracy using all terminal taxa is always greater than or equal to the conservation probability of one single taxon. Moreover, the base case of the proof of Theorem 1 provides more insight into the reconstruction accuracy of 2-taxon trees under the $N_4$-model.
Corollary 1.
Let be a rooted binary ultrametric phylogenetic tree on taxon set with . Let denote the probability of change from the root to any leaf under the -model. Then, the reconstruction accuracy for ancestral state reconstruction using the Fitch algorithm is given by
[TABLE]
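The two-taxon value can be verified by a short hand computation. The derivation below is a sketch under two assumptions: $p$ is read as the probability of one specific change from the root to a leaf, so that a leaf retains the root state $\alpha$ with probability $1 - 3p$, and an ambiguous root set of size two is credited with weight $1/2$, following the definition of reconstruction accuracy in [1]. Here $x$ stands for any state different from $\alpha$.

```latex
\begin{align*}
RA(T)
  &= \underbrace{(1-3p)^2}_{\text{both leaves in } \alpha,\ \text{root set } \{\alpha\}}
   + \frac{1}{2} \cdot \underbrace{2 \cdot 3p\,(1-3p)}_{\text{exactly one leaf in } \alpha,\ \text{root set } \{\alpha, x\}} \\
  &= (1-3p)\bigl[(1-3p) + 3p\bigr] \\
  &= 1-3p .
\end{align*}
```

All remaining characters have no leaf in state $\alpha$, so the root set does not contain $\alpha$ and they contribute nothing.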
Corollary 1 states the reconstruction accuracy for ancestral state reconstruction with the Fitch algorithm using ultrametric 2-taxon trees, which is the same as the accuracy obtained when using one terminal taxon. In the following proposition we show that the reconstruction accuracy with the Fitch algorithm using any two terminal taxa of a taxon set takes the same value.
Proposition 1.
For any rooted binary phylogenetic ultrametric tree $T$ and the $N_4$-model, the reconstruction accuracy for the Fitch algorithm using any two terminal taxa for ancestral state reconstruction is given by
[TABLE]
Proof.
Let be two terminal taxa of any rooted binary ultrametric phylogenetic tree . Moreover, we consider the standard decomposition of into its two maximal pending subtrees and as depicted in Figure 2. Thus, the proof is divided into two cases.
In the first case we have without loss of generality and . By Corollary 1 the reconstruction accuracy using and is then .
In the second case we have either or . Thus, without loss of generality we consider as depicted in Figure 3. Let be the last common ancestor of and , i.e. the first node that occurs both on the path from to as well as on the path from to . Let be the subtree of that consists of the paths from to and , respectively, as well as all vertices which lie on one of these paths. is depicted with dotted lines in Figure 3. Thus, the root of is . In addition, let be the probability for one specific change from to , and let be the probability for one specific change from to or .
By (6) we have
[TABLE]
Note that since we cannot obtain sets with more than two elements with the Fitch algorithm when only and are used for the reconstruction.
In the following, we use the notation for the restriction of character on taxa and .
Furthermore, we have
[TABLE]
Moreover,
[TABLE]
Thus by (21) and (22), (20) becomes
[TABLE]
Therefore, the claimed value is obtained in both cases, which completes the proof. ∎
This proposition provides us with the reconstruction accuracy for the Fitch algorithm when any two terminal taxa are considered. Note that this reconstruction accuracy is the same as when only one terminal taxon is taken into account. Therefore, by Theorem 1 and Proposition 1 we have the following corollary, which states that the lower bound on the reconstruction accuracy holds for any two terminal taxa. In particular, considering two taxa rather than one cannot improve the lower bound given by Theorem 1.
Corollary 2.
For any rooted binary phylogenetic ultrametric tree $T$ and the $N_4$-model, the Fitch algorithm using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any two terminal taxa, that is
[TABLE]
This statement completes Section 3, and we now take a look at similar results obtained for the $N_3$-model.
4 The accuracy of ancestral sequence reconstruction with 3-state characters
Under the same assumptions as for the 4-state model, similar results can be obtained for the 3-state alphabet . In this case, the reconstruction accuracy is given by
[TABLE]
Then Theorem 2 and Lemma 2 can be formulated similarly to the statements before. Both proofs are left out, since they can be done analogously. However, we want to emphasize that the conjecture stated by Li et al. also holds for the $N_3$-model.
Theorem 2.
For any rooted binary phylogenetic ultrametric tree $T$ and the $N_3$-model, the Fitch algorithm using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any particular terminal taxon, that is
[TABLE]
By Theorem 2, a lower bound on the reconstruction accuracy for rooted binary ultrametric phylogenetic trees is also given for three states.
Note that the analogs of Lemma 1, Corollary 1, Proposition 1 and Corollary 2 also hold under the $N_3$-model. In particular, the reconstruction accuracy for ultrametric trees is then at least the conservation probability of a single leaf. The exact statements and their proofs can be found in the appendix.
5 Conclusion and Discussion
In this paper we considered the reconstruction accuracy of the Fitch algorithm for ancestral state reconstruction. In particular, we analyzed rooted binary ultrametric phylogenetic trees under the $N_4$-model. For an ultrametric tree the probability of a change from the root to any leaf is the same. For such trees, we investigated a lower bound on the reconstruction accuracy by answering affirmatively the conjecture by Li, Steel and Zhang, which stated that for rooted binary ultrametric phylogenetic trees under the symmetric $N_r$-model the reconstruction accuracy using all terminal taxa is at least as high as the conservation probability of any leaf. In 2009, Fischer and Thatte had already shown that this conjecture holds for two-state characters, but it remained unknown whether this result could be extended to three or more character states. In particular, the biologically relevant case of $r = 4$, which corresponds to the DNA or RNA alphabet, remained unclear.
The main result of this manuscript is the proof of the conjecture for $r = 4$, which provides a lower bound on the reconstruction accuracy. As mentioned before, the conjecture also holds for the $N_3$-model. In the past, several studies showed that in some cases, the Fitch algorithm provides better results when some data are disregarded [1, 7]. This led to a critical view on Maximum Parsimony as a method for ancestral state reconstruction. But as we have shown here, at least for ultrametric trees, the extreme case of disregarding all data except for one or two leaves can never improve the reconstruction accuracy of the Fitch algorithm. In this sense, our results are good news for Maximum Parsimony as a method for ancestral state reconstruction.
To conclude, the generalization to the $N_r$-model for $r \geq 5$ is still open, but we conjecture that the statement also holds in that case.
6 Appendix
Proof of Lemma 1
To prove Lemma 1 we show that for any rooted binary phylogenetic tree under a symmetric 4-state substitution model
[TABLE]
by induction on . For the subtrees and both contain one leaf, and hence leads to
[TABLE]
Therefore
[TABLE]
Moreover
[TABLE]
and
[TABLE]
which completes the base case of the induction. For the inductive step we first state some more recursions using (9), (10), (11), (12) and (13):
[TABLE]
[TABLE]
[TABLE]
Moreover we have that for
[TABLE]
and thus
[TABLE]
In the same manner by (10), (11), (12) and (13) we can see that
[TABLE]
Therefore
[TABLE]
Additionally we have the following: choose sets from and from such that for , respectively. Then we have that
[TABLE]
Now suppose that $T$ has $n$ taxa and that (24), (25) and (26) are true for all trees having fewer than $n$ taxa. Note that therefore (36) is non-negative, since $T_1$ and $T_2$ both contain fewer than $n$ taxa. Then
[TABLE]
[TABLE]
[TABLE]
[TABLE]
By (36) and the inductive assumption this term is non-negative, and therefore concludes the proof for . We now proceed with the second part of Lemma 1.
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Again by (36) and the inductive assumption this term is non-negative, and therefore concludes the proof for . Moreover we have
[TABLE]
By (36) and the inductive assumption is non-negative, and therefore concludes the proof of the last part of Lemma 1. ∎
Extension to the proof of Theorem 1
First of all, we state some equations which help to show (19).
[TABLE]
By (37) we have that
[TABLE]
and
[TABLE]
Furthermore, the following expressions can be simplified by (6), (38) and (39).
[TABLE]
Moreover,
[TABLE]
Additionally,
[TABLE]
and
[TABLE]
Furthermore,
[TABLE]
and
[TABLE]
By using the simplifications stated before we can now rewrite .
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Then
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Lemma 2.
For any rooted binary phylogenetic tree $T$ and the $N_3$-model we have that
[TABLE]
Note that Lemma 2 also does not require the underlying tree to be ultrametric.
Proof.
To prove Lemma 2 we show that for any rooted binary phylogenetic tree under a symmetric 3-state substitution model
[TABLE]
by induction on . For the subtrees and both contain one leaf, and hence leads to
[TABLE]
Therefore
[TABLE]
Moreover
[TABLE]
which completes the base case of the induction. For the inductive step we first define some recursions similar to (8), (9), (10), (11) and (12):
[TABLE]
With (48), (49), (50), (51) and (52) we therefore have:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Moreover we have that for
[TABLE]
and thus
[TABLE]
In the same manner by (50) and (51) we can see that
[TABLE]
Therefore
[TABLE]
Additionally we have the following: choose sets from and from such that for , respectively. Then we have that
[TABLE]
Now suppose that $T$ has $n$ taxa and that (46) and (47) are true for all trees having fewer than $n$ taxa. Note that therefore (61) is non-negative, since $T_1$ and $T_2$ both contain fewer than $n$ taxa. Then
[TABLE]
By the inductive assumption this term is non-negative, and therefore concludes the proof for . We now proceed with the second part of Lemma 2.
[TABLE]
By inductive assumption is non-negative, and therefore concludes the proof of the second part of Lemma 2. ∎
Corollary 3.
Let be a rooted binary ultrametric phylogenetic tree on taxon set with . Let denote the probability of change from the root to any leaf under the -model. Then, the reconstruction accuracy for ancestral state reconstruction using the Fitch algorithm is given by
[TABLE]
Proposition 2.
For any rooted binary phylogenetic ultrametric tree $T$ and the $N_3$-model, the reconstruction accuracy for the Fitch algorithm using any two terminal taxa for ancestral state reconstruction is given by
[TABLE]
Proof.
Let be two terminal taxa of any rooted binary ultrametric phylogenetic tree . Moreover, we consider the standard decomposition of into its two maximal pending subtrees and as depicted in Figure 2. Thus, the proof is divided into two cases.
In the first case we have without loss of generality and . By Corollary 3 the reconstruction accuracy using and is then .
In the second case we have either or . Thus, without loss of generality we consider as depicted in Figure 3. Let be the last common ancestor of and , i.e. the first node that occurs both on the path from to as well as on the path from to . Let be the subtree of consisting of the paths from to and , respectively. is depicted with dotted lines in Figure 3. Thus, the root of is . In addition, let be the probability for one specific change from to , and let be the probability for one specific change from to or . By (23) we have
[TABLE]
Note that since we cannot obtain sets with more than two elements with the Fitch algorithm when only and are used for the reconstruction.
In the following, we use the notation for the restriction of character on taxa and .
Furthermore, we have
[TABLE]
Moreover,
[TABLE]
Thus by (63) and (64), (62) becomes
[TABLE]
Therefore, the claimed value is obtained in both cases, which completes the proof. ∎
Corollary 4.
For any rooted binary phylogenetic ultrametric tree $T$ and the $N_3$-model, the Fitch algorithm using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any two terminal taxa, that is
[TABLE]
References
- [1] G. Li, M. Steel, L. Zhang, More taxa are not necessarily better for the reconstruction of ancestral character states, Systematic Biology 57 (4) (2008) 647–653. doi:10.1080/10635150802203898.
- [2] D. A. Liberles (Ed.), Ancestral Sequence Reconstruction, Oxford University Press, 2007.
- [3] J. Yang, J. Li, L. Dong, S. Grünewald, Analysis on the reconstruction accuracy of the Fitch method for inferring ancestral states, BMC Bioinformatics 12 (18). doi:10.1186/1471-2105-12-18.
- [4] C. Tuffley, M. Steel, Links between maximum likelihood and maximum parsimony under a simple model of site substitution, Bulletin of Mathematical Biology 59 (3) (1997) 581–607. doi:10.1007/BF02459467.
- [5] C. Semple, M. Steel, Phylogenetics, Oxford Lecture Series in Mathematics and its Applications, 2003.
- [6] W. M. Fitch, Toward defining the course of evolution: Minimum change for a specific tree topology, Systematic Zoology 20 (4) (1971) 406–416. doi:10.2307/2412116.
- [7] M. Fischer, B. D. Thatte, Maximum parsimony on subsets of taxa, Journal of Theoretical Biology 260 (2) (2009) 290–293. doi:10.1016/j.jtbi.2009.06.010.
- [8] L. Zhang, J. Shen, J. Yang, G. Li, Analyzing the Fitch method for reconstructing ancestral states on ultrametric phylogenetic trees, Bulletin of Mathematical Biology 72 (2010) 1760–1782. doi:10.1007/s11538-010-9505-8.
