On the accuracy of ancestral sequence reconstruction for ultrametric trees with parsimony

Lina Herbst; Mareike Fischer

arXiv:1706.06085·q-bio.PE·June 20, 2017

TL;DR

This paper proves that for ultrametric trees under the Jukes-Cantor model, Fitch's parsimony method using all terminal taxa is at least as accurate as using any single terminal taxon or any pair of taxa, confirming a conjecture of Li, Steel and Zhang for four-state data.

Contribution

It confirms the conjecture that, under the Jukes-Cantor model on ultrametric trees, Fitch's method applied to all terminal taxa reconstructs the root state at least as accurately as any single taxon. This gives the lower bound $RA(X)\geq 1-3p$ on the reconstruction accuracy.

Findings

1. Fitch's method with all taxa is at least as accurate as with any single taxon or any pair of taxa.

2. The conjecture is confirmed for the four-state model, which is relevant to DNA/RNA data, and also for the three-state model.

3. The results extend the earlier two-state findings of Fischer and Thatte to biologically more realistic models.

Equations

$$A*B\coloneqq\begin{cases}A\cap B,&\text{if }A\cap B\neq\emptyset,\\ A\cup B,&\text{otherwise.}\end{cases}$$

$$RA(X)\coloneqq\sum_{\mathcal{R}\subseteq\mathcal{A},\ \alpha\in\mathcal{R}}\frac{1}{|\mathcal{R}|}\cdot\mathbb{P}(\mathtt{MP}(f,T)=\mathcal{R}\mid\rho=\alpha).$$

$$RA(X)=P_{\alpha}(X)+\frac{1}{2}\cdot(P_{\alpha\beta}(X)+P_{\alpha\gamma}(X)+P_{\alpha\delta}(X))+\frac{1}{3}\cdot(P_{\alpha\beta\gamma}(X)+P_{\alpha\beta\delta}(X)+P_{\alpha\gamma\delta}(X))+\frac{1}{4}\cdot P_{\alpha\beta\gamma\delta}(X),\tag{2}$$

$$\begin{aligned}
P_{\alpha}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha\}\mid\rho=\alpha),\\
P_{\alpha\beta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta\}\mid\rho=\alpha),\\
P_{\alpha\gamma}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\gamma\}\mid\rho=\alpha),\\
P_{\alpha\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\delta\}\mid\rho=\alpha),\\
P_{\alpha\beta\gamma}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta,\gamma\}\mid\rho=\alpha),\\
P_{\alpha\beta\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta,\delta\}\mid\rho=\alpha),\\
P_{\alpha\gamma\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\gamma,\delta\}\mid\rho=\alpha),\\
P_{\alpha\beta\gamma\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta,\gamma,\delta\}\mid\rho=\alpha).
\end{aligned}$$

$$RA(X)\geq 1-3p.$$

$$\begin{aligned}
&\rho=\alpha\to y_{1}=\alpha\to l=\beta,\\
&\rho=\alpha\to y_{1}=\beta\to l=\beta,\\
&\rho=\alpha\to y_{1}=\gamma\to l=\beta,\\
&\rho=\alpha\to y_{1}=\delta\to l=\beta.
\end{aligned}$$

$$p=p_{i}+p_{i}^{\prime}-4p_{i}p_{i}^{\prime}.$$

$$P_{\alpha\beta}(X)=P_{\alpha\gamma}(X)=P_{\alpha\delta}(X),\tag{4}$$

$$P_{\alpha\beta\gamma}(X)=P_{\alpha\beta\delta}(X)=P_{\alpha\gamma\delta}(X),\tag{5}$$

$$P_{\alpha\beta}(X)=\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta\}\mid\rho=\alpha)=\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\gamma\}\mid\rho=\alpha)=P_{\alpha\gamma}(X).$$

$$RA(X)=P_{\alpha}(X)+\frac{3}{2}\cdot P_{\alpha\beta}(X)+P_{\alpha\beta\gamma}(X)+\frac{1}{4}\cdot P_{\alpha\beta\gamma\delta}(X).\tag{6}$$

$$\begin{aligned}
P_{\beta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\beta\}\mid\rho=\alpha),\\
P_{\gamma}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\gamma\}\mid\rho=\alpha),\\
P_{\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\delta\}\mid\rho=\alpha),\\
P_{\beta\gamma}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\beta,\gamma\}\mid\rho=\alpha),\\
P_{\beta\gamma\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\beta,\gamma,\delta\}\mid\rho=\alpha).
\end{aligned}$$

$$P_{\beta}(X)=P_{\gamma}(X)=P_{\delta}(X).\tag{7}$$

$$\begin{aligned}
P_{\alpha}(X)&\geq P_{\beta}(X),\\
P_{\alpha\beta}(X)&\geq P_{\beta\gamma}(X),\\
P_{\alpha\beta\gamma}(X)&\geq P_{\beta\gamma\delta}(X).
\end{aligned}$$

$$\begin{aligned}
P_{(A)}(Y_{i})={}&(1-3p_{i})\,\mathbb{P}(\mathtt{MP}(f_{Y_{i}},T_{i})=A\mid y_{i}=\alpha)+p_{i}\,\mathbb{P}(\mathtt{MP}(f_{Y_{i}},T_{i})=A\mid y_{i}=\beta)\\
&+p_{i}\,\mathbb{P}(\mathtt{MP}(f_{Y_{i}},T_{i})=A\mid y_{i}=\gamma)+p_{i}\,\mathbb{P}(\mathtt{MP}(f_{Y_{i}},T_{i})=A\mid y_{i}=\delta).
\end{aligned}$$

$$\begin{aligned}
P_{(\alpha)}(Y_{i})&=(1-3p_{i})P_{\alpha}(Y_{i})+3p_{i}P_{\beta}(Y_{i}),&(8)\\
P_{(\beta)}(Y_{i})&=(1-p_{i})P_{\beta}(Y_{i})+p_{i}P_{\alpha}(Y_{i})=P_{(\gamma)}(Y_{i})=P_{(\delta)}(Y_{i}),&(9)\\
P_{(\alpha\beta)}(Y_{i})&=(1-2p_{i})P_{\alpha\beta}(Y_{i})+2p_{i}P_{\beta\gamma}(Y_{i})=P_{(\alpha\gamma)}(Y_{i})=P_{(\alpha\delta)}(Y_{i}),&(10)\\
P_{(\beta\gamma)}(Y_{i})&=(1-2p_{i})P_{\beta\gamma}(Y_{i})+2p_{i}P_{\alpha\beta}(Y_{i})=P_{(\beta\delta)}(Y_{i})=P_{(\gamma\delta)}(Y_{i}),&(11)\\
P_{(\alpha\beta\gamma)}(Y_{i})&=(1-p_{i})P_{\alpha\beta\gamma}(Y_{i})+p_{i}P_{\beta\gamma\delta}(Y_{i})=P_{(\alpha\beta\delta)}(Y_{i})=P_{(\alpha\gamma\delta)}(Y_{i}),&(12)\\
P_{(\beta\gamma\delta)}(Y_{i})&=(1-3p_{i})P_{\beta\gamma\delta}(Y_{i})+3p_{i}P_{\alpha\beta\gamma}(Y_{i}),&(13)\\
P_{(\alpha\beta\gamma\delta)}(Y_{i})&=P_{\alpha\beta\gamma\delta}(Y_{i}).&(14)
\end{aligned}$$

$$P_{\alpha}(X)=\dots+3P_{(\alpha)}(Y_{1})P_{(\alpha\beta\gamma)}(Y_{2})+3P_{(\alpha\beta\gamma)}(Y_{1})P_{(\alpha)}(Y_{2})+\dots$$

Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Fractal and DNA sequence analysis

Full text

On the accuracy of ancestral sequence reconstruction for ultrametric trees with parsimony

Lina Herbst

Mareike Fischer

Institute for Mathematics and Computer Science, Greifswald University, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany

Abstract

We examine a mathematical question concerning the reconstruction accuracy of the Fitch algorithm for reconstructing the ancestral sequence of the most recent common ancestor given a phylogenetic tree and sequence data for all taxa under consideration. In particular, for the symmetric 4-state substitution model which is also known as Jukes-Cantor model, we answer affirmatively a conjecture of Li, Steel and Zhang which states that for any ultrametric phylogenetic tree and a symmetric model, the Fitch parsimony method using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any particular terminal taxon or any particular pair of taxa. This conjecture had so far only been answered for two-state data by Fischer and Thatte. Here, we focus on answering the biologically more relevant case with four states, which corresponds to ancestral sequence reconstruction from DNA or RNA data.

Keywords: Maximum Parsimony, ancestral sequence reconstruction, reconstruction accuracy, symmetric 4-state model

MSC [2010]: 00-01, 99-00

Journal: Journal of Theoretical Biology

1 Introduction

The reconstruction of ancestral sequences, e.g. DNA sequences of common ancestors of present-day species, is an important approach to understanding the evolution and origin of these species [1, 2, 3]. Various methods exist for such reconstructions, e.g. the Fitch algorithm [4, 5, 6], which is based on the Maximum Parsimony criterion. However, how reliable is such a reconstruction?

Several studies analyzed the reliability, the so-called reconstruction accuracy, of the Fitch algorithm for reconstructing ancestral sequence data of the most recent common ancestor given a phylogenetic tree and sequences for all taxa under consideration [1, 7, 8]. It seems intuitive that the root state is more likely to be conserved for taxa that are closer to the root, since over time more sequence changes can occur. Moreover, one might expect that the reconstruction accuracy is highest when all taxa are taken into account, which was also suggested by earlier simulation studies [9]. However, it can be shown that there are cases in which the reconstruction accuracy improves when only a subset of taxa is considered [1, 7]. In particular, the reconstruction accuracy can even improve when a taxon close to the root is ignored [7].

Despite these counterintuitive results, in 2008 Li et al. conjectured that for any rooted binary ultrametric phylogenetic tree (i.e. a tree in which all leaves have the same distance to the root) and a simple model of evolution, the Fitch algorithm using all taxa for ancestral state reconstruction is at least as accurate as using a single taxon [1]. Note that ultrametric trees are also often referred to as clocklike trees or molecular clocks. So the conjecture by Li et al. means that under a molecular clock, the reconstruction accuracy is at least as good as the conservation probability of any taxon. Note that under a molecular clock all taxa have the same conservation probability, and that this conjecture provides a lower bound on the reconstruction accuracy for any rooted binary ultrametric phylogenetic tree under a simple model of evolution. Ignoring all data except the data of one species is the extreme case of throwing information away. Thus, showing that the conjecture holds is good news for Maximum Parsimony as a criterion for ancestral state reconstruction.

In 2009, Fischer and Thatte [7] proved the conjecture for two-state characters, but it remained unclear if it also holds for 4-state data like DNA or RNA. Thus, the aim of this paper is to consider this biologically relevant case with four states. In particular, we answer the conjecture affirmatively. Additionally, we also prove that the conjecture holds for three-state characters. Along the way, we also prove that the Fitch parsimony method applied to all taxa is always at least as good as applied to any pair of taxa if the underlying tree is clocklike. However, we also show that this does not improve the lower bound induced by single leaves.

2 Preliminaries

Before we can present our results, we first have to introduce some basic concepts. Recall that a rooted binary phylogenetic tree on the leaf set $X$ ( $|X|=n\geq 2$ ) is a connected, acyclic graph in which the vertices of degree 1 are called leaves, and in which there is exactly one node $\rho$ of degree 2, which is referred to as root, and all other non-leaf nodes have degree 3. Moreover, in a rooted binary phylogenetic $X$ -tree the leaves are bijectively labelled by the elements of $X$ . Let each vertex of the tree be assigned a state element of a finite state set ${\mathcal{A}}$ with $|{\mathcal{A}}|\geq 2$ . In particular, we are interested in the biologically relevant case with four states, e.g. ${\mathcal{A}}=\{\alpha,\beta,\gamma,\delta\}$ , which corresponds for instance to DNA or RNA data.

The states evolve from $\rho$ by the well-known symmetric $r$ -state model $N_{r}$ with alphabet ${\mathcal{A}}=\{\alpha_{1},\dots,\alpha_{r}\}$ [4]. In this model, a state of ${\mathcal{A}}$ is selected as the root state with probability $\frac{1}{|{\mathcal{A}}|}$ . Assume that $e=(u,v)$ is an edge of the tree, and node $u$ is closer to the root than $v$ . Then in this model, $p_{e}$ is the substitution probability on edge $e$ : it is the probability that $v$ is in some state $\alpha$ under the condition that $u$ is in a distinct state, say, $\beta$ . This is denoted by ${\mathbb{P}}(v=\alpha|u=\beta)$ . The model is supposed to be symmetric, thus $p_{e}={\mathbb{P}}(v=\alpha|u=\beta)={\mathbb{P}}(v=\beta|u=\alpha)$ . Furthermore, we assume that $0\leq p_{e}\leq\frac{1}{|{\mathcal{A}}|}$ , in particular for four states we have $0\leq p_{e}\leq\frac{1}{4}$ . The biologically relevant case with four states, namely the $N_{4}$ -model, is also often referred to as Jukes-Cantor-model [10].

As in [7, 11], we consider ultrametric trees, which biologists often call clocklike trees or molecular clocks. This means that the expected number of substitutions from the root to any leaf is the same [5].

In this manuscript we reconstruct ancestral states by the Maximum Parsimony criterion with the Fitch algorithm, which we briefly explain now. Assume that we have a rooted binary tree with leaf set $X$ . To introduce the Fitch algorithm, we first consider the kind of data we will map onto the leaves of the tree. The data is given by a character on a leaf set $X$ , which is a function $f:X\rightarrow{\mathcal{A}}$ . Thus, each leaf is assigned a character state. Note that as we consider $X=\{1,\dots,n\}$ , we often write $f=f(1)f(2)\dots f(n)$ instead of listing $f(1),\ldots,f(n)$ explicitly.

Then the Fitch algorithm [6] assigns a set of states to all interior vertices by minimizing the number of changes. The algorithm is based on Fitch’s parsimony operation. Therefore, let ${\mathcal{A}}$ be a non-empty finite alphabet and let $A,B\subseteq{\mathcal{A}}$ . Then, Fitch’s parsimony operation $*$ is defined by

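$$A*B\coloneqq\begin{cases}A\cap B,&\text{if }A\cap B\neq\emptyset,\\ A\cup B,&\text{otherwise.}\end{cases}$$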

Using this operation, the Fitch algorithm works as follows. Consider all vertices $v$ , whose two direct descendants have already been assigned a set, say $A$ and $B$ . Then, $v$ is assigned $A*B$ . This step is continued upwards along the tree until the root $\rho$ is assigned a set, which is denoted by ${\mathtt{MP}}(f,T)$ . An example can be seen in Figure 1.

Note that what we call the Fitch algorithm is in fact only one phase of the algorithm, but it is the only part we require to estimate potential root states. For more details we refer to [6].
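The bottom-up phase described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the nested-tuple tree encoding and the names `parsimony_op` and `fitch_root_set` are assumptions made here for the example.

```python
from typing import FrozenSet, Mapping, Tuple, Union

# Assumed encoding: a tree is a leaf label (str) or a pair (left, right) of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]
States = FrozenSet[str]


def parsimony_op(a: States, b: States) -> States:
    """Fitch's operation A * B: the intersection if it is non-empty, else the union."""
    common = a & b
    return common if common else a | b


def fitch_root_set(tree: Tree, character: Mapping[str, str]) -> States:
    """Bottom-up phase only: return the state set MP(f, T) assigned to the root."""
    if isinstance(tree, str):
        # A leaf contributes the singleton set containing its observed state.
        return frozenset({character[tree]})
    left, right = tree
    return parsimony_op(fitch_root_set(left, character),
                        fitch_root_set(right, character))


# Example: four leaves on the tree ((1,2),(3,4)) with character f = ACAG.
example_tree = (("1", "2"), ("3", "4"))
example_char = {"1": "A", "2": "C", "3": "A", "4": "G"}
root_set = fitch_root_set(example_tree, example_char)
```

In this example the two cherries receive $\{A,C\}$ and $\{A,G\}$, whose intersection is non-empty, so the root is reconstructed unambiguously.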

For a 4-state character there are $2^{4}-1=15$ possible sets for each interior vertex: the power set of an alphabet with four elements has cardinality $2^{4}=16$ , and the empty set is excluded. These sets are $\{\alpha\},\{\beta\},\{\gamma\},\{\delta\},\{\alpha,\beta\},\dots,\{\alpha,\beta,\gamma,\delta\}$ .

We say that the Fitch algorithm unambiguously reconstructs the root state if $|{\mathtt{MP}}(f,T)|=1$ . Otherwise the root state is reconstructed ambiguously, i.e. the method cannot decide between different states and therefore $|{\mathtt{MP}}(f,T)|>1$ .

Note that real data usually comes in the form of an alignment, i.e. a sequence of characters, rather than in the form of an individual character. In this case, the Fitch algorithm would consider each character, i.e. each column (“site”) of the alignment, separately. This is why we focus on the case of a single character and its reconstruction accuracy.

3 The accuracy of ancestral sequence reconstruction with 4-state characters

Similar to Li et al., we now define the reconstruction accuracy for all $|{\mathcal{A}}|\geq 2$ [1]. Therefore, let ${\mathtt{MP}}(f,T)$ denote the set of character states chosen by the Fitch algorithm as possible root states when applied to character $f$ on tree $T$ .

Let $\mathcal{R}\subseteq{\mathcal{A}},\alpha\in\mathcal{R}$ and $|\mathcal{R}|\geq 1$ . The probability that the root state $\alpha$ evolves on $T$ to a character $f$ for which the Fitch algorithm assigns $\mathcal{R}$ as possible root state set is given by ${\mathbb{P}}({\mathtt{MP}}(f,T)=\mathcal{R}|\rho=\alpha)$ .

The reconstruction accuracy is then defined by

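$$RA(X)\coloneqq\sum_{\mathcal{R}\subseteq\mathcal{A},\ \alpha\in\mathcal{R}}\frac{1}{|\mathcal{R}|}\cdot\mathbb{P}(\mathtt{MP}(f,T)=\mathcal{R}\mid\rho=\alpha).$$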

To illustrate this definition, consider the case with ${\mathcal{A}}=\{\alpha,\beta,\gamma,\delta\}$ . In this case, the reconstruction accuracy for the Fitch algorithm for ancestral state reconstruction is given by

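$$RA(X)=P_{\alpha}(X)+\frac{1}{2}\cdot(P_{\alpha\beta}(X)+P_{\alpha\gamma}(X)+P_{\alpha\delta}(X))+\frac{1}{3}\cdot(P_{\alpha\beta\gamma}(X)+P_{\alpha\beta\delta}(X)+P_{\alpha\gamma\delta}(X))+\frac{1}{4}\cdot P_{\alpha\beta\gamma\delta}(X),\tag{2}$$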

where we define

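$$\begin{aligned}
P_{\alpha}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha\}\mid\rho=\alpha),\\
P_{\alpha\beta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta\}\mid\rho=\alpha),\\
P_{\alpha\gamma}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\gamma\}\mid\rho=\alpha),\\
P_{\alpha\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\delta\}\mid\rho=\alpha),\\
P_{\alpha\beta\gamma}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta,\gamma\}\mid\rho=\alpha),\\
P_{\alpha\beta\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta,\delta\}\mid\rho=\alpha),\\
P_{\alpha\gamma\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\gamma,\delta\}\mid\rho=\alpha),\\
P_{\alpha\beta\gamma\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta,\gamma,\delta\}\mid\rho=\alpha).
\end{aligned}$$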

The main aim of this manuscript is to show that the reconstruction accuracy for a rooted binary ultrametric phylogenetic tree under the $N_{4}$ -model using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any particular terminal taxon. This provides a lower bound on $RA(X)$ , and is stated in the following theorem.

Theorem 1.

For any rooted binary phylogenetic ultrametric tree and the $N_{4}$ -model, the Fitch algorithm using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any particular terminal taxon, that is

$$RA(X)\geq 1-3p.$$

The proof of Theorem 1 requires some more general properties. Therefore, we first turn our attention to the following. If not stated otherwise, we always consider rooted binary ultrametric phylogenetic trees under the $N_{4}$ -model. Due to the symmetry of the model, we can assume without loss of generality that the root is in state $\alpha$ , so $\alpha$ evolves along the tree to a character $f$ on $X$ . Let $p$ be the probability that from the root to one leaf the state changes from $\alpha$ to one specific state in ${\mathcal{A}}\setminus\{\alpha\}=\{\beta,\gamma,\delta\}$ , i.e. $3p$ is the probability that a given leaf is not in state $\alpha$ .

Therefore, in the case of the $N_{4}$ -model, $1-3p$ is the probability that the root is in the same state as one leaf, since three different changes ( $\alpha\rightarrow\beta,\alpha\rightarrow\gamma,\alpha\rightarrow\delta$ ) can occur. This is at the same time the reconstruction accuracy when only one leaf is taken into account. The main aim of this paper is to show that $1-3p$ is a lower bound for $RA(X)$ ; that is, considering all taxa under a molecular clock is always better than, or as good as, considering just one taxon.
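The bound $RA(X)\geq 1-3p$ can be checked numerically on small trees. The following sketch computes the reconstruction accuracy exactly by propagating the distribution of Fitch sets from the leaves to the root. The tree encoding (each internal node is a pair of `(child, edge_probability)` entries), the alphabet `"ACGT"` with root state `"A"`, and all function names are assumptions made for this illustration; the numbers in the example are arbitrary.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple, Union

STATES = "ACGT"  # four-state alphabet; the root state alpha is taken to be "A"

# Assumed encoding: a leaf is a label (str); an internal node is a pair of
# (child, p_edge) entries, where p_edge is the probability of ONE specific change.
Node = Union[str, Tuple[Tuple["Node", float], Tuple["Node", float]]]
Dist = Dict[FrozenSet[str], float]


def fitch_op(a: FrozenSet[str], b: FrozenSet[str]) -> FrozenSet[str]:
    common = a & b
    return common if common else a | b


def set_dist(node: Node, state: str) -> Dist:
    """Distribution of the Fitch set at `node`, given that `node` is in `state`."""
    if isinstance(node, str):
        return {frozenset({state}): 1.0}
    child_dists = []
    for child, p_edge in node:
        mixed: Dist = {}
        for t in STATES:  # law of total probability over the child's state
            w = 1 - (len(STATES) - 1) * p_edge if t == state else p_edge
            for s, q in set_dist(child, t).items():
                mixed[s] = mixed.get(s, 0.0) + w * q
        child_dists.append(mixed)
    out: Dist = {}
    # The two subtrees evolve independently given the state of `node`.
    for (a, qa), (b, qb) in product(child_dists[0].items(), child_dists[1].items()):
        r = fitch_op(a, b)
        out[r] = out.get(r, 0.0) + qa * qb
    return out


def reconstruction_accuracy(tree: Node, root_state: str = "A") -> float:
    """RA(X): sum over sets R containing the root state of P(MP = R | root) / |R|."""
    return sum(q / len(r) for r, q in set_dist(tree, root_state).items()
               if root_state in r)


# Ultrametric example: p1 = 0.05 to the inner node, p1' = 0.1 to its two leaves,
# so p = p1 + p1' - 4*p1*p1' = 0.13 for every leaf.
cherry = (("1", 0.1), ("2", 0.1))
three_taxa = ((cherry, 0.05), ("3", 0.13))
ra = reconstruction_accuracy(three_taxa)
lower_bound = 1 - 3 * 0.13
```

For this three-taxon tree `ra` should not fall below `lower_bound`, while for the two-taxon tree `cherry` the accuracy coincides with $1-3p$ up to rounding.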

As shown in Figure 2, every binary tree $T$ can be decomposed into two maximal pending subtrees $T_{1}$ and $T_{2}$ with leaf sets $Y_{1}$ and $Y_{2}$ ( $X=Y_{1}\cup Y_{2},Y_{1}\cap Y_{2}=\emptyset$ ). This is the so-called standard decomposition [5]. We denote the children of $\rho$ by $y_{1}$ and $y_{2}$ , and with probability $p_{i}$ one specific change occurs from $\rho$ to $y_{i}$ ( $i\in\{1,2\}$ ). Analogously, one specific change occurs from $y_{i}$ to any leaf with probability $p_{i}^{{}^{\prime}}$ ( $i\in\{1,2\}$ ). Note that $p$ can then be calculated by all possibilities given for one specific change from $\rho$ to any leaf. Suppose that the root is in state $\alpha$ and leaf $l$ in state $\beta$ (without loss of generality we have $l\in Y_{1}$ ). Then there are four different possibilities for a change from $\rho=\alpha$ to $l=\beta$ :

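$$\begin{aligned}
&\rho=\alpha\to y_{1}=\alpha\to l=\beta,\\
&\rho=\alpha\to y_{1}=\beta\to l=\beta,\\
&\rho=\alpha\to y_{1}=\gamma\to l=\beta,\\
&\rho=\alpha\to y_{1}=\delta\to l=\beta.
\end{aligned}$$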

Thus,

$$p=p_{i}+p_{i}^{\prime}-4p_{i}p_{i}^{\prime}.$$

Furthermore, for $i\in\{1,2\}$ we define $P_{i}\coloneqq 1-4p_{i}$ , and similarly $P\coloneqq 1-4p$ .
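As a worked check of the formula for $p$ (a sketch that uses only the edge probabilities $p_{i}$ and $p_{i}^{\prime}$ defined above): the four possible paths from $\rho=\alpha$ to $l=\beta$ have probabilities $(1-3p_{i})p_{i}^{\prime}$ , $p_{i}(1-3p_{i}^{\prime})$ , $p_{i}p_{i}^{\prime}$ and $p_{i}p_{i}^{\prime}$ , so that

$$p=(1-3p_{i})p_{i}^{\prime}+p_{i}(1-3p_{i}^{\prime})+2p_{i}p_{i}^{\prime}=p_{i}+p_{i}^{\prime}-4p_{i}p_{i}^{\prime}.$$

The same computation shows that the quantities $1-4p$ multiply along a path. Writing $P_{i}^{\prime}\coloneqq 1-4p_{i}^{\prime}$ (a symbol introduced here only for illustration), we get

$$P=1-4p=1-4p_{i}-4p_{i}^{\prime}+16p_{i}p_{i}^{\prime}=(1-4p_{i})(1-4p_{i}^{\prime})=P_{i}P_{i}^{\prime}.$$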

Under the model assumptions of the $N_{4}$ -model, due to the symmetry, we have that

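$$P_{\alpha\beta}(X)=P_{\alpha\gamma}(X)=P_{\alpha\delta}(X),\tag{4}$$

$$P_{\alpha\beta\gamma}(X)=P_{\alpha\beta\delta}(X)=P_{\alpha\gamma\delta}(X),\tag{5}$$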

since e.g.

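$$P_{\alpha\beta}(X)=\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\beta\}\mid\rho=\alpha)=\mathbb{P}(\mathtt{MP}(f,T)=\{\alpha,\gamma\}\mid\rho=\alpha)=P_{\alpha\gamma}(X).$$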

Therefore by (2), (4) and (5), $RA(X)$ can be simplified and becomes

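$$RA(X)=P_{\alpha}(X)+\frac{3}{2}\cdot P_{\alpha\beta}(X)+P_{\alpha\beta\gamma}(X)+\frac{1}{4}\cdot P_{\alpha\beta\gamma\delta}(X).\tag{6}$$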

Moreover, we define

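$$\begin{aligned}
P_{\beta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\beta\}\mid\rho=\alpha),\\
P_{\gamma}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\gamma\}\mid\rho=\alpha),\\
P_{\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\delta\}\mid\rho=\alpha),\\
P_{\beta\gamma}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\beta,\gamma\}\mid\rho=\alpha),\\
P_{\beta\gamma\delta}(X)&\coloneqq\mathbb{P}(\mathtt{MP}(f,T)=\{\beta,\gamma,\delta\}\mid\rho=\alpha).
\end{aligned}$$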

Again, by the symmetry of the $N_{4}$ -model, we obtain

$$P_{\beta}(X)=P_{\gamma}(X)=P_{\delta}(X).\tag{7}$$

Biologically this means that under the assumption that $\alpha$ is the true root state, the probability that $\alpha$ evolves to a character for which the Fitch algorithm assigns $\{\beta\}$ to the root is the same as for $\{\gamma\}$ and $\{\delta\}$ , since each specific change occurs with probability $p$ .

This brings us to our next result, where $P_{\alpha}(X),P_{\beta}(X),P_{\alpha\beta}(X),P_{\beta\gamma}(X),P_{\alpha\beta\gamma}(X)$ and $P_{\beta\gamma\delta}(X)$ are linked to each other.

Lemma 1.

For any rooted binary phylogenetic tree and the $N_{4}$ -model we have that

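$$\begin{aligned}
P_{\alpha}(X)&\geq P_{\beta}(X),\\
P_{\alpha\beta}(X)&\geq P_{\beta\gamma}(X),\\
P_{\alpha\beta\gamma}(X)&\geq P_{\beta\gamma\delta}(X).
\end{aligned}$$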

Note that Lemma 1 does not require the underlying tree to be ultrametric.

The proof of Lemma 1 is by induction on $n$ and is presented in the appendix. For this proof and also for the proof of Theorem 1 we state some recursions required for the induction. Therefore, we define $f_{Y_{i}}$ as a restriction of $f$ to $Y_{i}\subseteq X$ for $i\in\{1,2\}$ : $f_{Y_{i}}\coloneqq f|_{Y_{i}}$ . For $i\in\{1,2\}$ the probability $P_{(A)}(Y_{i})$ to obtain a set $A\in\{\{\alpha\},\{\beta\},\{\alpha,\beta\},\{\beta,\gamma\},\{\alpha,\beta,\gamma\},\{\beta,\gamma,\delta\},\{\alpha,\beta,\gamma,\delta\}\}$ as estimate state for $y_{i}$ with the Fitch algorithm under the assumption that $\rho$ is in state $\alpha$ can be defined using the law of total probability:

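$$\begin{aligned}
P_{(A)}(Y_{i})={}&(1-3p_{i})\,\mathbb{P}(\mathtt{MP}(f_{Y_{i}},T_{i})=A\mid y_{i}=\alpha)+p_{i}\,\mathbb{P}(\mathtt{MP}(f_{Y_{i}},T_{i})=A\mid y_{i}=\beta)\\
&+p_{i}\,\mathbb{P}(\mathtt{MP}(f_{Y_{i}},T_{i})=A\mid y_{i}=\gamma)+p_{i}\,\mathbb{P}(\mathtt{MP}(f_{Y_{i}},T_{i})=A\mid y_{i}=\delta).
\end{aligned}$$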

Then with (4),(5),(7) we have:

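$$\begin{aligned}
P_{(\alpha)}(Y_{i})&=(1-3p_{i})P_{\alpha}(Y_{i})+3p_{i}P_{\beta}(Y_{i}),&(8)\\
P_{(\beta)}(Y_{i})&=(1-p_{i})P_{\beta}(Y_{i})+p_{i}P_{\alpha}(Y_{i})=P_{(\gamma)}(Y_{i})=P_{(\delta)}(Y_{i}),&(9)\\
P_{(\alpha\beta)}(Y_{i})&=(1-2p_{i})P_{\alpha\beta}(Y_{i})+2p_{i}P_{\beta\gamma}(Y_{i})=P_{(\alpha\gamma)}(Y_{i})=P_{(\alpha\delta)}(Y_{i}),&(10)\\
P_{(\beta\gamma)}(Y_{i})&=(1-2p_{i})P_{\beta\gamma}(Y_{i})+2p_{i}P_{\alpha\beta}(Y_{i})=P_{(\beta\delta)}(Y_{i})=P_{(\gamma\delta)}(Y_{i}),&(11)\\
P_{(\alpha\beta\gamma)}(Y_{i})&=(1-p_{i})P_{\alpha\beta\gamma}(Y_{i})+p_{i}P_{\beta\gamma\delta}(Y_{i})=P_{(\alpha\beta\delta)}(Y_{i})=P_{(\alpha\gamma\delta)}(Y_{i}),&(12)\\
P_{(\beta\gamma\delta)}(Y_{i})&=(1-3p_{i})P_{\beta\gamma\delta}(Y_{i})+3p_{i}P_{\alpha\beta\gamma}(Y_{i}),&(13)\\
P_{(\alpha\beta\gamma\delta)}(Y_{i})&=P_{\alpha\beta\gamma\delta}(Y_{i}).&(14)
\end{aligned}$$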
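The recursions (8) to (14) can be sanity-checked numerically: they should preserve total probability once each representative is weighted by the number of state sets it stands for under the symmetries (4), (5) and (7). The sketch below does this; the dictionary keys (`"a"` for $\{\alpha\}$ , `"bc"` for $\{\beta,\gamma\}$ , and so on) and the function names are assumptions chosen for this illustration.

```python
# Number of state sets represented by each symmetry class when the root is alpha:
# {a}, 3 x {b}, 3 x {a,b}, 3 x {b,c}, 3 x {a,b,c}, {b,c,d}, {a,b,c,d}  (15 in total).
MULTIPLICITY = {"a": 1, "b": 3, "ab": 3, "bc": 3, "abc": 3, "bcd": 1, "abcd": 1}


def push_through_edge(P, p_i):
    """Recursions (8)-(14): map P_A(Y_i), conditioned on y_i = alpha,
    to P_(A)(Y_i), conditioned on rho = alpha."""
    return {
        "a":    (1 - 3 * p_i) * P["a"]   + 3 * p_i * P["b"],
        "b":    (1 - p_i)     * P["b"]   + p_i     * P["a"],
        "ab":   (1 - 2 * p_i) * P["ab"]  + 2 * p_i * P["bc"],
        "bc":   (1 - 2 * p_i) * P["bc"]  + 2 * p_i * P["ab"],
        "abc":  (1 - p_i)     * P["abc"] + p_i     * P["bcd"],
        "bcd":  (1 - 3 * p_i) * P["bcd"] + 3 * p_i * P["abc"],
        "abcd": P["abcd"],
    }


def total_mass(P):
    """Total probability over all 15 non-empty state sets."""
    return sum(MULTIPLICITY[k] * P[k] for k in MULTIPLICITY)


# Example input: a cherry whose two leaf edges each have change probability 0.1,
# so P_a = 0.7**2, P_b = 0.1**2, P_ab = 2*0.7*0.1, P_bc = 2*0.1*0.1.
cherry = {"a": 0.49, "b": 0.01, "ab": 0.14, "bc": 0.02,
          "abc": 0.0, "bcd": 0.0, "abcd": 0.0}
pushed = push_through_edge(cherry, 0.05)
```

Both `total_mass(cherry)` and `total_mass(pushed)` should equal 1 up to floating-point rounding, which is a quick way to catch a mistyped coefficient in one of the recursions.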

With (8), (9), (10), (11), (12), (13) and (14) we therefore have

[TABLE]

As stated before, all these recursions are needed for the proof of Lemma 1 and Theorem 1. Now, we are in the position to prove Theorem 1, our main result, which states a lower bound on $RA(X)$ .

Proof.

The proof is by induction on $n$ . In order to show $RA(X)\geq 1-3p$ , we define $D(X)\coloneqq RA(X)-(1-3p)$ , and show that $D(X)$ is non-negative.

For $n=2$ the subtrees $Y_{1}$ and $Y_{2}$ both contain one leaf, and thus

[TABLE]

This shows that $RA(X)=1-3p$ , so $D(X)=RA(X)-(1-3p)=0$ is non-negative, which completes the base case of the induction.
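For concreteness, the base case can be worked out by hand. This is a sketch that assumes the simplified form $RA(X)=P_{\alpha}(X)+\frac{3}{2}P_{\alpha\beta}(X)+P_{\alpha\beta\gamma}(X)+\frac{1}{4}P_{\alpha\beta\gamma\delta}(X)$ , which follows from the symmetry of the model. With two leaves, the Fitch set is $\{\alpha\}$ only if both leaves are in state $\alpha$ , and it is $\{\alpha,\beta\}$ only if one leaf is in state $\alpha$ and the other in state $\beta$ ; sets with three or four elements cannot occur. Hence

$$P_{\alpha}(X)=(1-3p)^{2},\qquad P_{\alpha\beta}(X)=2p(1-3p),\qquad P_{\alpha\beta\gamma}(X)=P_{\alpha\beta\gamma\delta}(X)=0,$$

and therefore

$$RA(X)=(1-3p)^{2}+\frac{3}{2}\cdot 2p(1-3p)=(1-3p)(1-3p+3p)=1-3p.$$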

Now, we show by induction that $D(X)$ is non-negative. Suppose that $T$ has $n$ taxa and that $D(X)$ is non-negative for all trees having fewer than $n$ taxa. We define $D_{i}\coloneqq D(Y_{i})=RA(Y_{i})-(1-3p_{i}^{{}^{\prime}})$ for $i\in\{1,2\}$ . Thus, $D_{1}$ and $D_{2}$ are non-negative since $Y_{1}$ and $Y_{2}$ both contain fewer than $n$ taxa.

By elementary term conversion we can show that

[TABLE]

The exact conversions can be found in the appendix.

Moreover, note that $P_{i},P_{(\alpha)}(Y_{i}),P_{(\alpha\beta)}(Y_{i}),P_{(\alpha\beta\gamma)}(Y_{i}),P_{(\alpha\beta\gamma\delta)}(Y_{i})$ are all probabilities and therefore are all non-negative for $i\in\{1,2\}$ . By Lemma 1 we have that (for $i\in\{1,2\}$ ) $P_{\alpha\beta}(Y_{i})-P_{\beta\gamma}(Y_{i})$ and $P_{\alpha\beta\gamma}(Y_{i})-P_{\beta\gamma\delta}(Y_{i})$ are non-negative, resulting in (19) being non-negative. This implies $D(X)\geq 0$ and thus $RA(X)\geq 1-3p$ . This completes the proof. ∎

We have shown that the reconstruction accuracy using all terminal taxa is always greater than or equal to the conservation probability of a single taxon. Moreover, the base case of the proof of Theorem 1 provides more insight into the reconstruction accuracy for 2-taxon trees under the $N_{4}$ -model.

Corollary 1.

Let $T$ be a rooted binary ultrametric phylogenetic tree on taxon set $X$ with $|X|=2$ . Let $p$ denote the probability of change from the root to any leaf under the $N_{4}$ -model. Then, the reconstruction accuracy for ancestral state reconstruction using the Fitch algorithm is given by

$$RA(X)=1-3p.$$

Corollary 1 states the reconstruction accuracy for ancestral state reconstruction with the Fitch algorithm on ultrametric 2-taxon trees, which equals the reconstruction accuracy obtained from a single terminal taxon. In the following proposition we show that the reconstruction accuracy of the Fitch algorithm using any two terminal taxa of a taxon set $X$ is also $1-3p$ .

Proposition 1.

For any rooted binary phylogenetic ultrametric tree and the $N_{4}$ -model, the reconstruction accuracy for the Fitch algorithm using any two terminal taxa $x_{1},x_{2}\in X$ for ancestral state reconstruction is given by

$$RA(\{x_{1},x_{2}\})=1-3p.$$

Proof.

Let $x_{1},x_{2}\in X$ be two terminal taxa of any rooted binary ultrametric phylogenetic tree $T$ . Moreover, we consider the standard decomposition of $T$ into its two maximal pending subtrees $T_{1}$ and $T_{2}$ as depicted in Figure 2. Thus, the proof is divided into two cases.

In the first case we have without loss of generality $x_{1}\in Y_{1}$ and $x_{2}\in Y_{2}$ . By Corollary 1 the reconstruction accuracy using $x_{1}$ and $x_{2}$ is then $RA(\{x_{1},x_{2}\})=1-3p$ .

In the second case we have either $x_{1},x_{2}\in Y_{1}$ or $x_{1},x_{2}\in Y_{2}$ . Thus, without loss of generality we consider $x_{1},x_{2}\in Y_{1}$ as depicted in Figure 3. Let $y$ be the last common ancestor of $x_{1}$ and $x_{2}$ , i.e. the first node that occurs both on the path from $x_{1}$ to $\rho$ as well as on the path from $x_{2}$ to $\rho$ . Let $\widehat{T}$ be the subtree of $T_{1}$ that consists of the paths from $y$ to $x_{1}$ and $x_{2}$ , respectively, as well as all vertices which lie on one of these paths. $\widehat{T}$ is depicted with dotted lines in Figure 3. Thus, the root of $\widehat{T}$ is $y$ . In addition, let $\overline{p}$ be the probability for one specific change from $\rho$ to $y$ , and let $\widehat{p}$ be the probability for one specific change from $y$ to $x_{1}$ or $x_{2}$ .

By (6) we have

[TABLE]

Note that $P_{\alpha\beta\gamma}(\{x_{1},x_{2}\})=P_{\alpha\beta\gamma\delta}(\{x_{1},x_{2}\})=0$ since we cannot obtain sets with more than two elements with the Fitch algorithm when only $x_{1}$ and $x_{2}$ are used for the reconstruction.

In the following, we use the notation $f|_{\{x_{1},x_{2}\}}$ for the restriction of character $f$ on taxa $x_{1}$ and $x_{2}$ .

Furthermore, we have

[TABLE]

Moreover,

[TABLE]

Thus by (21) and (22), (20) becomes

[TABLE]

Therefore, in both cases $RA(\{x_{1},x_{2}\})=1-3p$ which completes the proof. ∎
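An alternative way to see why the value $1-3p$ appears in both cases is the following shortcut (a sketch added for illustration, not the computation used in the proof above). With only two leaves, the Fitch set is $\{f(x_{1})\}$ if the leaves agree and $\{f(x_{1}),f(x_{2})\}$ otherwise. The root state $\alpha$ is therefore credited with weight $1$ when both leaves show $\alpha$ and with weight $\frac{1}{2}$ when exactly one of them does:

$$\begin{aligned}
RA(\{x_{1},x_{2}\})&={\mathbb{P}}(x_{1}=\alpha,x_{2}=\alpha\mid\rho=\alpha)+\frac{1}{2}\Big({\mathbb{P}}(x_{1}=\alpha,x_{2}\neq\alpha\mid\rho=\alpha)+{\mathbb{P}}(x_{1}\neq\alpha,x_{2}=\alpha\mid\rho=\alpha)\Big)\\
&=\frac{1}{2}\Big({\mathbb{P}}(x_{1}=\alpha\mid\rho=\alpha)+{\mathbb{P}}(x_{2}=\alpha\mid\rho=\alpha)\Big)=\frac{1}{2}\big((1-3p)+(1-3p)\big)=1-3p,
\end{aligned}$$

where the last step uses that under a molecular clock every leaf is in the root state with probability $1-3p$ . The joint distribution of the two leaves does not matter, which is why the position of their last common ancestor plays no role in the result.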

This proposition provides us the reconstruction accuracy for the Fitch algorithm when any two terminal taxa are considered. Note that this reconstruction accuracy is the same as when only one terminal taxon is taken into account. Therefore, by Theorem 1 and Proposition 1 we have the following corollary, which states that the lower bound on the reconstruction accuracy holds for any two terminal taxa. In particular, considering two taxa rather than one cannot improve the lower bound given by Theorem 1.

Corollary 2.

For any rooted binary phylogenetic ultrametric tree and the $N_{4}$ -model, the Fitch algorithm using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any two terminal taxa, that is

[TABLE]

This statement completes Section 3. We now take a look at similar results obtained for the $N_{3}$ -model.

4 The accuracy of ancestral sequence reconstruction with 3-state characters

Under the same assumptions as for the 4-state model, similar results can be obtained for the 3-state alphabet ${\mathcal{A}}=\{\alpha,\beta,\gamma\}$ . In this case, the reconstruction accuracy is given by

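$$RA(X)=P_{\alpha}(X)+\frac{1}{2}\cdot(P_{\alpha\beta}(X)+P_{\alpha\gamma}(X))+\frac{1}{3}\cdot P_{\alpha\beta\gamma}(X).$$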

Then Theorem 2 and Lemma 2 can be formulated similarly to the statements before. Both proofs are left out, since they can be done analogously. However, we want to emphasize that the conjecture stated by Li et al. also holds for the $N_{3}$ -model.

Theorem 2.

For any rooted binary phylogenetic ultrametric tree and the $N_{3}$ -model, the Fitch algorithm using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any particular terminal taxon, that is

$$RA(X)\geq 1-2p.$$

By Theorem 2, a lower bound on $RA(X)$ for rooted binary ultrametric phylogenetic trees is also given for ${\mathcal{A}}=\{\alpha,\beta,\gamma\}$ .

Note that the analogs of Lemma 1, Corollary 1, Proposition 1 and Corollary 2 also hold under the $N_{3}$ -model. In particular, the reconstruction accuracy for ultrametric trees is then at least $1-2p$ . The exact statements and their proofs can be found in the appendix.

5 Conclusion and Discussion

In this paper we considered the reconstruction accuracy of the Fitch algorithm for ancestral state reconstruction. In particular, we analyzed rooted binary ultrametric phylogenetic trees under the $N_{4}$ -model. For an ultrametric tree the probability of a change from the root to any leaf is the same. For such trees, we investigated a lower bound on the reconstruction accuracy by answering affirmatively the conjecture by Li, Steel and Zhang, which stated that for rooted binary ultrametric phylogenetic trees under the symmetric $N_{r}$ -model the reconstruction accuracy using all terminal taxa is at least as high as the conservation probability of any leaf. In 2009, Fischer and Thatte had already shown that this conjecture holds for two-state characters, but it remained unknown whether this result could be extended to three or more character states. In particular, the biologically relevant case of $r=4$ , which corresponds to the DNA- or RNA-alphabet, remained unclear.

The main result of this manuscript is the proof of the conjecture for $r=4$ , which provides a lower bound on the reconstruction accuracy. As mentioned before, the conjecture also holds for the $N_{3}$ -model. In the past, several studies showed that in some cases, the Fitch algorithm provides better results when some data are disregarded [1, 7]. This led to a critical view on Maximum Parsimony as a method for ancestral state reconstruction. But as we have shown here, at least for ultrametric trees, the extreme case of disregarding all data except for one or two leaves can never improve the reconstruction accuracy of the Fitch algorithm. In this sense, our results are good news for Maximum Parsimony as a method for ancestral state reconstruction.

To conclude, the generalization to the $N_{r}$ -model for $r>4$ is still open, but we conjecture that it also holds.

6 Appendix

Proof of Lemma 1 To prove Lemma 1 we show that for any rooted binary phylogenetic tree $T$ under a symmetric 4-state substitution model

[TABLE]

by induction on $n$ . For $n=2$ the subtrees $Y_{1}$ and $Y_{2}$ both contain one leaf, and hence $p=p_{1}=p_{2}$ leads to

[TABLE]

Therefore

[TABLE]

Moreover

[TABLE]

and

[TABLE]

which completes the base case of the induction. For the inductive step we first state some more recursions using (9), (10), (11), (12) and (13):

[TABLE]

Moreover we have that for $i\in\{1,2\}$

[TABLE]

and thus

[TABLE]

In the same manner by (10), (11), (12) and (13) we can see that

[TABLE]

Therefore

[TABLE]

Additionally we have the following: choose sets $A_{1},A_{2}$ from $\{\{\alpha\},\{\alpha\beta\},\{\alpha\beta\gamma\}\}$ and $B_{1},B_{2}$ from $\{\{\beta\},\{\beta\gamma\},\{\beta\gamma\delta\}\}$ such that for $i\in\{1,2\}$ $|A_{i}|=|B_{i}|$ , respectively. Then we have that

[TABLE]

Now suppose that $T$ has $n$ taxa and that (24), (25) and (26) are true for all trees having fewer than $n$ taxa. Note that therefore (36) is non-negative, since $Y_{1}$ and $Y_{2}$ both contain fewer than $n$ taxa. Then

[TABLE]

By (36) and the inductive assumption this term is non-negative, and therefore concludes the proof for $P_{\alpha}(X)\geq P_{\beta}(X)$ . We now proceed with the second part of Lemma 1.

[TABLE]

Again by (36) and the inductive assumption this term is non-negative, and therefore concludes the proof for $P_{\alpha\beta}(X)\geq P_{\beta\gamma}(X)$ . Moreover we have

[TABLE]

By (36) and the inductive assumption $P_{\alpha\beta\gamma}(X)-P_{\beta\gamma\delta}(X)$ is non-negative, and therefore concludes the proof of the last part of Lemma 1. ∎

Extension to the proof of Theorem 1 First of all we state some equations for $i\in\{1,2\}$ which help to show (19).

[TABLE]

By (37) we have that

[TABLE]

and

[TABLE]

Furthermore, the following expressions can be simplified by (6), (38) and (39).

[TABLE]

Moreover,

[TABLE]

Additionally,

[TABLE]

and

[TABLE]

Furthermore,

[TABLE]

and

[TABLE]

By using the simplifications stated before we can now rewrite $RA(X)$ .

[TABLE]

Then

[TABLE]

Lemma 2.

For any rooted binary phylogenetic tree and the $N_{3}$ -model we have that

[TABLE]

Note that Lemma 2 also does not require the underlying tree to be ultrametric.

Proof.

To prove Lemma 2 we show that for any rooted binary phylogenetic tree $T$ under a symmetric 3-state substitution model

[TABLE]

by induction on $n$ . For $n=2$ the subtrees $Y_{1}$ and $Y_{2}$ both contain one leaf, and hence $p=p_{1}=p_{2}$ leads to

[TABLE]

Therefore

[TABLE]

Moreover

[TABLE]

which completes the base case of the induction. For the inductive step we first define some recursions similar to (8), (9), (10), (11) and (12):

[TABLE]

With (48), (49), (50), (51) and (52) we therefore have:

[TABLE]

Moreover we have that for $i\in\{1,2\}$

[TABLE]

and thus

[TABLE]

In the same manner by (50) and (51) we can see that

[TABLE]

Therefore

[TABLE]

Additionally we have the following: choose sets $A_{1},A_{2}$ from $\{\{\alpha\},\{\alpha\beta\}\}$ and $B_{1},B_{2}$ from $\{\{\beta\},\{\beta\gamma\}\}$ such that for $i\in\{1,2\}$ $|A_{i}|=|B_{i}|$ , respectively. Then we have that

[TABLE]

Now suppose that $T$ has $n$ taxa and that (46) and (47) are true for all trees having fewer than $n$ taxa. Note that therefore (61) is non-negative, since $Y_{1}$ and $Y_{2}$ both contain fewer than $n$ taxa. Then

[TABLE]

By the inductive assumption, this term is non-negative, which concludes the proof of $P_{\alpha}(X)\geq P_{\beta}(X)$. We now proceed with the second part of Lemma 2.

[TABLE]

By the inductive assumption, $P_{\alpha\beta}(X)-P_{\beta\gamma}(X)$ is non-negative, which concludes the proof of the second part of Lemma 2. ∎
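The two inequalities of Lemma 2 can be checked numerically on a small example. The following is a minimal sketch, not the authors' recursions: it assumes the root is in state $\alpha$ (written `"a"`), assumes each edge carries its own probability `p` of one specific change, and uses a hypothetical nested-tuple tree encoding. The three-leaf tree and its edge probabilities are illustrative and deliberately not ultrametric.

```python
from collections import defaultdict
from fractions import Fraction

STATES = "abc"  # stand-ins for alpha, beta, gamma (N_3-model)

def fitch(A, B):
    """Fitch parsimony operation: intersection if non-empty, else union."""
    return (A & B) or (A | B)

def set_distribution(node, state):
    """Distribution of the Fitch set at `node`, given the true state at `node`.

    Hypothetical encoding: a leaf is (), an inner node is a pair of edges,
    and each edge is (p, child) with p the probability of one specific change.
    """
    if not node:
        return {frozenset(state): Fraction(1)}
    per_child = []
    for p, child in node:
        dist = defaultdict(Fraction)
        for s in STATES:
            weight = 1 - 2 * p if s == state else p
            for fset, q in set_distribution(child, s).items():
                dist[fset] += weight * q
        per_child.append(dist)
    left, right = per_child
    out = defaultdict(Fraction)
    for A, qa in left.items():
        for B, qb in right.items():
            out[fitch(A, B)] += qa * qb
    return out

F = Fraction
leaf = ()
# Illustrative, non-ultrametric tree with three leaves (values are assumptions).
tree = ((F(1, 10), leaf), (F(1, 5), ((F(1, 4), leaf), (F(3, 10), leaf))))

dist = set_distribution(tree, "a")
P = lambda s: dist[frozenset(s)]
print(P("a") >= P("b"), P("ab") >= P("bc"))
```

Exact rational arithmetic via `fractions.Fraction` avoids rounding questions when two probabilities are close. The check covers one tree only and is no substitute for the induction above.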

Corollary 3.

Let $T$ be a rooted binary ultrametric phylogenetic tree on taxon set $X$ with $|X|=2$. Let $p$ denote the probability of change from the root to any leaf under the $N_{3}$-model. Then, the reconstruction accuracy for ancestral state reconstruction using the Fitch algorithm is given by

$$RA(X)=1-2p.$$
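The value $1-2p$ that the proof of Proposition 2 attributes to Corollary 3 can be verified by enumerating all leaf patterns of a two-leaf tree. The sketch below assumes that $p$ is the probability of one specific change from the root to a leaf (so a leaf keeps the root state with probability $1-2p$), that the root is in state $\alpha$ (written `"a"`; the other states follow by symmetry), and that Fitch breaks ties uniformly at random. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

STATES = "abc"  # stand-ins for alpha, beta, gamma

def fitch(A, B):
    """Fitch parsimony operation: intersection if non-empty, else union."""
    return (A & B) or (A | B)

def two_leaf_accuracy(p):
    """Exact reconstruction accuracy for two leaves, root state fixed to 'a'."""
    def leaf_prob(state):
        return 1 - 2 * p if state == "a" else p

    total = Fraction(0)
    for x1, x2 in product(STATES, repeat=2):
        root_set = fitch(frozenset(x1), frozenset(x2))
        if "a" in root_set:
            # A state is drawn uniformly from the reconstructed set.
            total += leaf_prob(x1) * leaf_prob(x2) / len(root_set)
    return total

p = Fraction(1, 10)
print(two_leaf_accuracy(p), 1 - 2 * p)
```

The two printed values agree because $(1-2p)^2 + \tfrac{1}{2}\cdot 4p(1-2p) = 1-2p$.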

Proposition 2.

For any rooted binary ultrametric phylogenetic tree and the $N_{3}$-model, the reconstruction accuracy for the Fitch algorithm using any two terminal taxa $x_{1},x_{2}\in X$ for ancestral state reconstruction is given by

$$RA(\{x_{1},x_{2}\})=1-2p.$$

Proof.

Let $x_{1},x_{2}\in X$ be two terminal taxa of any rooted binary ultrametric phylogenetic tree $T$. Moreover, we consider the standard decomposition of $T$ into its two maximal pending subtrees $T_{1}$ and $T_{2}$ as depicted in Figure 2. The proof is divided into two cases.

In the first case we have, without loss of generality, $x_{1}\in Y_{1}$ and $x_{2}\in Y_{2}$. By Corollary 3, the reconstruction accuracy using $x_{1}$ and $x_{2}$ is then $RA(\{x_{1},x_{2}\})=1-2p$.

In the second case we have either $x_{1},x_{2}\in Y_{1}$ or $x_{1},x_{2}\in Y_{2}$. Without loss of generality, we consider $x_{1},x_{2}\in Y_{1}$ as depicted in Figure 3. Let $y$ be the last common ancestor of $x_{1}$ and $x_{2}$, i.e. the first node that lies both on the path from $x_{1}$ to $\rho$ and on the path from $x_{2}$ to $\rho$. Let $\widehat{T}$ be the subtree of $T_{1}$ consisting of the paths from $y$ to $x_{1}$ and from $y$ to $x_{2}$, so that the root of $\widehat{T}$ is $y$. $\widehat{T}$ is depicted with dotted lines in Figure 3. In addition, let $\overline{p}$ be the probability of one specific change from $\rho$ to $y$, and let $\widehat{p}$ be the probability of one specific change from $y$ to $x_{1}$ or $x_{2}$. By (23) we have

[TABLE]

Note that $P_{\alpha\beta\gamma}(\{x_{1},x_{2}\})=0$ since we cannot obtain sets with more than two elements with the Fitch algorithm when only $x_{1}$ and $x_{2}$ are used for the reconstruction.

In the following, we use the notation $f|_{\{x_{1},x_{2}\}}$ for the restriction of character $f$ to the taxa $x_{1}$ and $x_{2}$.

Furthermore, we have

[TABLE]

Moreover,

[TABLE]

Thus, by (63) and (64), (62) becomes

[TABLE]

Therefore, in both cases $RA(\{x_{1},x_{2}\})=1-2p$, which completes the proof. ∎
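The second case of the proof can be illustrated by enumerating the states of $y$, $x_{1}$ and $x_{2}$ directly. This sketch makes two assumptions beyond the text: the root $\rho$ is fixed to state $\alpha$ (written `"a"`), and the probability $p$ of one specific change from $\rho$ to a leaf is composed from $\overline{p}$ and $\widehat{p}$ as $p=\overline{p}+\widehat{p}-3\overline{p}\widehat{p}$, which follows from summing over the three possible states of $y$ in a symmetric 3-state model. The numerical values are arbitrary.

```python
from fractions import Fraction
from itertools import product

STATES = "abc"  # stand-ins for alpha, beta, gamma

def step(p, src, dst):
    """Transition probability along a path with specific-change probability p."""
    return 1 - 2 * p if src == dst else p

def pair_accuracy(p_bar, p_hat):
    """Exact RA({x1, x2}) when x1 and x2 both descend from y; root state 'a'."""
    total = Fraction(0)
    for y, x1, x2 in product(STATES, repeat=3):
        weight = step(p_bar, "a", y) * step(p_hat, y, x1) * step(p_hat, y, x2)
        A, B = frozenset(x1), frozenset(x2)
        fitch_set = (A & B) or (A | B)  # Fitch operation on the two leaves
        if "a" in fitch_set:
            total += weight / len(fitch_set)  # uniform tie-breaking
    return total

p_bar, p_hat = Fraction(1, 10), Fraction(1, 8)
p = p_bar + p_hat - 3 * p_bar * p_hat  # assumed composition of the two paths
print(pair_accuracy(p_bar, p_hat), 1 - 2 * p)
```

Both printed values coincide, in line with the statement that the position of $y$ does not affect $RA(\{x_{1},x_{2}\})$.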

Corollary 4.

For any rooted binary ultrametric phylogenetic tree and the $N_{3}$-model, the Fitch algorithm using all terminal taxa is more accurate, or at least as accurate, for ancestral state reconstruction than using any two terminal taxa, that is

$$RA(X)\geq RA(\{x_{1},x_{2}\}).$$
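As a concrete instance of Corollary 4, the accuracy with all taxa can be compared with the two-taxon value $1-2p$ on a balanced ultrametric tree with four leaves. The sketch below is a brute-force enumeration under stated assumptions: the root is in state $\alpha$ (written `"a"`), both inner edges have specific-change probability `p_bar`, all four pendant edges have `p_hat`, and the root-to-leaf probability is composed as $p=\overline{p}+\widehat{p}-3\overline{p}\widehat{p}$. The tree shape and the numbers are illustrative.

```python
from fractions import Fraction
from itertools import product

STATES = "abc"  # stand-ins for alpha, beta, gamma

def step(p, src, dst):
    """Transition probability along an edge with specific-change probability p."""
    return 1 - 2 * p if src == dst else p

def fitch(A, B):
    """Fitch parsimony operation: intersection if non-empty, else union."""
    return (A & B) or (A | B)

def all_taxa_accuracy(p_bar, p_hat):
    """Exact RA(X) for the tree ((x1,x2),(x3,x4)) with root state 'a'."""
    total = Fraction(0)
    for u, v, x1, x2, x3, x4 in product(STATES, repeat=6):
        weight = (step(p_bar, "a", u) * step(p_bar, "a", v)
                  * step(p_hat, u, x1) * step(p_hat, u, x2)
                  * step(p_hat, v, x3) * step(p_hat, v, x4))
        left = fitch(frozenset(x1), frozenset(x2))
        right = fitch(frozenset(x3), frozenset(x4))
        root_set = fitch(left, right)
        if "a" in root_set:
            total += weight / len(root_set)  # uniform tie-breaking
    return total

p_bar, p_hat = Fraction(1, 10), Fraction(1, 8)
p = p_bar + p_hat - 3 * p_bar * p_hat  # assumed root-to-leaf composition
ra_all = all_taxa_accuracy(p_bar, p_hat)
ra_pair = 1 - 2 * p
print(float(ra_all), float(ra_pair), ra_all >= ra_pair)
```

A single tree cannot confirm the corollary in general; the enumeration only shows the inequality holding for this choice of parameters.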

Bibliography

[1] G. Li, M. Steel, L. Zhang, More taxa are not necessarily better for the reconstruction of ancestral character states, Systematic Biology 57 (4) (2008) 647–653. doi:10.1080/10635150802203898.
[2] D. A. Liberles (ed.), Ancestral Sequence Reconstruction, Oxford University Press, 2007.
[3] J. Yang, J. Li, L. Dong, S. Grünewald, Analysis on the reconstruction accuracy of the Fitch method for inferring ancestral states, BMC Bioinformatics 12 (18). doi:10.1186/1471-2105-12-18.
[4] C. Tuffley, M. Steel, Links between maximum likelihood and maximum parsimony under a simple model of site substitution, Bulletin of Mathematical Biology 59 (3) (1997) 581–607. doi:10.1007/BF02459467.
[5] C. Semple, M. Steel, Phylogenetics, Oxford Lecture Series in Mathematics and its Applications, 2003.
[6] W. M. Fitch, Toward defining the course of evolution: Minimum change for a specific tree topology, Systematic Zoology 20 (4) (1971) 406–416. doi:10.2307/2412116.
[7] M. Fischer, B. D. Thatte, Maximum parsimony on subsets of taxa, Journal of Theoretical Biology 260 (2) (2009) 290–293. doi:10.1016/j.jtbi.2009.06.010.
[8] L. Zhang, J. Shen, J. Yang, G. Li, Analyzing the Fitch method for reconstructing ancestral states on ultrametric phylogenetic trees, Bulletin of Mathematical Biology 72 (2010) 1760–1782. doi:10.1007/s11538-010-9505-8.
