Combinatorial properties of phylogenetic diversity indices

Kristina Wicke; Mike Steel

arXiv:1902.02463·q-bio.PE·October 4, 2019


TL;DR

This paper explores the mathematical properties and relationships of phylogenetic diversity indices, focusing on Fair Proportion and Equal Splits, their equivalence, and extensions to unrooted trees, with implications for evolutionary heritage measurement.

Contribution

It characterizes when FP and ES indices differ or are identical, and examines their relationship with the Shapley Value on unrooted trees, introducing new analogues.

Findings

01

FP and ES can differ depending on tree shape

02

FP is equivalent to the Shapley Value on rooted trees

03

New indices related to Pauplin representation are introduced

Abstract

Phylogenetic diversity indices provide a formal way to apportion 'evolutionary heritage' across species. Two natural diversity indices are Fair Proportion (FP) and Equal Splits (ES). FP is also called 'evolutionary distinctiveness' and, for rooted trees, is identical to the Shapley Value (SV), which arises from cooperative game theory. In this paper, we investigate the extent to which FP and ES can differ, characterise tree shapes on which the indices are identical, and study the equivalence of FP and SV and its implications in more detail. We also define and investigate analogues of these indices on unrooted trees (where SV was originally defined), including an index that is closely related to the Pauplin representation of phylogenetic diversity.


Tables

Table 1: Coefficients $\gamma_{T}(i,e)$ used in the calculation of $\varphi(i)=\sum_{e}\gamma_{T}(i,e)l(e)$, where $i$ is a leaf and $e$ is an edge of $T$. Moreover, $\mu(i,e)$ is as in Eqn. (19), $c(i,e)$ denotes the number of leaves on the same side of $e$ as leaf $i$ and $f(i,e)$ denotes the number of leaves on the other side of $e$.

	$e$ pendant edge incident with $i$	$e$ interior edge	$e$ pendant edge not incident with $i$
$φ_{ES}$	$1$	$μ (i, e)$	$0$
$φ_{Pa}$	$μ (i, e)$	$μ (i, e)$	$μ (i, e)$
$φ_{FP}$	$\frac{1}{2 c (i, e)}$	$\frac{1}{2 c (i, e)}$	$\frac{1}{2 c (i, e)}$
${\tilde{φ}}_{FP}$	$1$	$\frac{1}{2 c (i, e)}$	$0$
$φ_{SV}$	$\frac{f (i, e)}{n c (i, e)}$	$\frac{f (i, e)}{n c (i, e)}$	$\frac{f (i, e)}{n c (i, e)}$

Equations

$$\varphi_{T}(i)=\sum_{e\in E(T)}\gamma_{T}(i,e)\,l(e)$$

$$\sum_{i\in[n]}\gamma_{T}(i,e)=1.$$

$$FP_{T}(i)=\sum_{e\in P(T;\rho,i)}\frac{1}{n(e)}\,l(e),$$

$$ES_{T}(i)=\sum_{e\in P(T;\rho,i)}\frac{1}{\Pi(e,i)}\,l(e),$$

$$\Delta_{n}(FP/ES)=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0}\;\left\{\frac{FP_{T}(i)}{ES_{T}(i)}\right\},$$

$$\Delta_{n}(ES/FP)=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0}\;\left\{\frac{ES_{T}(i)}{FP_{T}(i)}\right\},$$

$$\Delta_{n}(FP/ES)=\frac{2^{n-2}}{n-1}\quad\mbox{and}\quad\Delta_{n}(ES/FP)=\frac{n-1}{2}.$$

$$\frac{\sum_{j=0}^{h}a_{j}}{\sum_{j=0}^{h}b_{j}}\leq\max\left\{\frac{a_{j}}{b_{j}},\;j=0,\ldots,h\right\}.$$

$$\frac{FP_{T}(i)}{ES_{T}(i)}=\frac{\sum_{j=0}^{h}l_{j}/n_{j}}{\sum_{j=0}^{h}l_{j}/2^{j}},$$

$$\frac{FP_{T}(i)}{ES_{T}(i)}\leq\frac{\sum_{j=0}^{h}l_{j}/(j+1)}{\sum_{j=0}^{h}l_{j}/2^{j}}\leq\max\left\{\frac{l_{j}/(j+1)}{l_{j}/2^{j}},\;j=0,\ldots,h\right\}=\max\left\{\frac{2^{j}}{j+1},\;j=0,\ldots,h\right\},$$

$$\frac{FP_{T}(i)}{ES_{T}(i)}\leq\frac{2^{n-2}}{n-1}.$$

$$\frac{ES_{T}(i)}{FP_{T}(i)}=\frac{\sum_{j=0}^{h}l_{j}/2^{j}}{\sum_{j=0}^{h}l_{j}/n_{j}}.$$

$$\frac{ES_{T}(i)}{FP_{T}(i)}\leq\max\left\{\frac{l_{j}/2^{j}}{l_{j}/n_{j}},\;j=0,\ldots,h\right\}=\max\left\{\frac{n_{j}}{2^{j}},\;j=0,\ldots,h\right\}.$$

$$\frac{ES_{T}(i)}{FP_{T}(i)}\leq\frac{n-1}{2}.$$

$$FP_{T}(i)-ES_{T}(i)=\sum_{j=1}^{h}l_{j}\left(\frac{1}{n_{j}}-\frac{1}{2^{j}}\right)$$

$$ES_{T}(i)-FP_{T}(i)=\sum_{j=1}^{h}l_{j}\left(\frac{1}{2^{j}}-\frac{1}{n_{j}}\right)$$

$$\Delta_{n}(FP-ES;l_{\rm max})=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0:\,\max\{l(e)\}=l_{\rm max}}\{FP_{T}(i)-ES_{T}(i)\}.$$

$$\Delta_{n}(ES-FP;l_{\rm max})=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0:\,\max\{l(e)\}=l_{\rm max}}\{ES_{T}(i)-FP_{T}(i)\}.$$

$$\sup_{n}\Delta_{n}(ES-FP;l_{\rm max})=l_{\rm max}.$$

$$FP_{T}(i)\quad\mbox{and}\quad ES_{T}(i)$$

$$\Delta_{n}(FP-ES;l_{\rm max})=l_{\rm max}\cdot(\ln n+\gamma-2)+o(1),$$

$$ES_{T}(i)-FP_{T}(i)=\sum_{j=1}^{h}l_{j}\left(\frac{1}{2^{j}}-\frac{1}{n_{j}}\right)<\sum_{j=1}^{h}l_{j}\frac{1}{2^{j}}\leq l_{\rm max}\cdot\sum_{j=1}^{h}\frac{1}{2^{j}}<l_{\rm max}.$$

$$ES_{T_{n}}(i)-FP_{T_{n}}(i)=l_{\rm max}\cdot\sum_{j=1}^{k_{n}}\left(\frac{1}{2^{j}}-\frac{1}{n_{j}}\right).$$

$$\sum_{j=1}^{k_{n}}\frac{1}{n_{j}}\leq\frac{1}{k_{n}}\cdot\sum_{j=1}^{k_{n}}\frac{1}{j}\sim\frac{\ln(k_{n})}{k_{n}}\to 0,$$

$$n_{j}\geq 2^{\lceil\sum_{k=0}^{j-1}l_{k}/l_{\rm max}\rceil}.$$

$$FP_{T}(i)-ES_{T}(i)\leq\sum_{j=1}^{h}l_{j}\left(2^{-\lceil\sum_{k=0}^{j-1}l_{k}/l_{\rm max}\rceil}-2^{-j}\right).$$

$$\sum_{i=1}^{h}x_{i}\,2^{-\sum_{j<i}x_{j}}\leq\frac{2}{\ln 2}\cdot 2^{-x_{0}}.$$

$$FP_{T}(i)-ES_{T}(i)<l_{\rm max}\sum_{j=1}^{h}x_{j}\,2^{-\sum_{k=0}^{j-1}x_{k}}\leq l_{\rm max}\cdot\frac{2}{\ln 2},$$

$$\Delta_{n}(FP-ES;L)=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0:\,\sum_{e}l(e)=L}\{FP_{T}(i)-ES_{T}(i)\}.$$

$$\Delta_{n}(ES-FP;L)=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0:\,\sum_{e}l(e)=L}\{ES_{T}(i)-FP_{T}(i)\}.$$


Full text

K. Wicke: Institute of Mathematics and Computer Science, University of Greifswald, Germany. ORCID iD: 0000-0002-4275-5546

M. Steel: Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand (corresponding author). ORCID iD: 0000-0001-7015-4644

Email: [email protected]


Keywords:

Phylogenetic tree, diversity index, Shapley value, biodiversity measures

Journal: Journal of Mathematical Biology

1 Introduction

Phylogenetic trees play an important role in quantifying biodiversity by estimating how much ‘evolutionary heritage’ is captured by each species and thus how much may be lost due to the current high rates of species extinction. The concept that each extant species carries a combination of unique and shared evolutionary history leads naturally to the notion of a phylogenetic diversity index: a value assigned to each species that depends on its placement in the underlying phylogenetic tree and that, when summed across all species, gives the total diversity of the tree (Redding et al., 2008, 2014; Vellend et al., 2011). For example, the reptile species tuatara, being the sole surviving species of the order Rhynchocephalia, represents 220 million years of unique evolution as traced back to when this species branched off its phylogenetic tree from other lineages that have survived to the present. This species also carries further evolutionary history that is shared with other extant species, and phylogenetic diversity indices quantify not only the unique evolutionary history, but shared history as well.

Methods to apportion the total evolutionary history of life (measured in time or in genetic or trait diversity) across present-day species can be implemented in various ways. In this paper, we explore the mathematical relationship between three closely related indices. Two of these indices, Fair Proportion (FP) (Redding, 2003) and Equal Splits (ES) (Redding and Mooers, 2006), were described for rooted trees, while a third, the Shapley Value (SV), from cooperative game theory, was initially introduced for unrooted trees (Haake et al., 2008). Soon afterwards it was shown that SV on rooted trees is actually equivalent to FP (Fuchs and Jin, 2015) (see also Stahn (2017)). These and other related indices have been incorporated into the EDGE initiative by the Zoological Society of London (Isaac et al., 2007) to quantify the expected loss of evolutionary history associated with different endangered species.

The structure of this paper is as follows. We first review some basic definitions, then define two of the indices (FP and ES). Next, we consider how different FP and ES can be from each other. We do this first by considering their ratios (FP/ES and ES/FP) to obtain concise exact results (Theorem 2.1) which apply regardless of whether or not a molecular clock assumption is imposed. As a simple example of how these results apply, consider all rooted binary phylogenetic trees that classify (say) 20 species at their leaves and all possible assignments of edge lengths. It is then possible for the ES index of a species to be up to 9.5 times larger (but no more) than the FP index for that species; on the other hand, the FP index of a species can be up to about 13,797 times larger (but no more) than the ES index of that species.
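The figures for 20 species follow directly from the closed forms stated in Theorem 2.1, namely $\Delta_{n}(FP/ES)=2^{n-2}/(n-1)$ and $\Delta_{n}(ES/FP)=(n-1)/2$. The following is a minimal arithmetic sketch; the function names are illustrative and not taken from the paper.

```python
from fractions import Fraction


def max_ratio_fp_over_es(n: int) -> Fraction:
    """Closed form 2^(n-2) / (n-1) from Theorem 2.1 (n >= 3)."""
    return Fraction(2 ** (n - 2), n - 1)


def max_ratio_es_over_fp(n: int) -> Fraction:
    """Closed form (n-1) / 2 from Theorem 2.1 (n >= 3)."""
    return Fraction(n - 1, 2)


n = 20
es_over_fp = max_ratio_es_over_fp(n)   # 19/2
fp_over_es = max_ratio_fp_over_es(n)   # 262144/19
print(float(es_over_fp), float(fp_over_es))
```

For $n=20$ this gives $19/2=9.5$ and $2^{18}/19\approx 13{,}797.05$; both values are suprema over strictly positive edge lengths, so they are approached but not attained.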

We then consider how large the differences FP $-$ ES and ES $-$ FP can be, where now we need to bound some aspect of the tree length—either the longest edge length (Theorem 2.2) or the total length of the tree (Theorem 2.3). Companion results are also derived for molecular clock trees. In Theorem 3.1, we characterise the set of trees for which FP and ES are identical, and Section 4 provides a proof that SV is uniquely characterized by four axioms on trees, by using the equivalence of FP and SV. In Section 5, we consider variants of FP and ES defined on unrooted trees and establish a number of results for these measures. We end by highlighting some questions for future work.

1.1 Rooted trees and phylogenetic diversity indices

In this section and the next we deal with rooted phylogenetic $X$ –trees. A rooted tree $T$ with leaf set $X$ is said to be a (rooted) phylogenetic $X$ –tree if each non-leaf vertex is unlabelled and has out-degree at least 2 (two such trees are considered identical if there is a graph isomorphism between them that sends leaf $x$ to leaf $x$ for each $x\in X$ ). In the case where all of the non-leaf vertices have out-degree 2, we say that the tree is binary; we will mostly work with this class in these two sections. Background on the basic combinatorics of phylogenetic trees can be found in Steel (2016). For the rest of this paper we will take, without loss of generality, the leaf set $X$ of trees to be $X=[n]=\{1,\ldots,n\}$ , where $n\geq 2$ .

Throughout this section, let $T$ be a rooted binary phylogenetic tree with root $\rho$ and leaf set $[n]$ , where each edge $e$ is assigned a non-negative length $l(e)$ . Let $L=L(T,l)=\sum\limits_{e}l(e)$ be the total sum of edge lengths of $T$ (see Figure 1(a)).

Any function $\varphi_{T}:[n]\rightarrow\mathbb{R}$ such that $\sum_{i\in[n]}\varphi_{T}(i)=L(T,l)$ is called a phylogenetic diversity index or PD index for short. If $\varphi_{T}(i)$ can be written as a linear function on the edge lengths of $T$ , i.e.

$$\varphi_{T}(i)=\sum_{e\in E(T)}\gamma_{T}(i,e)\,l(e) \tag{1}$$

for coefficients $\gamma_{T}(i,e)$ that are independent of $l(e)$ , we call $\varphi_{T}$ a linear diversity index. In this paper, we will consider three linear PD indices, namely the Fair Proportion index, the Equal Splits index and the Shapley value. Note that an arbitrary function $\varphi_{T}$ of the form described in Eqn. (1) is a diversity index if and only if the following linear equations hold for the coefficients $\gamma_{T}(i,e)$ , for each edge $e$ of $T$ :

$$\sum_{i\in[n]}\gamma_{T}(i,e)=1. \tag{2}$$

1.2 Fair Proportion and Equal Splits

The Fair Proportion (FP) index (Redding, 2003) for leaf $i\in[n]$ (also called ‘evolutionary distinctiveness’) is defined as:

$$FP_{T}(i)=\sum_{e\in P(T;\rho,i)}\frac{1}{n(e)}\,l(e),$$

where $P(T;\rho,i)$ denotes the path in $T$ from the root to leaf $i$ , $l(e)$ is the length of edge $e$ and $n(e)$ is the number of leaves descended from $e$ . Essentially, the FP index distributes each edge length evenly among its descendant leaves. Note that as the order of summation in the definition of the FP index does not matter, we will often reverse the order and go from leaf $i$ to the root, since this is common biological practice. As an example, for the tree $T$ shown in Fig. 1(a) and the leaf $i=1$ , we have $FP_{T}(i)=\frac{1}{1}+\frac{1}{2}+\frac{1}{3}=\frac{11}{6}$ .

A second natural index is the Equal Splits (ES) index (Redding and Mooers, 2006), where each edge length is distributed evenly at each branching point. It is defined as:

$$ES_{T}(i)=\sum_{e\in P(T;\rho,i)}\frac{1}{\Pi(e,i)}\,l(e),$$

where $\Pi(e,i)=1$ if $e$ is a pendant edge incident with $i$ ; otherwise, if $e=(u,v)$ is an interior edge, then $\Pi(e,i)$ is the product of the out-degrees of the interior vertices on the directed path from $v$ to leaf $i$ . Since we will be dealing with binary trees in this paper, $\Pi(e,i)$ is 2 raised to the power of the number of edges between $e$ and leaf $i$ . As an example, for the tree $T$ shown in Fig. 1(a), and the leaf $i=1$ , we have $ES_{T}(i)=\frac{1}{1}+\frac{1}{2}+\frac{1}{4}=\frac{7}{4}$ (where we have again reversed the order of summation).
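Both worked examples can be reproduced from the path data alone. The sketch below assumes a binary tree and lists the path edges from leaf $i$ towards the root; the unit edge lengths and the leaf counts 1, 2, 3 are inferred from the two sums quoted in the text (the figure itself is not reproduced here), and the function names are hypothetical.

```python
from fractions import Fraction


def fp_on_path(lengths, leaf_counts):
    """FP of a leaf: each path edge contributes l(e) / n(e).

    Both lists run from the pendant edge of the leaf up to the root."""
    return sum(Fraction(l) / n for l, n in zip(lengths, leaf_counts))


def es_on_path(lengths):
    """ES of a leaf in a binary tree: the j-th edge above the leaf
    contributes l(e) / 2^j (the pendant edge has j = 0)."""
    return sum(Fraction(l) / 2 ** j for j, l in enumerate(lengths))


# Assumed data for leaf i = 1: three unit-length edges on the path,
# with 1, 2 and 3 descendant leaves respectively.
lengths = [1, 1, 1]
leaf_counts = [1, 2, 3]

fp_value = fp_on_path(lengths, leaf_counts)  # 1 + 1/2 + 1/3
es_value = es_on_path(lengths)               # 1 + 1/2 + 1/4
print(fp_value, es_value)
```

Under these assumptions the two functions return $11/6$ and $7/4$, matching the values stated above.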

Both FP and ES are linear diversity indices (in particular, $\sum_{i\in[n]}FP_{T}(i)=\sum_{i\in[n]}ES_{T}(i)=L(T,l)$). This is easy to see for FP but is less obvious for ES (it suffices to show that Eqn. (2) holds, which is established by Lemma 2 later in this paper). In general, $FP_{T}(i)\neq ES_{T}(i)$, with Figure 1(a) providing a simple example. This raises the question of how different FP and ES can be, and under which circumstances they coincide. Although there have been some simulation studies comparing the two indices on various trees and taxon choices (Redding et al., 2008, 2014), in the first part of this paper we determine the largest possible difference between one index and the other (both in relative terms and for absolute differences), and we also consider how these differences change when the edge lengths are constrained to be ‘clock-like’. In particular, rather than considering how different these indices might be ‘on average’ or for a particular tree with particular edge lengths, we study how different they can be for rooted trees in the most extreme cases.
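The claim that both indices sum to the total tree length can be checked numerically on a small example. The following is a sketch under stated assumptions: the nested-tuple tree encoding and the four-leaf caterpillar with its edge lengths are hypothetical choices made for illustration, not data from the paper.

```python
from fractions import Fraction

# Encoding (an assumption of this sketch): a tree is either a leaf label (int)
# or a tuple of (subtree, edge_length) pairs, one pair per child.


def indices(tree):
    """Return (fp, es, share) for the subtree rooted at `tree`.

    fp[i], es[i]: index values of leaf i counting only edges inside the subtree.
    share[i]: 1 / Pi(e, i) for the edge e entering this subtree from above.
    """
    if isinstance(tree, int):
        return {tree: Fraction(0)}, {tree: Fraction(0)}, {tree: Fraction(1)}
    fp, es, share = {}, {}, {}
    for child, length in tree:
        c_fp, c_es, c_share = indices(child)
        for leaf in c_fp:
            # FP: the edge above `child` is split evenly among its leaves.
            fp[leaf] = c_fp[leaf] + Fraction(length) / len(c_fp)
            # ES: the edge above `child` is split evenly at each branching point.
            es[leaf] = c_es[leaf] + Fraction(length) * c_share[leaf]
            # Passing through this vertex divides the share by its out-degree.
            share[leaf] = c_share[leaf] / len(tree)
    return fp, es, share


# Hypothetical rooted caterpillar on leaves 1..4 with clock-like edge lengths.
cherry = ((1, 1), (2, 1))
inner = ((cherry, 1), (3, 2))
root = ((inner, 1), (4, 3))
total_length = 1 + 1 + 1 + 2 + 1 + 3

fp, es, _ = indices(root)
print(sum(fp.values()), sum(es.values()), total_length)
```

On this tree the two indices agree in total (both equal the tree length 9) but differ leaf by leaf, for example $FP(1)=11/6$ versus $ES(1)=7/4$.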

2 How different can FP and ES be?

In this section, we investigate the maximal difference (across all binary trees with $n$ leaves and all edge lengths, and all leaf choices) between the Fair Proportion index and the Equal Splits index (and vice versa), both in terms of their ratios and their absolute values. Before proceeding, we introduce some further notation that will be helpful in the arguments that follow. Let $RB(n)$ , $n\geq 2$ , denote the set of all binary rooted phylogenetic trees on leaf set $[n]$ .

Notice that each pair $(T,i)$, where $T\in RB(n)$ and $i\in[n]$ is a leaf of $T$, gives rise to a uniquely defined directed path $e_{h},\ldots,e_{0}$ from the root $\rho$ of $T$ to leaf $i$. We will let $n_{j}$ denote the number of leaves descended from the endpoint of $e_{j}$ closest to the leaves. Thus, $n_{0}=1$ and $n_{j}\geq j+1$ for all $j>0$. In addition, when the edge $e_{j}$ has an associated non-negative length $l(e_{j})$, we will let $l_{j}$ denote this length. We will use this notation throughout this paper. In the case where $n_{j}=j+1$ for all $1\leq j\leq h$ and $n_{h}=n-1$ (i.e. when each of the pendant subtrees in Fig. 2 has just one leaf), $T$ is said to be a rooted caterpillar tree, with $i$ in its cherry (a cherry is a pair of leaves adjacent to the same vertex). Note that a tree in $RB(n)$ is a caterpillar if and only if it has exactly one cherry.

We will also occasionally consider a further ‘molecular clock’ condition on the edge lengths:

(MC) The sum of the edge lengths from the tree root to leaf $i$ takes the same value for each leaf $i$ .

This condition applies, for example, if the edge lengths correspond to time, and all the leaves of the tree are sampled at the same time (e.g. at the present; cf. Figure 1(a)).

2.1 Maximal ratios

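We first consider how large FP can be relative to ES (i.e. as a ratio), as well as the ratio of ES to FP. Let

$$\Delta_{n}(FP/ES)=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0}\;\left\{\frac{FP_{T}(i)}{ES_{T}(i)}\right\},$$

and

$$\Delta_{n}(ES/FP)=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0}\;\left\{\frac{ES_{T}(i)}{FP_{T}(i)}\right\},$$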

where (here and below) ‘sup’ refers to supremum (over all assignments $l$ of edge lengths that are positive).

In words, $\Delta_{n}(FP/ES)$ measures the largest possible ratio of the FP index to the ES index across all binary trees with $n$ leaves, all choices of leaf $i$ , and all assignments of strictly positive edge lengths. Similarly, $\Delta_{n}(ES/FP)$ measures the analogous extreme value for the ratio of ES to FP. Throughout this paper, we impose strictly positive edge lengths (in taking the supremum), in order to avoid any ambiguity as to whether an edge in a tree with a zero length edge should be contracted (this causes a discontinuity for the ES value), and to avoid any issues associated with fractions of the form $0/0$ .

Our first theorem shows that, in the most extreme case, the ratio of FP to ES grows exponentially with $n$ , whereas the ratio of ES to FP grows only linearly with $n$ .

Theorem 2.1

For $n\geq 3$ :

$$\Delta_{n}(FP/ES)=\frac{2^{n-2}}{n-1}\quad\mbox{and}\quad\Delta_{n}(ES/FP)=\frac{n-1}{2}.$$

Moreover, these results hold if the molecular clock condition (MC) is imposed.

Proof

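Our proof makes use of the following classical inequality, due to Cauchy (for details, see Steele (2004), p. 82). Let $a_{j},b_{j}>0$ be constants for $j=0,1,\ldots,h$. Then

$$\frac{\sum_{j=0}^{h}a_{j}}{\sum_{j=0}^{h}b_{j}}\leq\max\left\{\frac{a_{j}}{b_{j}},\;j=0,\ldots,h\right\}. \tag{5}$$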
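For completeness, the inequality bounding a ratio of sums by the largest termwise ratio has a one-line justification, sketched here. The symbol $M$ is introduced only for this sketch and denotes $\max_{j}a_{j}/b_{j}$; positivity of the $b_{j}$ is what allows multiplying through.

```latex
% M := \max_{0 \le j \le h} a_j / b_j, so a_j \le M b_j for every j (as b_j > 0).
\sum_{j=0}^{h} a_j \;\le\; \sum_{j=0}^{h} M\, b_j \;=\; M \sum_{j=0}^{h} b_j
\quad\Longrightarrow\quad
\frac{\sum_{j=0}^{h} a_j}{\sum_{j=0}^{h} b_j} \;\le\; M
\;=\; \max_{0 \le j \le h} \frac{a_j}{b_j}.
```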

For the first ratio (FP/ES), using the notation in Fig. 2, we have:

$$\frac{FP_{T}(i)}{ES_{T}(i)}=\frac{\sum_{j=0}^{h}l_{j}/n_{j}}{\sum_{j=0}^{h}l_{j}/2^{j}},$$

and since $n_{j}\geq j+1$ , we have:

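$$\frac{FP_{T}(i)}{ES_{T}(i)}\leq\frac{\sum_{j=0}^{h}l_{j}/(j+1)}{\sum_{j=0}^{h}l_{j}/2^{j}}\leq\max\left\{\frac{l_{j}/(j+1)}{l_{j}/2^{j}},\;j=0,\ldots,h\right\}=\max\left\{\frac{2^{j}}{j+1},\;j=0,\ldots,h\right\}, \tag{6}$$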

where the second inequality is from (5). Now, the expression on the far right of (6) is maximised (subject to the constraint that $j\leq h\leq n-2$ ) by taking $j=h=n-2$ , which gives:

$$\frac{FP_{T}(i)}{ES_{T}(i)}\leq\frac{2^{n-2}}{n-1}. \tag{7}$$

To see that this bound can be realised (in the supremum limit), consider a caterpillar tree that has leaf $i$ in its cherry and where the edges on the path from $\rho$ to $i$ have strictly positive edge lengths $\ell^{\prime},\ell,\ldots,\ell$ , respectively (see Fig. 3(a)). In the limit as the ratio $\ell^{\prime}/\ell$ tends to infinity, $\frac{FP_{T}(i)}{ES_{T}(i)}$ converges to $\frac{2^{n-2}}{n-1}$ which, combined with Inequality (7), establishes the first equality in Theorem 2.1. Moreover, it is clear that one can select the other edge lengths in $T$ so that the (MC) condition holds.
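The limiting argument can be illustrated numerically. The sketch below evaluates $FP_{T}(i)/ES_{T}(i)$ for a cherry leaf of a rooted caterpillar, using $n_{j}=j+1$ along the path; the function name and the particular lengths are assumptions made for illustration.

```python
def fp_es_ratio_caterpillar(n: int, top: float, rest: float = 1.0) -> float:
    """FP/ES for a cherry leaf i of the rooted caterpillar on n leaves.

    The path from i to the root has edges e_0, ..., e_h with h = n - 2 and
    n_j = j + 1. The edge e_h incident with the root has length `top`;
    all other path edges have length `rest`.
    """
    h = n - 2
    lengths = [rest] * h + [top]
    fp = sum(l / (j + 1) for j, l in enumerate(lengths))
    es = sum(l / 2 ** j for j, l in enumerate(lengths))
    return fp / es


n = 6
bound = 2 ** (n - 2) / (n - 1)  # 16/5
for top in (1.0, 1e3, 1e6, 1e9):
    print(top, fp_es_ratio_caterpillar(n, top), bound)
```

As `top / rest` grows, the computed ratio increases towards the bound $2^{n-2}/(n-1)$ without reaching it, which is why the theorem is stated with a supremum.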

For the proof of the second equality in Theorem 2.1, we have:

$$\frac{ES_{T}(i)}{FP_{T}(i)}=\frac{\sum_{j=0}^{h}l_{j}/2^{j}}{\sum_{j=0}^{h}l_{j}/n_{j}}.$$

By Inequality (5), we have:

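$$\frac{ES_{T}(i)}{FP_{T}(i)}\leq\max\left\{\frac{l_{j}/2^{j}}{l_{j}/n_{j}},\;j=0,\ldots,h\right\}=\max\left\{\frac{n_{j}}{2^{j}},\;j=0,\ldots,h\right\}.$$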

Now, $n_{0}=1$ and for each $j>0$ we have $n_{j}\leq n-(h-j)-1$ . Subject to these constraints, the ratio $\frac{n_{j}}{2^{j}}$ is maximised by setting $n_{1}=n-1$ (with $h=j=1$ ). Thus

$$\frac{ES_{T}(i)}{FP_{T}(i)}\leq\frac{n-1}{2}. \tag{9}$$

To see that this bound can be realised, let $T\in RB(n)$ be such that the children of the root consist of a leaf $j$ and an interior vertex $v$ , where the children of $v$ consist of leaf $i$ and a subtree of $T$ having $n-2$ leaves. Let the edge between the root and $v$ have length $\ell^{\prime}>0$ and assign length $\ell>0$ to the edge $(v,i)$ (see Fig. 3(b)). In the limit as the ratio $\ell^{\prime}/\ell$ tends to infinity $\frac{ES_{T}(i)}{FP_{T}(i)}$ converges to $\frac{n-1}{2}$ which, combined with Inequality (9), establishes the second part of Theorem 2.1. Again, it is clear that one can select the other edge lengths in $T$ so that the (MC) condition holds.
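The same kind of numerical check applies to the second extremal tree. In the sketch below the path from leaf $i$ to the root has two edges, with $n_{0}=1$ and $n_{1}=n-1$; the function name and the chosen lengths are illustrative assumptions.

```python
def es_fp_ratio(n: int, top: float, pendant: float = 1.0) -> float:
    """ES/FP for leaf i when the path from i to the root has two edges.

    The pendant edge (v, i) has length `pendant` (n_0 = 1) and the edge
    from the root to v has length `top`, with n_1 = n - 1 leaves below v.
    """
    es = pendant + top / 2
    fp = pendant + top / (n - 1)
    return es / fp


n = 20
bound = (n - 1) / 2  # 9.5
for top in (1.0, 1e3, 1e6, 1e9):
    print(top, es_fp_ratio(n, top), bound)
```

The ratio climbs towards $(n-1)/2$ as `top / pendant` grows, again approaching the bound from below.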

$\Box$

2.2 Maximal differences in terms of $l_{\rm max}$

In this section and the next, we consider the additive difference between $FP_{T}(i)$ and $ES_{T}(i)$ and vice versa for any tree $T\in RB(n)$ and any leaf $i$ of $T$ . These differences can be expressed as follows:

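$$FP_{T}(i)-ES_{T}(i)=\sum_{j=1}^{h}l_{j}\left(\frac{1}{n_{j}}-\frac{1}{2^{j}}\right), \tag{10}$$

$$ES_{T}(i)-FP_{T}(i)=\sum_{j=1}^{h}l_{j}\left(\frac{1}{2^{j}}-\frac{1}{n_{j}}\right). \tag{11}$$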

Note that both sums start at $j=1$, since for $j=0$ we have $n_{j}=2^{j}=1$ and so the additional term in either sum that would correspond to $j=0$ is zero. Also, in contrast to the ratios considered in the last section, these differences can be arbitrarily large (e.g. multiplying all the edge lengths by a constant $C$ multiplies the difference $FP_{T}(i)-ES_{T}(i)$ by $C$). Thus we will analyse these maximal differences both in terms of the length of the longest edge of a tree $l_{\rm max}=\max\limits_{e}l(e)$ and in terms of the sum of edge lengths $L=\sum\limits_{e}l(e)$.

Our second theorem shows that the absolute differences between FP and ES either grow slowly (logarithmically) with $n$ or are bounded independently of $n$. In particular, the difference FP $-$ ES can be made arbitrarily large (for a fixed value of $l_{{\rm max}}$) by increasing the number of taxa; however, ES $-$ FP cannot (it is always bounded above by $l_{{\rm max}}$ regardless of $n$). Moreover, if we impose a molecular clock, then FP $-$ ES becomes bounded above by a constant times $l_{{\rm max}}$. The situation with absolute differences is thus quite different from that for the ratios FP/ES and ES/FP.

To state the theorem more succinctly, we introduce some additional notation. Let

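$$\Delta_{n}(FP-ES;l_{\rm max})=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0:\,\max\{l(e)\}=l_{\rm max}}\{FP_{T}(i)-ES_{T}(i)\}.$$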

In words, $\Delta_{n}(FP-ES;l_{\rm max})$ is the largest possible difference between FP and ES across the set of

• binary trees $T$ with $n$ leaves, and

• assignments of positive edge lengths to $T$ that have a maximal edge length $l_{\rm max}$, and

• choices of leaf $i$.

Similarly, let

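$$\Delta_{n}(ES-FP;l_{\rm max})=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0:\,\max\{l(e)\}=l_{\rm max}}\{ES_{T}(i)-FP_{T}(i)\}.$$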

Note that $\Delta_{n}(FP-ES;l_{\rm max})=\Delta_{n}(ES-FP;l_{\rm max})=0$ for $n=2,3$ . In the following theorem, we consider the case $n\geq 4$ , and we let $\gamma$ denote the Euler–Mascheroni constant ( $\approx 0.5772$ ), and $o(1)$ denote a term that converges to 0 as $n$ grows.

Theorem 2.2

For each $n\geq 4$ :

(i) (a) $\Delta_{n}(FP-ES;l_{\rm max})=l_{\rm max}\cdot\left(\ln n+\gamma-2\right)+o(1)$.

(b) $\Delta_{n}(ES-FP;l_{\rm max})<l_{\rm max}$, and $\sup_{n}\Delta_{n}(ES-FP;l_{\rm max})=l_{\rm max}$.

(ii) If (MC) holds, then $\Delta_{n}(FP-ES;l_{\rm max})<l_{\rm max}\cdot\frac{2}{\ln 2}$.

Proof of Part (i–a): We first show that a triple $(T,i,l)$ that realizes the quantity $\Delta_{n}(FP-ES;l_{\rm max})$ is a rooted caterpillar tree on $n$ leaves with $i$ being a leaf of the cherry in $T$, and each edge on the path from the root of $T$ to $i$ having length $l_{\rm max}$. This is illustrated in Fig. 4(a). Let $e_{j},l_{j}$ and $n_{j}$ be as described in Fig. 2. Let $\delta_{j}=l_{j}\left(\frac{1}{n_{j}}-\frac{1}{2^{j}}\right)$ denote the contribution of edge $e_{j}$ to $FP_{T}(i)-ES_{T}(i)$ (cf. Eqn. (10)). Using only the fact that $T\in RB(n)$ it follows that $n_{j}\geq j+1$ for each $j\geq 0$ and so $\delta_{j}\leq l_{j}\left(\frac{1}{j+1}-\frac{1}{2^{j}}\right)$. In particular, $\operatorname*{arg\,max}\limits_{n_{j}}\;\left\{l_{j}\left(\frac{1}{n_{j}}-\frac{1}{2^{j}}\right)\right\}\,=\left\{j+1\right\}$, and so $\delta_{j}$ is maximal if and only if $n_{j}=j+1$. As this holds for all values of $j$, this immediately implies that the maximal pending subtree of $T$ containing leaf $i$ (call it $t_{1}$) has to be a caterpillar tree on $n^{\prime}\leq n-1$ leaves and with $i$ being a leaf of the cherry of this caterpillar. We show that $n^{\prime}=n-1$ (and thus $T$ is a caterpillar) by deriving a contradiction. Suppose that $n^{\prime}<n-1$. In that case, the two subtrees of $T$ incident with the root of $T$ consist of $t_{1}$ and another subtree (call it $t_{2}$) that has two or more leaves. In particular, this implies that $h<n-2$ (i.e. there are fewer than $n-1$ edges on the path from $i$ to the root of $T$). However, as $\frac{1}{j+1}-\frac{1}{2^{j}}\geq 0$ for each $j$, this would imply that $T$ is not a tree that maximises $\max\limits_{i^{\prime}\,\in\,[n]}\;\sup\limits_{l>0}\;\{FP_{T}(i^{\prime})-ES_{T}(i^{\prime})\}$, since $FP_{T}(i)-ES_{T}(i)$ could be increased by sequentially attaching all but one leaf from $t_{2}$ to the edge connecting $t_{1}$ and the root (i.e. by extending the length of the path from leaf $i$ to the root of $T$). Thus, $n^{\prime}=n-1$, and therefore $T$ has to be the caterpillar tree on $n$ leaves that has $i$ in its cherry. Moreover, by again invoking the inequality $\frac{1}{j+1}-\frac{1}{2^{j}}\geq 0$ (for all $j\geq 0$) and recalling that $\delta_{j}=l_{j}\left(\frac{1}{j+1}-\frac{1}{2^{j}}\right)$ for this caterpillar, we can also conclude that $l_{j}=l_{\rm max}$ for all $j$ (as otherwise $\delta_{j}$, and thus $FP_{T}(i)-ES_{T}(i)$, could be increased). In summary, $(T,i,l)$ has the structure claimed.

It is now straightforward to calculate $\Delta_{n}(FP-ES;l_{\rm max})$ for the optimal choice of $(T,i,l)$ described above. We have:

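$$FP_{T}(i)=l_{\rm max}\cdot\sum_{j=0}^{n-2}\frac{1}{j+1}=l_{\rm max}\cdot\sum_{k=1}^{n-1}\frac{1}{k}\quad\mbox{and}\quad ES_{T}(i)=l_{\rm max}\cdot\sum_{j=0}^{n-2}\frac{1}{2^{j}}=l_{\rm max}\cdot\left(2-2^{-(n-2)}\right).$$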

Consequently,

$$\Delta_{n}(FP-ES;l_{\rm max})=l_{\rm max}\cdot(\ln n+\gamma-2)+o(1),$$

which completes the proof of Part (i–a).
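The asymptotic form in Part (i–a) can be compared with the exact value on the extremal caterpillar, where every path edge has length $l_{\rm max}$ and $n_{j}=j+1$. The sketch below is illustrative only; the function names are hypothetical and the Euler–Mascheroni constant is hard-coded to double precision.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def exact_gap(n: int, l_max: float = 1.0) -> float:
    """FP - ES for a cherry leaf of the caterpillar on n leaves,
    with every path edge of length l_max (sum over j = 1, ..., n - 2)."""
    return l_max * sum(1 / (j + 1) - 1 / 2 ** j for j in range(1, n - 1))


def asymptotic_gap(n: int, l_max: float = 1.0) -> float:
    """Leading term l_max * (ln n + gamma - 2) from Theorem 2.2 (i-a)."""
    return l_max * (math.log(n) + EULER_GAMMA - 2)


for n in (4, 10, 100, 1000):
    print(n, exact_gap(n), asymptotic_gap(n))
```

The gap between the two columns shrinks as $n$ grows, consistent with the $o(1)$ term; for small $n$ (such as $n=4$, where the exact value is $1/12$) the leading term alone is a poor approximation.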

Proof of Part (i–b): From Eqn. (11), we have:

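$$ES_{T}(i)-FP_{T}(i)=\sum_{j=1}^{h}l_{j}\left(\frac{1}{2^{j}}-\frac{1}{n_{j}}\right)<\sum_{j=1}^{h}l_{j}\frac{1}{2^{j}}\leq l_{\rm max}\cdot\sum_{j=1}^{h}\frac{1}{2^{j}}<l_{\rm max}.$$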

Thus, $\Delta_{n}(ES-FP;l_{\rm max})<l_{\rm max}.$ To show that $\sup_{n}\Delta_{n}(ES-FP;l_{\rm max})=l_{\rm max},$ let $T_{n}$ be a tree in which the path $P$ from the root to leaf $i$ has $k_{n}=\lfloor\sqrt{n-1}\rfloor$ edges, and each of the subtrees incident with the vertices of $P$ (except the final leaf vertex) has at least $k_{n}$ leaves. Assign edge length $l_{\rm max}$ to each of the edges in $P$ . This is illustrated in Fig. 4(b). Then

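$$ES_{T_{n}}(i)-FP_{T_{n}}(i)=l_{\rm max}\cdot\sum_{j=1}^{k_{n}}\left(\frac{1}{2^{j}}-\frac{1}{n_{j}}\right). \tag{12}$$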

Now, $\lim_{n\rightarrow\infty}\sum\limits_{j=1}^{k_{n}}\frac{1}{2^{j}}=1$ and since $n_{j}\geq j\cdot k_{n}$ , we have:

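$$\sum_{j=1}^{k_{n}}\frac{1}{n_{j}}\leq\frac{1}{k_{n}}\cdot\sum_{j=1}^{k_{n}}\frac{1}{j}\sim\frac{\ln(k_{n})}{k_{n}}\to 0,$$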

as $n\rightarrow\infty$ . Combining this with Eqn. (12) gives: $\lim_{n\rightarrow\infty}ES_{T_{n}}(i)-FP_{T_{n}}(i)=l_{\rm max}$ , as required.
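A numerical sketch of this construction shows how slowly the limit $l_{\rm max}$ is approached. The code makes two simplifying assumptions that are not in the text: each pendant subtree along the path has exactly $k_{n}$ leaves (so $n_{j}=1+j\,k_{n}$), and the sum runs over the interior path edges $e_{1},\ldots,e_{k_{n}-1}$ of a path with $k_{n}$ edges, which does not change the limit.

```python
import math


def es_minus_fp(n: int, l_max: float = 1.0) -> float:
    """ES - FP for leaf i in a tree of the kind described above.

    Assumptions of this sketch: the path from the root to i has
    k = floor(sqrt(n - 1)) edges, each of length l_max, and every pendant
    subtree along the path has exactly k leaves, so n_j = 1 + j * k.
    """
    k = math.isqrt(n - 1)
    return l_max * sum(1 / 2 ** j - 1 / (1 + j * k) for j in range(1, k))


for n in (101, 10_001, 1_000_001):
    print(n, es_minus_fp(n))
```

The values increase with $n$ and stay strictly below $l_{\rm max}=1$, in line with Part (i–b).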

Proof of Part (ii): Let $T\in RB(n)$ and $i\in[n]$ . From Eqn. (10), we have $FP_{T}(i)-ES_{T}(i)=\sum\limits_{j=1}^{h}l_{j}\left(\frac{1}{n_{j}}-\frac{1}{2^{j}}\right).$ We claim that, under condition (MC),

$$n_{j}\geq 2^{\lceil\sum_{k=0}^{j-1}l_{k}/l_{\rm max}\rceil}. \tag{13}$$

To establish Inequality (13), observe that the (MC) condition implies that for each leaf $i^{\prime}$ of $T$ descended from the endpoint $v_{j}$ of $e_{j}$ closest to the leaves, the sum of the edge lengths from $v_{j}$ to leaf $i^{\prime}$ is equal to $\sum_{k=0}^{j-1}l_{k}$. Moreover, each of these edges has length at most $l_{\rm max}$, which means that the number of edges on this path must be at least $m\coloneqq\lceil\sum_{k=0}^{j-1}l_{k}/l_{\rm max}\rceil$. Now, for $r\geq 1$, let $N_{r}$ be the number of vertices descended from $v_{j}$ that are separated from $v_{j}$ by exactly $r$ edges. We then have $N_{r}=2^{r}$ for all $r=1,\ldots,m$. This follows from an inductive argument. Clearly, $N_{1}=2$ (as $T$ is binary and $v_{j}$ is not a leaf since $j>0$). Suppose the statement is true for some $r$ with $1\leq r<m$ and consider $N_{r+1}$. Each vertex counted by $N_{r}$ must have two children (otherwise there would be a leaf that is separated from $v_{j}$ by fewer than $m$ edges) and thus $N_{r+1}=2N_{r}=2\cdot 2^{r}=2^{r+1}$, which completes the inductive step. Now, as all leaves descended from $v_{j}$ are separated by at least $m$ edges from $v_{j}$, we have $n_{j}\geq N_{m}\geq 2^{m}=2^{\lceil\sum_{k=0}^{j-1}l_{k}/l_{\rm max}\rceil}$, which completes the proof.

Thus, from Eqn. (13) and Eqn. (10), we have:

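$$FP_{T}(i)-ES_{T}(i)\leq\sum_{j=1}^{h}l_{j}\left(2^{-\lceil\sum_{k=0}^{j-1}l_{k}/l_{\rm max}\rceil}-2^{-j}\right). \tag{14}$$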

To complete the proof of Part (ii), we require the following lemma, the proof of which is provided in the Appendix.

Lemma 1

Suppose that $x_{0},x_{1},x_{2},\ldots,x_{h}$ all lie in the interval $[0,1]$ . Then

$$\sum_{i=1}^{h}x_{i}\,2^{-\sum_{j<i}x_{j}}\leq\frac{2}{\ln 2}\cdot 2^{-x_{0}}.$$

We apply this lemma by setting $x_{i}=l_{i}/l_{\rm max}$ for $i=0,1,\ldots,h$ . By Inequality (14), we have:

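$$FP_{T}(i)-ES_{T}(i)<l_{\rm max}\sum_{j=1}^{h}x_{j}\,2^{-\sum_{k=0}^{j-1}x_{k}}\leq l_{\rm max}\cdot\frac{2}{\ln 2},$$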

as required, where the last inequality is from Lemma 1. $\Box$
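The inequality of Lemma 1, $\sum_{i=1}^{h}x_{i}2^{-\sum_{j<i}x_{j}}\leq\frac{2}{\ln 2}\cdot 2^{-x_{0}}$ for $x_{0},\ldots,x_{h}\in[0,1]$, can be spot-checked on random inputs. This is a sanity check and not a substitute for the proof in the Appendix; the function name, sample sizes and seed are arbitrary choices.

```python
import math
import random


def lemma1_sides(x):
    """Return (lhs, rhs) of Lemma 1 for a sequence x_0, ..., x_h in [0, 1]."""
    lhs = 0.0
    prefix = x[0]  # running value of x_0 + ... + x_{i-1}
    for xi in x[1:]:
        lhs += xi * 2 ** (-prefix)
        prefix += xi
    rhs = (2 / math.log(2)) * 2 ** (-x[0])
    return lhs, rhs


random.seed(0)
worst = 0.0
for _ in range(2000):
    h = random.randint(1, 60)
    x = [random.random() for _ in range(h + 1)]
    lhs, rhs = lemma1_sides(x)
    worst = max(worst, lhs / rhs)
print(worst)
```

In this experiment the largest observed value of the ratio `lhs / rhs` stays below 1, as the lemma requires.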

2.3 Maximal differences in terms of $L$

We now describe the maximal possible (positive and negative) difference between FP and ES in terms of the total length of the tree ( $L=\sum\limits_{e}l(e)$ ), rather than in terms of $l_{\rm max}$ (this is summarized in Theorem 2.3 below). Let

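$$\Delta_{n}(FP-ES;L)=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0:\,\sum_{e}l(e)=L}\{FP_{T}(i)-ES_{T}(i)\}.$$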

In words, $\Delta_{n}(FP-ES;L)$ is the largest possible difference between FP and ES across the set of:

• binary trees $T$ with $n$ leaves, and

• assignments of positive edge lengths to $T$ for which the total sum of the edge lengths is $L$, and

• choices of leaf $i$.

Similarly, let

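$$\Delta_{n}(ES-FP;L)=\max_{T\in RB(n)}\;\max_{i\in[n]}\;\sup_{l>0:\,\sum_{e}l(e)=L}\{ES_{T}(i)-FP_{T}(i)\}.$$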

Theorem 2.3


(i)

[TABLE]

where

[TABLE]

and for $n\geq 3$ :

[TABLE]

(ii)

If the molecular clock (MC) condition is imposed then the above expressions for $\Delta_{n}(FP-ES;L)$ and $\Delta_{n}(ES-FP;L)$ remain true if $L$ is replaced by $L/2$ .

Proof

For Part (i), we first show that for any given tree $T\in RB(n)$ and any leaf $i$ of $T$ we have:

[TABLE]

Recall from Eqn. (10) that $FP_{T}(i)-ES_{T}(i)=\sum_{j=1}^{h}l_{j}\left(\frac{1}{n_{j}}-\frac{1}{2^{j}}\right)$ , and observe that for $j=0$ , we have $n_{j}=1=2^{j}$ . Thus, in particular, we have:

[TABLE]

Moreover, for any tree $T$ , we always have: $n_{j}\geq j+1$ for each $j\geq 0$ , and therefore

[TABLE]

Let $c_{j}:=\frac{1}{j+1}-\frac{1}{2^{j}}$ for $j\geq 0$ . The sequence $c_{j}$ for $j=0,1,2,\ldots$ begins as follows:

$$c_{0}=0,\quad c_{1}=0,\quad c_{2}=\tfrac{1}{12},\quad c_{3}=\tfrac{1}{8},\quad c_{4}=\tfrac{11}{80},$$

after which the values in the sequence begin to decline. This establishes Inequality (15), as required.
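The behaviour of the sequence $c_{j}$ can be checked with exact arithmetic. The following is a minimal sketch; the helper name `c` and the cut-off of 12 terms are illustrative choices, not part of the paper.

```python
from fractions import Fraction

def c(j):
    """Coefficient c_j = 1/(j+1) - 1/2^j from the proof of Inequality (15)."""
    return Fraction(1, j + 1) - Fraction(1, 2 ** j)

# First twelve terms of the sequence, as exact fractions.
values = [c(j) for j in range(12)]

# Index at which the listed terms are largest.
best = max(range(len(values)), key=lambda j: values[j])

# True if the terms strictly decrease from index `best` onwards.
declining = all(values[j] > values[j + 1] for j in range(best, len(values) - 1))

print(best, values[best], declining)
```

The maximum sits at $j=4$, which matches the choice $k=\min\{4,n-2\}$ in the caterpillar construction used to show that the bound is attained.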

To show that Inequality (15) is an equality, it suffices to show that for each $n\geq 2$ and every $\epsilon>0$ there exists a tree $T\in RB(n)$ with positive edge lengths and a leaf $i$ of $T$ for which $FP_{T}(i)-ES_{T}(i)\geq\lambda_{n}L-\epsilon$ . To this end, let $T_{n}$ be a rooted caterpillar tree with $n$ leaves, let $i$ be a leaf in the cherry of $T_{n}$ , let the interior edge at distance $k=\min\{4,n-2\}$ from leaf $i$ have length $L-\epsilon$ , and let all the remaining edges of $T_{n}$ have strictly positive lengths that sum to $\epsilon$ . In this case:

[TABLE]

holds for $T=T_{n}$ as required.
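The caterpillar construction can be reproduced numerically. The sketch below assumes a hypothetical encoding of a rooted tree as nested lists of `(child, edge_length)` pairs, computes FP and ES for every leaf, and evaluates the gap for the leaf in the cherry. The values of $n$, $L$ and $\epsilon$, and the equal sharing of $\epsilon$ among the remaining edges, are arbitrary illustrative choices.

```python
from fractions import Fraction

def leaves(node):
    """Leaf labels below a node; any non-list node is a leaf."""
    if not isinstance(node, list):
        return [node]
    return [x for child, _ in node for x in leaves(child)]

def fp_and_es(tree):
    """FP and ES of every leaf of a rooted tree given as nested lists."""
    fp, es = {}, {}
    def walk(node, fp_acc, es_acc):
        if not isinstance(node, list):
            fp[node], es[node] = fp_acc, es_acc
            return
        for child, length in node:
            # FP: edge length shared equally among the leaves below the edge.
            # ES: everything inherited from above is split among the children.
            walk(child,
                 fp_acc + Fraction(length) / len(leaves(child)),
                 es_acc / len(node) + Fraction(length))
    walk(tree, Fraction(0), Fraction(0))
    return fp, es

def caterpillar(n, interior, pend):
    """Caterpillar on leaves 1..n with cherry {1, 2}; interior[j] is the
    length of the j-th interior edge above leaf 1, pend the pendant length."""
    node = [(1, pend), (2, pend)]
    for j in range(1, n - 1):
        node = [(node, interior[j]), (j + 2, pend)]
    return node

n, L, eps = 8, Fraction(1), Fraction(1, 1000)
k = min(4, n - 2)
small = eps / (2 * n - 3)          # the other 2n-3 edges share eps equally
interior = {j: (L - eps if j == k else small) for j in range(1, n - 1)}
fp, es = fp_and_es(caterpillar(n, interior, small))

gap = fp[1] - es[1]
c_k = Fraction(1, k + 1) - Fraction(1, 2 ** k)
print(float(gap), float(c_k * L))
```

As $\epsilon$ shrinks, the gap for leaf 1 approaches $c_{k}L$ from below, in line with the argument that the upper bound cannot be improved.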

We turn now to $\Delta_{n}(ES-FP;L)$ . We first show that for any given tree $T\in RB(n)$ and any leaf $i$ of $T$ :

[TABLE]

From Eqn. (11), we have: $ES_{T}(i)-FP_{T}(i)=\sum_{j=1}^{h}l_{j}\left(\frac{1}{2^{j}}-\frac{1}{n_{j}}\right).$ Now, $\left(\frac{1}{2^{j}}-\frac{1}{n_{j}}\right)$ takes a value that is, at most, $\frac{1}{2}-\frac{1}{n-1}$ for all $j\geq 1$ . Thus:

[TABLE]

as required to establish Inequality (16).

To show that Inequality (16) is an equality it suffices to show that for each $n\geq 3$ , and every $\epsilon>0$ there exists a tree $T\in RB(n)$ with positive edge lengths, and there is a leaf $i$ of $T$ for which

[TABLE]

To this end, let $T\in RB(n)$ be any tree for which the children of the root consist of a leaf $j$ and an interior vertex $v$ , where the children of $v$ consist of a leaf $i$ and a subtree of $T$ having $n-2$ leaves. Let the edge between the root and $v$ have length $L-\epsilon$ and let the remaining edges have strictly positive lengths that sum to $\epsilon$ . Then

[TABLE]

as required.

Part (ii): We now impose the (MC) condition. For $\Delta_{n}(FP-ES;L)$ , observe that our proof of Inequality (15) invoked the inequality $\sum_{j=0}^{h}l_{j}\leq L$ . When (MC) holds, we have a tighter bound on this sum, namely $\sum_{j=0}^{h}l_{j}\leq L/2$ , since there is at least one other leaf $k$ of $T$ for which the path from the root $\rho$ of $T$ to $k$ also has length $\sum_{j=0}^{h}l_{j}$ (by (MC)) and is edge-disjoint from the path from $\rho$ to $i$ (thus $2\sum_{j=0}^{h}l_{j}\leq L$ ). In this way, we claim that

[TABLE]

when (MC) holds.

To show that this inequality holds it suffices to show that for each $n\geq 2$ , and every $\epsilon>0$ there exists a tree $T_{n}\in RB(n)$ with positive edge lengths, and there is a leaf $i$ of $T$ for which:

[TABLE]

where $O(\epsilon)$ is a term that tends to zero as $\epsilon\rightarrow 0$ . This trivially holds for $n=2,3$ (indeed it holds for $\epsilon=0$ ); while for $n=4,5,6$ let $T_{n}$ be a caterpillar tree with $i$ being a leaf in its cherry. Let $(\rho,v)$ and $(\rho,j)$ denote the two edges incident with the root $\rho$ of $T$ , where $j$ is a leaf. Assign the edge $(\rho,v)$ length $L/2-5\epsilon/2$ and the edge $(\rho,j)$ length $L/2-\epsilon$ . We then assign the path from $v$ to $i$ and from $v$ to its adjacent leaf (which exists since it is a caterpillar) a length of $3\epsilon/2$ . Now adjust the remaining edge lengths so they sum to $\epsilon/2$ and so that the (MC) condition holds for $T$ (see Fig. 5(a) for the case $n=6$ ). This assignment then satisfies Inequality (17) for $T=T_{n}$ , as required.

For the case $n>6$ , let $T^{\prime}_{n}$ be obtained from $T_{6}$ (in the previous argument) by replacing leaf $j$ by an arbitrary rooted binary subtree with $n-5$ leaves with root $v^{\prime}$ . Assign length $L/2-5\epsilon/2$ to each of the two edges ( $(\rho,v)$ and $(\rho,v^{\prime})$ ) that are incident with the root. Set the length of the path from $v$ to leaf $i$ , and the length of the path from $v$ to its adjacent leaf to equal $\epsilon$ , and set the length of each of two disjoint paths from $v^{\prime}$ to some pair of descendant leaves also equal to $\epsilon$ (see Fig. 5(b)). Finally, select edge lengths within these two subtrees so as to maintain the (MC) condition and so that the sum of the lengths of the additional edges added to these two subtrees is $\epsilon$ . In this way, the (MC) condition holds for the tree, $L=2(L/2-5\epsilon/2)+2(2\epsilon)+\epsilon$ equals the sum of the edge lengths, and Inequality (17) holds for $T=T_{n}^{\prime}$ , as required.

We now establish Part (ii) for the quantity $\Delta_{n}(ES-FP;L)$ . The argument for the inequality $\Delta_{n}(ES-FP;L)\leq\left(\frac{1}{2}-\frac{1}{n-1}\right)L/2$ when (MC) holds is identical to the corresponding inequality for $\Delta_{n}(FP-ES;L)$ under (MC). Moreover, to show that this inequality can be realised, consider again the tree $T\in RB(n)$ described in the previous paragraph, to which we will assign similar but modified edge lengths (we can assume that $n\geq 4$ , since the equality holds when $n=3$ ). For the edge between the root and $v$ , assign length $L/2-\epsilon$ ; for the edge between the root and leaf $j$ , assign length $L/2-2\epsilon/5$ ; for the edge $(v,i)$ assign length $3\epsilon/5$ ; and assign the lengths of the remaining edges so that they sum to $4\epsilon/5$ and satisfy (MC) (this is possible, since we are assuming that $n\geq 4$ ). In this way, the total sum of edge lengths is $L$ and the path length from the root to each leaf takes the same value (namely, $L/2-2\epsilon/5$ ), and the result of Part (ii) for $\Delta_{n}(ES-FP;L)$ now follows.

$\Box$

3 For which tree shapes do FP and ES coincide?

In the following, we will analyse for which tree shapes FP and ES coincide. To this end, recall that a rooted binary tree $T$ can be decomposed into its two maximal pending subtrees $T^{\prime}$ and $T^{\prime\prime}$ rooted at the direct descendants of the root. We denote this by writing $T=(T^{\prime},T^{\prime\prime})$ (note that the order of $T^{\prime}$ and $T^{\prime\prime}$ is not important, thus $T=(T^{\prime},T^{\prime\prime})=(T^{\prime\prime},T^{\prime})$ ). Now, let $T$ be a binary tree with $n=2^{h}$ leaves, in which each leaf is separated from the root by a path of precisely $h$ edges. We call this (unique shape) tree the fully balanced tree of height $h$ and denote it by $T_{h}^{fb}$ . Note that we have $T_{h}^{fb}=(T_{h-1}^{fb},T_{h-1}^{fb})$ , i.e. both maximal pending subtrees of a fully balanced tree of height $h$ are fully balanced trees of height $h-1$ . Using the notation of Fig. 2 it is thus easy to see that for a leaf $i$ of $T^{fb}_{h}$ and an edge $e_{j}$ on the path from the root of $T_{h}^{fb}$ to leaf $i$ we always have: $n_{j}=2^{j}$ . It is now not difficult to show that FP and ES coincide (for all choices of reference leaf $i$ ) on any fully balanced tree. However, there are other tree shapes for which FP and ES coincide (e.g. the tree $T^{\prime}$ in Fig. 1(b)). To describe them, let $T^{sb}$ be a rooted binary tree, whose two maximal pending subtrees $T^{\prime}$ and $T^{\prime\prime}$ are both fully balanced trees of height $h^{\prime}$ and $h^{\prime\prime}$ , respectively (where $h^{\prime}$ and $h^{\prime\prime}$ are not necessarily identical), i.e. $T^{sb}=(T^{fb}_{h^{\prime}},T^{fb}_{h^{\prime\prime}})$ . We call such a tree a semi-balanced tree. Then,

Theorem 3.1

Let $T$ be a rooted binary phylogenetic tree on taxon set $[n]$ . Then, we have: $ES_{T}(i)=FP_{T}(i)$ for all $i\in[n]$ and all assignments of positive edge lengths $l(e)$ if and only if $T$ is a semi-balanced tree.

Proof

We first show that if $T$ is a semi-balanced tree (i.e. $T=T^{sb}$ ) we have $ES_{T}(i)=FP_{T}(i)$ for all $i\in[n]$ . To this end, let $T^{fb}_{h^{\prime}}$ and $T^{fb}_{h^{\prime\prime}}$ denote the two maximal pending subtrees of $T$ . Recall that

[TABLE]

As both sums just run over edges on the path from the root to leaf $i$ , $FP_{T}(i)$ and $ES_{T}(i)$ are independent of $T^{fb}_{h^{\prime\prime}}$ if $i\in T^{fb}_{h^{\prime}}$ and vice versa. Let $i$ be a leaf of $T^{fb}_{h^{\prime}}$ . As $T^{fb}_{h^{\prime}}$ is a fully balanced tree, we have $n_{j}=2^{j}$ for all $j=1,\ldots,h^{\prime}$ , and thus, using Eqn. (10), we immediately have

[TABLE]

(i.e. $FP_{T}(i)=ES_{T}(i)$ ). Analogously, this holds for all leaves of $T^{fb}_{h^{\prime\prime}}$ , so $ES_{T}(i)=FP_{T}(i)$ for all $i\in[n]$ .

Now suppose that $FP_{T}(i)=ES_{T}(i)$ for all $i\in[n]$ . By way of contradiction assume that $T=(T^{\prime},T^{\prime\prime})$ is not a semi-balanced tree, i.e. assume that at least one of the maximal pending subtrees of $T$ , say $T^{\prime}$ , is not a fully balanced tree. This implies that there exists an interior vertex $v$ in $T^{\prime}$ with the following two properties:

(i)

For the subtree $T_{v}=(T^{\prime}_{v},T^{\prime\prime}_{v})$ rooted at $v$ we have: $n^{\prime}_{v}\neq n^{\prime\prime}_{v}$ , where $n^{\prime}_{v}$ and $n^{\prime\prime}_{v}$ denote the number of leaves of $T^{\prime}_{v}$ and $T^{\prime\prime}_{v}$ , respectively.

(ii)

$v$ is chosen so that $T_{v}$ is a minimal subtree of $T^{\prime}$ satisfying property (i) (in the sense that there exists no subtree $T_{w}$ of $T^{\prime}$ on fewer leaves that has this property).

In particular, this implies that both maximal pending subtrees of $T_{v}$ are fully balanced trees. Without loss of generality we may assume that $n^{\prime}_{v}>n^{\prime\prime}_{v}$ (otherwise exchange the roles of $T^{\prime}_{v}$ and $T^{\prime\prime}_{v}$ ), in which case $h^{\prime}_{v}>h^{\prime\prime}_{v}$ .

Now, for a leaf $i$ and an edge $e$ of $T$ , we use $\delta_{e}^{FP}(i)$ and $\delta_{e}^{ES}(i)$ to denote the contribution of edge $e$ to $FP_{T}(i)$ , respectively $ES_{T}(i)$ , where

[TABLE]

Let $\Delta^{FP}(i)=\sum_{e\in T_{v}}\delta_{e}^{FP}(i)$ and $\Delta^{ES}(i)=\sum_{e\in T_{v}}\delta_{e}^{ES}(i)$ . Now, as both maximal pending subtrees of $T_{v}$ are fully balanced trees, we can use the first part of the proof to conclude that for each $i\in T_{v}$ : $\Delta^{FP}(i)=\Delta^{ES}(i)$ and we denote this common value by $\Delta(i)$ .

Now, let $l_{1},l_{2},\ldots,l_{h}$ be the lengths of the edges $e_{1},e_{2},\ldots,e_{h}$ on the path from vertex $v$ to the root and let $n_{j}$ be the number of leaves descended from edge $e_{j}$ . Let $i^{\prime}$ be a leaf of $T^{\prime}_{v}$ and let $i^{\prime\prime}$ be a leaf of $T^{\prime\prime}_{v}$ . By assumption, $FP_{T}(i)=ES_{T}(i)$ for all $i\in[n]$ , and so we have

[TABLE]

In particular

[TABLE]

However, as $h^{\prime}_{v}>h^{\prime\prime}_{v}$ and $l_{j}>0$ for all $j$ , this is a contradiction. A similar argument yields a contradiction for the assumption that $T^{\prime\prime}$ is not a fully balanced tree. Thus, $T$ has to be a semi-balanced tree, which completes the proof.

$\Box$
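Theorem 3.1 can be illustrated on small instances. The sketch below builds a semi-balanced tree with random positive integer edge lengths and compares FP and ES leaf by leaf; a four-leaf caterpillar serves as a tree that is not semi-balanced. The nested-list encoding, the helper names and the chosen heights and lengths are assumptions made for illustration only.

```python
import random
from fractions import Fraction

def leaves(node):
    """Leaf labels below a node; any non-list node is a leaf."""
    if not isinstance(node, list):
        return [node]
    return [x for child, _ in node for x in leaves(child)]

def fp_and_es(tree):
    """FP and ES of every leaf; tree is nested lists of (child, length)."""
    fp, es = {}, {}
    def walk(node, fp_acc, es_acc):
        if not isinstance(node, list):
            fp[node], es[node] = fp_acc, es_acc
            return
        for child, length in node:
            walk(child,
                 fp_acc + Fraction(length) / len(leaves(child)),
                 es_acc / len(node) + Fraction(length))
    walk(tree, Fraction(0), Fraction(0))
    return fp, es

def fully_balanced(h, rng, labels):
    """Fully balanced tree of height h with random edge lengths in 1..9."""
    if h == 0:
        return next(labels)            # a single leaf has height 0
    return [(fully_balanced(h - 1, rng, labels), rng.randint(1, 9))
            for _ in range(2)]

rng = random.Random(0)
labels = iter(range(1, 100))           # fresh leaf names
# Semi-balanced tree: pending subtrees of heights 3 and 1.
semi_balanced = [(fully_balanced(3, rng, labels), rng.randint(1, 9)),
                 (fully_balanced(1, rng, labels), rng.randint(1, 9))]
fp_sb, es_sb = fp_and_es(semi_balanced)

# Caterpillar (((1,2),3),4): the pending subtree ((1,2),3) is not
# fully balanced, so the tree is not semi-balanced.
caterpillar4 = [([([(1, 2), (2, 3)], 1), (3, 4)], 5), (4, 6)]
fp_cat, es_cat = fp_and_es(caterpillar4)

print(fp_sb == es_sb, fp_cat == es_cat)
```

On the semi-balanced tree the two dictionaries agree exactly, whereas on the caterpillar leaf 3 already separates the indices, because the edge above it has three descendant leaves rather than two.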

4 Uniqueness of SV for phylogenetic tree games

Another linear PD index frequently used is the so-called Shapley value (SV), which originates from cooperative game theory. Recall that a cooperative game is a pair $([n],\nu)$ consisting of a set of players $[n]=\{1,\ldots,n\}$ and a characteristic function $\nu:2^{[n]}\rightarrow\mathbb{R}$ that assigns a real value to all subsets of $[n]$ with $\nu(\emptyset)=0$ . A function $\varphi_{\nu}:[n]\rightarrow\mathbb{R}$ that assigns a payoff to each player is called a value for the game. One such value is the Shapley value (Shapley (1953)), which is defined as follows:

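$$\varphi_{\nu}(i)=\frac{1}{n!}\sum_{S\subseteq[n],\,i\in S}(|S|-1)!\,(n-|S|)!\,\bigl(\nu(S)-\nu(S\setminus\{i\})\bigr).\tag{18}$$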

Note that the Shapley value of a player $i$ reflects the average marginal contribution of $i$ to the game. Moreover, it is characterised by the following four axioms:

1. Pareto efficiency: $\sum_{i\in[n]}\varphi_{\nu}(i)=\nu([n])$ .

2. Symmetry: For all $i,j\in[n]$ with $i\neq j$ : if $\nu(C\cup\{i\})=\nu(C\cup\{j\})$ for all $C\subseteq[n]\setminus\{i,j\}$ , then $\varphi_{\nu}(i)=\varphi_{\nu}(j)$ .

3. Dummy axiom: If $\nu(C\cup\{i\})=\nu(C)$ for all $C\subseteq[n]\setminus\{i\}$ , then $\varphi_{\nu}(i)=0$ .

4. Additivity: $\forall\,\nu_{1},\nu_{2},\forall i\in[n],\,\varphi_{\nu_{1}+\nu_{2}}(i)=\varphi_{\nu_{1}}(i)+\varphi_{\nu_{2}}(i).$

In fact, the Shapley value is the unique value satisfying these four axioms.

Theorem 4.1

The Shapley value is the unique value satisfying Axioms 1–4 (Shapley (1953); Winter (2002)).

Note that the formulation described here is slightly different from the original formulation in Shapley (1953). On the one hand, Shapley (1953) used a framework consisting of three axioms: symmetry, additivity, and a carrier axiom, the latter comprising both Pareto efficiency and the dummy axiom (see Winter (2002) for details). On the other hand, Shapley (1953) made the additional assumption that $\nu$ is a superadditive function (i.e. $\nu(A\cup B)\geq\nu(A)+\nu(B)$ for all pairs of disjoint sets $A,B$ ), which was later relaxed by Dubey (1975).

In the phylogenetic setting, $\nu(S)$ is taken to be the phylogenetic diversity of $S$ on $T$ , denoted by $PD_{T}(S)$ , and defined as the sum of lengths of the edges in the minimal subtree of $T$ that contains $S$ and the root of $T$ (cf. Faith (1992)). Note that PD is not a superadditive function. In fact, it is submodular, satisfying the property that $\nu(A\cup B)\leq\nu(A)+\nu(B)-\nu(A\cap B)$ for all $A,B$ (cf. Proposition 6.13 in Steel (2016)). As an example, for the tree $T^{\prime}$ depicted in Fig. 1(b), and the subset $S=\{1,2,4\}$ of leaves, we have $PD_{T^{\prime}}(S)=11$ .

Considering the leaf set $[n]$ of a rooted phylogenetic tree $T$ as the set of players and phylogenetic diversity as the characteristic function of a game, Eqn. (18) becomes:

[TABLE]

Note that in contrast to the previous two sections we are not assuming in this section that $T$ is a binary tree.

In an important paper, Fuchs and Jin (2015) proved that the Shapley value and the Fair Proportion index on rooted phylogenetic trees agree (see also Steel (2016) and Stahn (2017)).

Theorem 4.2 (Fuchs and Jin (2015))

The Fair Proportion index and the Shapley value are identical on rooted phylogenetic trees, i.e. for all $i\in X$ :

$$FP_{T}(i)=SV_{T}(i).$$
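The agreement of FP and SV on rooted trees can be confirmed by brute force on a small example. The sketch below assumes a hypothetical encoding in which each edge is given by the set of leaves descended from it together with its length, and it evaluates the Shapley value in its standard subset form (average marginal contribution). The example tree is non-binary and its edge lengths are arbitrary.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def rooted_pd(edges, S):
    """Rooted PD of leaf set S: total length of the edges that have
    at least one descendant leaf in S."""
    return sum(length for cluster, length in edges if cluster & S)

def shapley(edges, n):
    """Shapley value of each leaf for the rooted PD game (exhaustive)."""
    players = range(1, n + 1)
    sv = {}
    for i in players:
        others = [j for j in players if j != i]
        total = Fraction(0)
        for size in range(n):
            weight = Fraction(factorial(size) * factorial(n - size - 1),
                              factorial(n))
            for C in combinations(others, size):
                S = frozenset(C)
                total += weight * (rooted_pd(edges, S | {i})
                                   - rooted_pd(edges, S))
        sv[i] = total
    return sv

def fair_proportion(edges, n):
    """FP: each edge length is shared equally among its descendant leaves."""
    return {i: sum(Fraction(length, len(cluster))
                   for cluster, length in edges if i in cluster)
            for i in range(1, n + 1)}

# Tree ((1,2,3),(4,5)) as (descendant leaf set, integer edge length) pairs.
edges = [({1}, 1), ({2}, 2), ({3}, 4), ({1, 2, 3}, 3),
         ({4}, 2), ({5}, 5), ({4, 5}, 1)]
sv = shapley(edges, 5)
fp = fair_proportion(edges, 5)
print(sv == fp, fp)
```

The exhaustive computation visits all subsets, so it is only practical for small $n$; FP itself needs a single pass over the edges.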

In the following we will use this result to show that SV is the unique value satisfying Axioms 1–4 for the sub-class of games induced by a rooted tree $T$ and the phylogenetic diversity function. This is not obvious since (as noted by Haake et al. (2008) in the setting of PD on unrooted trees), the class of games based on PD on a rooted tree is smaller than the class of all games (for which Theorem 4.1 states that SV is unique). Apart from SV there might be other functions that satisfy these 4 axioms for this smaller class of games, and so SV might not be uniquely determined by them. In Theorem 4.3, however, we show that SV is still uniquely characterised by the 4 axioms for this smaller class of games. Haake et al. (2008), by contrast, introduced an additional axiom to obtain their characterization (Theorem 9 of that paper).

Let $\mathcal{T}_{[n],PD_{T}}$ denote the class of games induced by a rooted phylogenetic tree $T$ with leaf set $[n]$ and non-negative edge lengths, and the phylogenetic diversity function on $T$ . Moreover, let a pair $([n],PD_{T})$ denote a PD game. Note that such a pair can be represented as a linear combination of so-called basis games $PD_{T_{e}}$ (for $e\in E(T)$ ), where $PD_{T_{e}}$ corresponds to the $PD$ game on tree $T_{e}$ , in which edge $e$ has length 1 and all other edges have length 0. It can be shown that the family $(PD_{T_{e}})_{e\in E(T)}$ is linearly independent and forms a basis of $\mathcal{T}_{[n],PD_{T}}$ of dimension $|E(T)|$ .

The following theorem provides an axiomatic characterization of SV for games in $\mathcal{T}_{[n],PD_{T}}$ .

Theorem 4.3

There is a unique function

[TABLE]

that satisfies Axioms 1–4. This function coincides with the Shapley value, i.e. $\psi_{PD_{T}}(i)=SV_{T}(i)$ for all $i\in[n]$ .

Proof

By Theorem 4.1, SV satisfies Axioms 1–4.

Now, let $([n],PD_{T})$ be a PD game and let $\psi_{PD_{T}}$ satisfy all Axioms 1–4. We first consider a basis game $PD_{T_{e}}$ and determine $\psi_{PD_{T_{e}}}$ .

Let $N(e)$ denote the set of leaves descended from $e$ and let $n(e)=|N(e)|$ . Then, all leaves not in $N(e)$ are dummy players, as for all $j\in[n]\setminus N(e)$ , we have that $PD_{T_{e}}(C\cup\{j\})=PD_{T_{e}}(C)$ for all $C\subseteq[n]\setminus\{j\}$ . As $\psi_{PD_{T_{e}}}$ satisfies the dummy axiom, this implies that $\psi_{PD_{T_{e}}}(j)=0$ for all $j\in[n]\setminus N(e)$ . On the other hand, all leaves in $N(e)$ are symmetric players as for any pair $i,j\in N(e)$ (with $i\neq j$ ), we have that $PD_{T_{e}}(C\cup\{i\})=PD_{T_{e}}(C\cup\{j\})=1$ holds for all subsets $C$ of $[n]\setminus\{i,j\}$ . As this holds for all pairs $i,j\in N(e)$ and as $\psi_{PD_{T_{e}}}$ satisfies symmetry, we can conclude that $\psi_{PD_{T_{e}}}(i)=\psi_{PD_{T_{e}}}(j)$ for all $i\neq j\in N(e)$ . Finally, since $\psi_{PD_{T_{e}}}$ satisfies efficiency, we have

$$\sum_{i\in[n]}\psi_{PD_{T_{e}}}(i)=PD_{T_{e}}([n])=1,$$

which – using symmetry – implies that $\psi_{PD_{T_{e}}}(i)=\frac{1}{n(e)}$ for all $i\in N(e)$ . To summarize, $\psi_{PD_{T_{e}}}(i)=\frac{1}{n(e)}$ for all $i\in N(e)$ and $\psi_{PD_{T_{e}}}(j)=0$ for all $j\in[n]\setminus N(e)$ . It is easily verified that these values coincide with the FP index and thus with the SV (by Theorem 4.2).

Analogously, one can show that $\psi_{PD_{T}}$ is a linear function. As $\psi_{PD_{T}}$ satisfies Axiom 4, it is additive. Moreover, for all $\lambda\in\mathbb{R}_{\geq 0}$ , let $\lambda\cdot PD_{T_{e}}$ denote the PD game on tree $T_{e}$ , in which edge $e$ has length $\lambda\cdot 1=\lambda$ and all other edges have length 0. Then, using the same notation and reasoning as above, we have for all $i\in[n]\setminus N(e)$ , $\psi_{\lambda PD_{T_{e}}}(i)=0$ , and for all $i\in N(e)$ , $\psi_{\lambda PD_{T_{e}}}(i)=\lambda/n(e)$ . Comparing this with $\psi_{PD_{T_{e}}}(i)$ from above, it is now easy to see that we have $\psi_{\lambda\,PD_{T_{e}}}(i)=\lambda\cdot\psi_{PD_{T_{e}}}(i)$ for all $\lambda\in\mathbb{R}_{\geq 0}$ and all $i\in[n]$ .

Together with the additivity of $\psi$ and SV this implies that $\psi$ coincides with SV for all games in $\mathcal{T}_{[n],PD_{T}}$ .

Remark 1

Since SV is the unique index satisfying Pareto efficiency, symmetry, the dummy axiom and additivity for the class of games induced by a rooted tree and PD (by Theorem 4.3), and since SV and FP agree for rooted trees (by Theorem 4.2) and, in general, ES $\neq$ FP, it follows that ES must violate at least one of these four axioms. It can easily be checked that ES satisfies Pareto efficiency, additivity and the dummy axiom, but it may violate symmetry. An example is given in Figure 6, where we have $ES_{T}(1)\neq ES_{T}(3)$ , even though $PD_{T}(C\cup\{1\})=PD_{T}(C\cup\{3\})$ for all $C\subseteq[4]\setminus\{1,3\}$ (on the other hand, $FP_{T}(1)=FP_{T}(3)$ ).
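A violation of symmetry by ES can also be exhibited on a basis game. The sketch below uses a hypothetical four-leaf caterpillar (((1,2),3),4) in which only the edge above leaves 1, 2 and 3 has length 1 and all other edges have length 0; this is an illustrative instance and not necessarily the tree of Figure 6. The nested-list encoding and helper names are assumptions.

```python
from fractions import Fraction
from itertools import combinations

def leaves(node):
    """Leaf labels below a node; any non-list node is a leaf."""
    if not isinstance(node, list):
        return [node]
    return [x for child, _ in node for x in leaves(child)]

def fp_and_es(tree):
    """FP and ES of every leaf; tree is nested lists of (child, length)."""
    fp, es = {}, {}
    def walk(node, fp_acc, es_acc):
        if not isinstance(node, list):
            fp[node], es[node] = fp_acc, es_acc
            return
        for child, length in node:
            walk(child,
                 fp_acc + Fraction(length) / len(leaves(child)),
                 es_acc / len(node) + Fraction(length))
    walk(tree, Fraction(0), Fraction(0))
    return fp, es

def pd(node, S):
    """Rooted PD of S: total length of edges with a descendant leaf in S."""
    if not isinstance(node, list):
        return 0
    return sum((length if set(leaves(child)) & S else 0) + pd(child, S)
               for child, length in node)

# Caterpillar (((1,2),3),4); only the edge above {1,2,3} has length 1.
tree = [([([(1, 0), (2, 0)], 0), (3, 0)], 1), (4, 0)]
fp, es = fp_and_es(tree)

# Leaves 1 and 3 are interchangeable in the PD game: adding either one
# to any coalition C of the other leaves gives the same PD.
others = [2, 4]
interchangeable = all(
    pd(tree, set(C) | {1}) == pd(tree, set(C) | {3})
    for r in range(len(others) + 1)
    for C in combinations(others, r))

print(interchangeable, fp[1], fp[3], es[1], es[3])
```

Here FP assigns $1/3$ to each of the leaves 1, 2 and 3, while ES assigns $1/2$ to leaf 3 and $1/4$ to each of leaves 1 and 2, although leaves 1 and 3 are symmetric players.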

5 Diversity indices for unrooted trees

We now consider phylogenetic diversity indices for unrooted trees. An unrooted tree $T$ with leaf set $X$ is said to be an unrooted phylogenetic $X$ –tree if each non-leaf vertex is unlabelled and has degree at least 3 (two such trees are considered equivalent if there is a graph isomorphism between them that sends leaf $x$ to leaf $x$ for each $x\in X$ ). In the case where all non-leaf vertices in $T$ have degree exactly equal to 3, $T$ is said to be binary. Background on the basic combinatorics of unrooted phylogenetic trees can be found in Steel (2016).

Let $T$ be an unrooted phylogenetic tree (not necessarily binary) with leaf set $X=[n]$ and let all edges $e$ have non-negative edge lengths $l(e)$ . For a subset $Y$ of the leaves, the (unrooted) phylogenetic diversity of $Y$ is defined as the sum of the edge lengths of the minimal subtree connecting the leaves in $Y$ . Note that $PD_{T}(\{i\})=0$ for all $i\in[n]$ and $PD_{T}([n])$ is the total sum of edge lengths of $T$ (i.e. $\sum_{e}l(e)$ ). For a leaf $i\in[n]$ and an edge $e$ of $T$ , let $I(T;i,e)$ be the set of interior vertices of $T$ in the path in $T$ from $i$ to edge $e$ (including the first vertex of $e$ that is reached, but not the second), and for each vertex $v$ of $T$ let $d(v)$ denote the degree of $v$ . For each edge $e$ of $T$ , let

[TABLE]

where we adopt the convention that if $I(T;i,e)=\emptyset$ (i.e. $e$ is a pendant edge incident with leaf $i$ ) then $\prod\limits_{v\in I(T;i,e)}\frac{1}{d(v)-1}=1$ and hence $\mu(i,e)=1/2$ .

5.1 Unrooted Equal Splits

In this section, we develop a version of Equal Splits for unrooted trees. Recall that for rooted trees, the definition of the ES index is $ES_{T}(i)=\sum\limits_{e\in P(T;\rho,i)}\frac{1}{\Pi(e,i)}l(e)$ , where $\Pi(e,i)=1$ if $e$ is a pendant edge incident with $i$ ; otherwise, if $e=(u,v)$ is an interior edge, then $\Pi(e,i)$ is the product of the out-degrees of the interior vertices on the directed path from $v$ to leaf $i$ .

This definition does not directly apply to unrooted trees, since there is no reference root vertex $\rho$ in an unrooted tree. Moreover, introducing a phantom root vertex in an unrooted tree results in different ES index values, depending on where the phantom root is inserted. Nevertheless, we can define a canonical unrooted version of ES that is a diversity index as follows.

Let

[TABLE]

where the summation is over all edges of $T$ and where

[TABLE]

Note that $\mu(i,e)$ is the expression introduced in Eqn. (19). Moreover, note that in contrast to the rooted setting, $\varphi_{\rm ES}(i)$ is defined as a sum over all edges of $T$ and not only over edges on a certain path in $T$ . In fact, even though pendant edges not incident with leaf $i$ do not contribute to $\varphi_{\rm ES}(i)$ (since $\mu_{\rm ES}(i,e)=0$ in that case), the edges that do contribute do not necessarily form a path in $T$ (cf. Fig. 7).

Theorem 5.1

For any unrooted phylogenetic tree $T$ , $\varphi_{\rm ES}$ is a diversity index for $T$ . In other words:

[TABLE]

In order to prove this theorem, we require the following technical lemmas:

Lemma 2

Suppose that $T$ is a rooted phylogenetic tree with leaf set $Y$ and root vertex $u$ . Let $d^{-}(v)$ denote the out-degree of vertex $v$ . We then have:

[TABLE]

Proof

We use a simple probabilistic argument. Consider a random walk, starting from the root vertex $u$ and proceeding towards the leaves. At each interior vertex $v$ , one of the $d^{-}(v)$ child vertices of $v$ is chosen uniformly at random (and independently of earlier choices). In this way, the probability $p_{i}$ of arriving at leaf $i$ is simply $\prod_{v\in I(T;u,i)}\frac{1}{d^{-}(v)}$ . Since we always arrive at one (and only one) leaf of $Y$ by this process, $\sum_{i\in Y}p_{i}=1$ , as required.

$\Box$
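The random-walk argument translates directly into a short computation. The sketch below encodes a rooted tree as nested lists (an illustrative convention) and multiplies the factors $1/d^{-}(v)$ along each root-to-leaf path; the example tree is arbitrary and deliberately non-binary.

```python
from fractions import Fraction

def leaf_probabilities(node, p=Fraction(1)):
    """Probability of reaching each leaf when every interior vertex
    picks one of its children uniformly at random."""
    if not isinstance(node, list):
        return {node: p}
    probs = {}
    for child in node:
        # Each child of this vertex is chosen with probability 1/out-degree.
        probs.update(leaf_probabilities(child, p / len(node)))
    return probs

# Root with three children: a vertex of out-degree 3, a leaf, and a
# binary subtree.
tree = [["a", "b", "c"], "d", [["e", "f"], "g"]]
probs = leaf_probabilities(tree)
print(probs, sum(probs.values()))
```

The probabilities sum to 1 for any tree shape, which is exactly the identity of Lemma 2.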

Corollary 1

Let $T$ be an unrooted phylogenetic tree with leaf set $[n]$ and let $e=\{u,v\}$ be an arbitrary edge of $T$ . Let $A$ and $B$ denote the subsets of leaves of $T$ that lie on each side of $e$ , with $A$ being closer to $u$ (if $u$ is a leaf, then $A=\{u\}$ ) and $B$ being closer to $v$ (again, if $v$ is a leaf, then $B=\{v\}$ ). In this case:

[TABLE]

Proof

Clearly, we only have to prove the first statement, so consider $i\in A$ . If $|A|=1$ (which implies $i=u$ ), $I(T;i,e)=\emptyset$ , and we again adopt the convention that in this case $\prod\limits_{v\in I(T;i,e)}\frac{1}{d(v)-1}=1$ . In particular, the claimed statement holds for $|A|=1$ . Next, consider $|A|>1$ . Then the expression

[TABLE]

can also be written as:

[TABLE]

where $T_{A}$ is the rooted phylogenetic tree on leaf set $A$ and root vertex $u$ obtained from $T$ by deleting edge $e$ and the subtree of $T$ with leaf set $B$ . The corollary now follows from Lemma 2 by taking $T_{A}$ as the tree in that lemma, and $Y=A$ . Note that the statement can alternatively be shown without the use of Lemma 2 by using an inductive argument.

$\Box$

Lemma 3

Suppose that the linear equation

[TABLE]

with $a(e),b(e)\in{\mathbb{R}}$ , holds for all choices of $l$ of the form $l=l_{e^{\prime}}$ where $e^{\prime}\in E$ and

[TABLE]

Eqn. (22) then holds for all choices of $l\in{\mathbb{R}}^{E}$ .

Proof

The proof involves simple linear algebra. Let $c(e):=a(e)-b(e)$ . Eqn. (22) can then be rewritten as $\sum_{e\in E}c(e)l(e)=0$ . Suppose this equation holds whenever $l=l_{e^{\prime}}$ (and for each choice of $e^{\prime}$ ). Then this equation becomes $c(e^{\prime})\cdot 1=0$ , and since this holds for all choices of $e^{\prime}$ , all the $c$ –coefficients are zero, which gives the result.

$\Box$

We are now in the position to prove Theorem 5.1.

Proof of Theorem 5.1: By Lemma 3, it suffices to establish Eqn. (20) when $l$ assigns length 1 to an arbitrary edge $e^{\prime}=\{u,v\}$ and 0 to all other edges. Then $PD_{T}([n])=1$ and the left hand side of Eqn. (20) is $\sum_{i\in[n]}\mu_{\rm ES}(i,e^{\prime})$ . Our aim then is to show that this last quantity is always equal to 1. This is true by definition of $\mu_{\rm ES}$ when $e^{\prime}$ is a pendant edge, so we may suppose that $e^{\prime}$ is an interior edge. In that case, let $A$ and $B$ denote the subsets of leaves of $T$ that lie on each side of $e^{\prime}$ , with $A$ being closer to $u$ than $v$ and $B$ being closer to $v$ than $u$ (thus $A\cup B=[n]$ , $A\cap B=\emptyset$ and $|A|,|B|\geq 2$ ). Since $\mu_{\rm ES}(i,e^{\prime})=\mu(i,e^{\prime})$ (as $e^{\prime}$ is an interior edge) we have:

[TABLE]

and

[TABLE]

where the last equality follows from Corollary 1. A similar argument shows that $\sum_{i\in B}2\mu(i,e^{\prime})=1$ , and so, by Eqn. (23), we obtain the required equality:

[TABLE]

$\Box$
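Theorem 5.1 can be checked on a concrete unrooted tree. The sketch below rests on two assumptions inferred from the surrounding text rather than quoted from it: that Eqn. (19) reads $\mu(i,e)=\frac{1}{2}\prod_{v\in I(T;i,e)}\frac{1}{d(v)-1}$ (consistent with $\mu(i,e)=1/2$ when $I(T;i,e)=\emptyset$ ), and that $\mu_{\rm ES}(i,e)$ equals 1 on the pendant edge at $i$ , 0 on all other pendant edges, and $\mu(i,e)$ on interior edges. The edge-list encoding and the example tree (at least three leaves, one vertex of degree 4) are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

def unrooted_es(edges):
    """Unrooted Equal Splits value of every leaf (assumed coefficients,
    see the lead-in). edges is a list of (u, v, length) triples."""
    adj = defaultdict(list)
    for u, v, length in edges:
        adj[u].append((v, Fraction(length)))
        adj[v].append((u, Fraction(length)))
    phi = {}
    for i in [x for x in adj if len(adj[x]) == 1]:      # the leaves
        first, l0 = adj[i][0]
        total = l0                     # pendant edge at i has coefficient 1
        stack = [(first, i, Fraction(1))]
        while stack:
            x, parent, prod = stack.pop()
            if len(adj[x]) == 1:       # another leaf: nothing beyond it
                continue
            prod_x = prod / (len(adj[x]) - 1)   # x joins I(T; i, e)
            for y, length in adj[x]:
                if y == parent:
                    continue
                if len(adj[y]) > 1:    # interior edge: coefficient mu(i, e)
                    total += prod_x / 2 * length
                stack.append((y, x, prod_x))
        phi[i] = total
    return phi

# Leaves 1..6; interior vertices a (degree 4), b and c (degree 3).
edges = [(1, "a", 1), (2, "a", 2), (3, "a", 3), ("a", "b", 4),
         (4, "b", 5), ("b", "c", 6), (5, "c", 7), (6, "c", 8)]
phi = unrooted_es(edges)
total_length = sum(length for _, _, length in edges)
print(phi, sum(phi.values()), total_length)
```

For leaf 4 the contributing edges are its pendant edge and the two interior edges at vertex b, which together form a star rather than a path, as noted after the definition of $\varphi_{\rm ES}$ .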

5.2 A diversity index related to the Pauplin representation of phylogenetic diversity

$PD_{T}([n])$ can also be expressed as a positive linear combination of the pairwise distances $d(i,j)=\sum_{e\in P(T;i,j)}l(e)$ between leaves $i$ and $j$ in various ways, one of them being the following representation described by Semple and Steel (2004):

[TABLE]

where

[TABLE]

and where $I(T;i,j)$ denotes the set of interior vertices on the path from $i$ to $j$ in $T$ .

Although this representation holds for general trees (not only binary ones), for binary trees, this expression is also known as the Pauplin representation of phylogenetic diversity (cf. Pauplin (2000)). In the following section, we will further analyse this representation and suggest that it leads to yet another possible unrooted PD index. Let

[TABLE]

where the summation is over all edges of $T$ and $\mu(i,e)$ is the expression introduced in Eqn. (19).

Theorem 5.2

Let $T$ be an unrooted phylogenetic tree with leaf set $[n]$ and let $i$ be a leaf of $T$ . In that case:

[TABLE]

In other words, $\varphi_{\rm Pa}$ is closely related to the Pauplin representation of PD given in Eqn. (24). Moreover, $\varphi_{\rm Pa}$ is a diversity index (i.e. $\sum_{i\in[n]}\varphi_{\rm Pa}(i)=PD_{T}([n])$ ).

Proof

Let $i\in[n]$ be a leaf of $T$ . By Lemma 3 it suffices to establish Eqn. (25) when $l$ assigns length 1 to an arbitrary edge $e^{\prime}=\{u,w\}$ and 0 to all other edges. Note that the removal of edge $e^{\prime}$ splits $T$ into two subtrees. Let $C$ (=‘close’) denote the leaf set of the subtree that contains leaf $i$ and let $F$ (=‘far’) denote the leaf set of the other subtree. Now, for all leaves $j\neq i$ we clearly have:

[TABLE]

Thus, we have for the right-hand side of Equation (25)

[TABLE]

As $e^{\prime}=\{u,w\}$ lies on the path from $i$ to $j$ , the term on the right of this last equation can also be written as:

[TABLE]

where the last equality follows from applying Corollary 1. On the other hand, for the left-hand side of Equation (25), we have:

[TABLE]

as edge $e^{\prime}=\{u,w\}$ has length 1, while all other edges have length 0, which completes the proof of Eqn. (25). The claim that $\varphi_{\rm Pa}$ is a diversity index is now a direct consequence of Eqn. (24).

$\Box$

5.3 Unrooted Fair Proportion

Similar to the Equal Splits index, the Fair Proportion index has so far only been considered for rooted trees. In the following, we suggest two canonical extensions of Fair Proportion to unrooted trees. Recall that for rooted trees, the definition of FP is $FP_{T}(i)=\sum\limits_{e\in P(T;\rho,i)}\frac{1}{n(e)}l(e)$ , where $n(e)$ is the number of leaves descended from $e$ . Note that the removal of edge $e$ splits $T$ into two connected components and $n(e)$ is the number of leaves of $T$ in the connected component that contains $i$ . This concept can be extended to unrooted trees as follows.

For a leaf $i\in[n]$ and an edge $e$ of $T$ , let $c(i,e)$ denote the size of the set of leaves that lie on the same side of $e$ as $i$ . Let

[TABLE]

and let

[TABLE]

where the summation is over all edges of $T$ and where

[TABLE]

Theorem 5.3

For any unrooted phylogenetic tree $T$ , $\varphi_{\rm FP}$ and $\tilde{\varphi}_{\rm FP}$ are diversity indices for $T$ . In other words,

[TABLE]

Proof

We first establish Eqn. (26). By Lemma 3, it suffices to establish Eqn. (26) when $l$ assigns length 1 to an arbitrary edge $e^{\prime}=\{u,v\}$ and 0 to all other edges. Then, $PD_{T}([n])=1$ and the left hand side of Eqn. (26) is $\frac{1}{2}\sum_{i\in[n]}\frac{1}{c(i,e^{\prime})}$ . Now, let $A$ and $B$ denote the subsets of leaves that lie on each side of $e^{\prime}$ (i.e. $A\cup B=[n]$ , $A\cap B=\emptyset$ and $|A|,\,|B|\geq 1$ ), in which case:

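$$\frac{1}{2}\sum_{i\in[n]}\frac{1}{c(i,e^{\prime})}=\frac{1}{2}\left(\sum_{i\in A}\frac{1}{|A|}+\sum_{i\in B}\frac{1}{|B|}\right)=\frac{1}{2}(1+1)=1=PD_{T}([n]).$$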

Eqn. (27) follows from a similar argument by noting that the left hand side of this equation becomes $\sum_{i\in[n]}\mu_{\rm FP}(i,e^{\prime})$ . If $e^{\prime}$ is a pendant edge, this quantity is equal to 1 by definition of $\mu_{\rm FP}$ and if $e^{\prime}$ is an interior edge, the same reasoning as in the proof of Eqn. (26) establishes $\sum_{i\in[n]}\mu_{\rm FP}(i,e^{\prime})=1=PD_{T}([n])$ .

$\Box$
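The first part of this argument can be checked numerically. The sketch below covers only the index whose coefficient is $\frac{1}{2c(i,e)}$ on every edge, as used in the proof of Eqn. (26). It assumes a hypothetical encoding in which each edge of the unrooted tree is given by the leaf set on one of its sides together with an integer length; the quartet tree and its lengths are arbitrary.

```python
from fractions import Fraction

def same_side(i, side, n):
    """c(i, e): number of leaves on the same side of the edge as leaf i."""
    return len(side) if i in side else n - len(side)

def unrooted_fp(splits, n):
    """Index with coefficient 1 / (2 c(i, e)) on every edge e."""
    return {i: sum(Fraction(length, 2 * same_side(i, side, n))
                   for side, length in splits)
            for i in range(1, n + 1)}

# Quartet tree 12|34: four pendant edges and one interior edge, each
# given as (leaf set on one side, integer length).
splits = [({1}, 1), ({2}, 2), ({3}, 3), ({4}, 5), ({1, 2}, 4)]
phi_fp = unrooted_fp(splits, 4)
total_length = sum(length for _, length in splits)
print(phi_fp, sum(phi_fp.values()), total_length)
```

Each edge contributes $\frac{1}{2}$ of its length to each of its two sides, shared equally within a side, so the values add up to the total length.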

5.4 Summary of unrooted diversity indices

In the preceding sections we have presented canonical extensions of Equal Splits and Fair Proportion to unrooted trees and have also introduced a diversity index closely related to the Pauplin representation of phylogenetic diversity. Although all these indices appear to be new, an unrooted Shapley value has long been known in the literature. In fact, even though the Shapley value is frequently used for rooted trees, it was first defined and introduced for unrooted trees by Haake et al. (2008) and can be expressed as follows:

[TABLE]

where the summation is over all edges of $T$ , $c(i,e)$ is again the number of leaves that lie on the same side of $e$ as leaf $i$ , and $f(i,e)$ is the number of leaves that lie on the other side of $e$ (cf. Theorem 4 in Haake et al. (2008)). Recall that for rooted trees, FP and SV are equivalent, so one might argue that the unrooted SV can be considered an unrooted analogue of FP. It turns out, however, that there exists a natural extension of FP to unrooted trees, that is different from unrooted SV.

In fact, although all of the unrooted diversity indices discussed above can be expressed as linear functions of the edge lengths $l(e)$ of $T$ with coefficients that are independent of $l$ , these coefficients differ among indices (cf. Table 1) and the indices are, in general, not equivalent (cf. Figure 7).
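That the unrooted SV and the unrooted FP-type index differ can be seen on the smallest binary example. The sketch below computes the Shapley value of the unrooted PD game by exhaustive enumeration and compares it with the closed form $\frac{1}{n}\sum_{e}\frac{f(i,e)}{c(i,e)}l(e)$ . This closed form is stated here as an assumption about the expression of Haake et al. (2008), and the code checks it against the enumeration. The split encoding of edges and the quartet tree are illustrative choices.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def unrooted_pd(splits, Y):
    """Unrooted PD of Y: total length of the edges that separate
    two members of Y."""
    return sum(length for side, length in splits if Y & side and Y - side)

def shapley(splits, n):
    """Shapley value of the unrooted PD game by exhaustive enumeration."""
    players = range(1, n + 1)
    sv = {}
    for i in players:
        others = [j for j in players if j != i]
        total = Fraction(0)
        for size in range(n):
            weight = Fraction(factorial(size) * factorial(n - size - 1),
                              factorial(n))
            for C in combinations(others, size):
                S = frozenset(C)
                total += weight * (unrooted_pd(splits, S | {i})
                                   - unrooted_pd(splits, S))
        sv[i] = total
    return sv

def coefficient_index(splits, n, coeff):
    """Linear index sum_e coeff(c, f) * l(e), with c and f the numbers of
    leaves on the same and on the other side of e as leaf i."""
    out = {}
    for i in range(1, n + 1):
        total = Fraction(0)
        for side, length in splits:
            c = len(side) if i in side else n - len(side)
            total += coeff(c, n - c) * length
        out[i] = total
    return out

n = 4
splits = [({1}, 1), ({2}, 2), ({3}, 3), ({4}, 5), ({1, 2}, 4)]
sv_brute = shapley(splits, n)
sv_formula = coefficient_index(splits, n, lambda c, f: Fraction(f, c * n))
fp_unrooted = coefficient_index(splits, n, lambda c, f: Fraction(1, 2 * c))
print(sv_brute == sv_formula, sv_formula[1], fp_unrooted[1])
```

Both indices distribute the total length of 15 among the four leaves, but already on this quartet they assign different values to leaf 1.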

6 Concluding Remarks

Phylogenetic diversity indices play a key role in biodiversity, so it is helpful to understand how the different indices are related. In this paper, we asked just how different they can be for rooted trees (in an extreme sense, rather than on average). We also considered how some of the natural indices in the rooted settings extend to the unrooted setting, and further explored the way in which the Shapley value relates to rooted and unrooted indices. Our work suggests two broad questions that may be interesting to explore in future work. First, how do the results in Sections 2 and 3 extend if we lift the assumption that the underlying trees are binary? Second, for the unrooted indices in Section 5, how different can they be from one another (in the sense we considered in Section 2) and for which trees are certain indices identical (in the sense we considered in Section 3)? Moreover, as all unrooted indices apart from the unrooted SV appear to be new, it additionally might be of interest to analyse their biological interpretation and relevance for conservation decisions.

7 Acknowledgements

We thank Arne Mooers for a number of helpful suggestions, and the two anonymous reviewers for detailed comments on an earlier version of this manuscript. We also thank François Bienvenu for pointing out an alternative proof of Lemma 1, and Mareike Fischer for helpful comments concerning Section 4. The first author also thanks the German Academic Scholarship Foundation for a doctoral scholarship.

Appendix: Proof of Lemma 1

Proof of Lemma 1: We first establish the following identity by application of the ‘fundamental theorem of calculus’. Let $f:[0,h]\rightarrow[0,1]$ be any continuous function and let $c>0$ . We then have:

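$$\int_{1}^{h}f(x)\cdot e^{-c\int_{0}^{x}f(t)dt}dx=\frac{1}{c}\left(e^{-c\int_{0}^{1}f(t)dt}-e^{-c\int_{0}^{h}f(t)dt}\right).\tag{28}$$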

To establish (28), let $G(x)=\exp(-c\int_{0}^{x}f(t)dt)$ . Since $f$ is continuous, $G^{\prime}(x)=-cf(x)G(x)$ , so the left-hand side of Eqn. (28) can be written as $\frac{-1}{c}\int_{1}^{h}G^{\prime}(x)dx=\frac{1}{c}(G(1)-G(h)),$ which gives Eqn. (28).

Now, for all $x\geq 1$ , $\int_{0}^{x-1}f(t)dt\geq\int_{0}^{x}f(t)dt-1$ , since $f$ takes values in the interval $[0,1]$ , and thus (28) gives:

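$$\int_{1}^{h}f(x)\cdot e^{-c\int_{0}^{x-1}f(t)dt}dx\leq e^{c}\int_{1}^{h}f(x)\cdot e^{-c\int_{0}^{x}f(t)dt}dx\leq\frac{e^{c}}{c}\cdot e^{-c\int_{0}^{1}f(t)dt}.$$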

Taking $c=\ln(2)$ in this last inequality gives:

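$$\int_{1}^{h}f(x)\cdot 2^{-\int_{0}^{x-1}f(t)dt}dx\leq\frac{2}{\ln 2}\cdot 2^{-\int_{0}^{1}f(t)dt}.\tag{29}$$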

Let $g$ be a piecewise continuous function that takes the value $x_{i}$ on the open interval $(i,i+1)$, for each $i=0,1,\ldots,h-1$, and let $f_{j},j\geq 1$, be a sequence of continuous functions that converges in the $L^{2}$ norm to $g$ (e.g. by Fourier series). As $j\rightarrow\infty$, $\int_{1}^{h}f_{j}(x)\cdot e^{-c\int_{0}^{x-1}f_{j}(t)dt}dx$ then converges to $\sum_{i=1}^{h}x_{i}2^{-\sum_{k<i}x_{k}}$ and $\frac{2}{\ln 2}\cdot 2^{-\int_{0}^{1}f_{j}(t)dt}$ converges to $\frac{2}{\ln 2}\cdot 2^{-x_{0}}$. Inequality (29) now establishes the lemma. $\Box$
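The two analytic steps of this proof can be checked numerically. The sketch below is only a sanity check under stated assumptions, not part of the proof: it takes the sample function $f(x)=\frac{1}{2}(1+\sin x)$, which is continuous with values in $[0,1]$, and compares $\int_{1}^{h}f(x)e^{-c\int_{0}^{x}f(t)dt}dx$ with $\frac{1}{c}(G(1)-G(h))$, and then $\int_{1}^{h}f(x)2^{-\int_{0}^{x-1}f(t)dt}dx$ with the bound $\frac{2}{\ln 2}\cdot 2^{-\int_{0}^{1}f(t)dt}$. The helper `midpoint`, the choice of $f$ and the value $h=6$ are our own.

```python
import math


def midpoint(func, a, b, steps=20000):
    """Composite midpoint rule for the integral of func over [a, b]."""
    width = (b - a) / steps
    return width * sum(func(a + (k + 0.5) * width) for k in range(steps))


def f(x):
    # Sample integrand: continuous, with values in [0, 1].
    return 0.5 * (1.0 + math.sin(x))


def F(x):
    # Closed form of the integral of f over [0, x].
    return 0.5 * (x + 1.0 - math.cos(x))


h = 6.0
c = math.log(2)

# Identity: integral of f(x) * exp(-c F(x)) over [1, h] equals (G(1) - G(h)) / c,
# where G(x) = exp(-c F(x)).
lhs_identity = midpoint(lambda x: f(x) * math.exp(-c * F(x)), 1.0, h)
rhs_identity = (math.exp(-c * F(1.0)) - math.exp(-c * F(h))) / c

# Bound with c = ln 2: the shifted integral is at most (2 / ln 2) * 2^(-F(1)).
lhs_bound = midpoint(lambda x: f(x) * 2.0 ** (-F(x - 1.0)), 1.0, h)
rhs_bound = 2.0 / math.log(2) * 2.0 ** (-F(1.0))

print(abs(lhs_identity - rhs_identity) < 1e-6, lhs_bound <= rhs_bound)
```

A closed-form antiderivative is used for the inner integral so that only the outer integral is approximated numerically.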

Bibliography

Dubey, P., 1975. On the uniqueness of the Shapley value. International Journal of Game Theory 4, 131–139. doi: 10.1007/BF01780630.
Faith, D.P., 1992. Conservation evaluation and phylogenetic diversity. Biological Conservation 61, 1–10. doi: 10.1016/0006-3207(92)91201-3.
Fuchs, M., Jin, E.Y., 2015. Equality of Shapley value and fair proportion index in phylogenetic trees. Journal of Mathematical Biology 71, 1133–1147.
Haake, C.J., Kashiwada, A., Su, F.E., 2008. The Shapley value of phylogenetic trees. Journal of Mathematical Biology 56, 479–497. doi: 10.1007/s00285-007-0126-2.
Isaac, N., Turvey, S.T., Collen, B., Waterman, C., Baillie, J., 2007. Mammals on the EDGE: Conservation priorities based on threat and phylogeny. PLoS One 2, e296.
Pauplin, Y., 2000. Direct calculation of a tree length using a distance matrix. Journal of Molecular Evolution 51, 41–47. doi: 10.1007/s002390010065.
Redding, D.W., 2003. Incorporating genetic distinctness and reserve occupancy into a conservation priorisation approach. Master’s thesis. University Of East Anglia, Norwich, UK.
Redding, D.W., Hartmann, K., Mimoto, A., Bokal, D., De Vos, M., Mooers, A.Ø., 2008. Evolutionarily distinctive species often capture more phylogenetic diversity than expected. Journal of Theoretical Biology 251, 606–615. doi: 10.1016/j.jtbi.2007.12.006.
