Exchangeable and Sampling Consistent Distributions on Rooted Binary Trees

Ben Hollering; Seth Sullivant

arXiv:1902.03321·math.CO·March 6, 2019

TL;DR

This paper explores the structure of distributions on rooted binary trees that are both exchangeable and sampling consistent, revealing their geometric properties and introducing new models like the multinomial model.

Contribution

It characterizes the set of exchangeable, sampling consistent distributions as polytopes and introduces the multinomial model for these distributions.

Findings

1. The set of such distributions on n leaves forms a polytope.

2. The infinite sampling consistent distributions on 4 leaves correspond to Aldous' beta-splitting model.

3. A new semialgebraic set called the multinomial model is introduced.

Abstract

We introduce a notion of finite sampling consistency for phylogenetic trees and show that the set of finitely sampling consistent and exchangeable distributions on n leaf phylogenetic trees is a polytope. We use this polytope to show that the set of all exchangeable and infinite sampling consistent distributions on 4 leaf phylogenetic trees is exactly Aldous' beta-splitting model and give a description of some of the vertices for the polytope of distributions on 5 leaves. We also introduce a new semialgebraic set of exchangeable and sampling consistent models we call the multinomial model and use it to characterize the set of exchangeable and sampling consistent distributions.

Equations

$$\pi_{n}(p_{m})(T)=p_{n}^{m}(T)=\sum_{\{S\in RB_{L}(m)\mid T=S|_{[n]}\}}p_{m}(S).$$

$$p_{T}(S)=\begin{cases}\frac{1}{|O(T)|} & \mathrm{shape}(S)=T\\ 0 & \mathrm{shape}(S)\neq T.\end{cases}$$

$$p=\sum_{T\in RB_{U}(n)}p(T)\cdot|O(T)|\cdot p_{T}$$

$$\pi_{n}(p_{m})(T)=\sum_{\{S\in RB_{L}(m)\mid T=S|_{[n]}\}}p_{m}(S)$$

$$EX_{n}^{m}=\pi_{n}(EX_{m}).$$

$$EX_{n}^{m}\subseteq EX_{n}^{k}.$$

$$EX_{n}^{\infty}:=\bigcap_{m=n}^{\infty}EX_{n}^{m}.$$

$$EX_{n}^{m}=\mathrm{conv}(\{\pi_{n}(p_{T}):T\in RB_{U}(m)\}).$$

$$\pi_{n}(p_{m})(S)=\sum_{\{Q\in RB_{L}(m)\mid S=Q|_{[n]}\}}\ \sum_{T\in RB_{U}(m)}p_{m}(T)\cdot|O(T)|\cdot p_{T}(Q)$$

$$\pi_{n}(p_{m})(S)=\sum_{T\in RB_{U}(m)}p_{m}(T)\cdot|O(T)|\sum_{\{Q\in RB_{L}(m)\mid S=Q|_{[n]}\}}p_{T}(Q)$$

$$\pi_{n}(p_{m})(S)=\sum_{T\in RB_{U}(m)}p_{m}(T)\cdot|O(T)|\cdot\pi_{n}(p_{T})(S)$$

$$\pi_{n}(p_{T})(S)=\sum_{\{Q\in RB_{L}(m)\mid S=Q|_{[n]}\}}p_{T}(Q)$$

$$\pi_{n}(p_{T})(S)=\sum_{\{Q\in RB_{L}(m)\mid S=Q|_{[n]},\ \mathrm{shape}(Q)=T\}}\frac{1}{|O(T)|}=\frac{c_{T}(S)}{|O(T)|}.$$

$$\pi_{n}(p_{T})(S^{\prime})=\sum_{S\in O(S^{\prime})}\frac{c_{T}(S)}{|O(T)|}$$

$$q_{n}(i)=a_{n}^{-1}\left(\binom{n}{i}\int_{0}^{1}x^{i}(1-x)^{n-i}\,\nu(dx)+nc\,1_{i=1}\right)$$

$$f(x)=\frac{\Gamma(2\beta+2)}{\Gamma^{2}(\beta+1)}x^{\beta}(1-x)^{\beta}$$

$$q_{n}(i)=a_{n}^{-1}\binom{n}{i}\frac{\Gamma(\beta+i+1)\Gamma(\beta+n-i+1)\Gamma(2\beta+2)}{\Gamma(\beta+n+2)\Gamma^{2}(\beta+1)}$$

$$q_{n}(i)=\frac{\binom{n}{i}(i+\beta)_{i}(n-i+\beta)_{n-i}}{(n+2\beta+1)_{n}-2(n+\beta)_{n}}$$

$$p(Comb_{4})$$

$$p(Bal_{4})$$

$$q_{n-1}(i)=\frac{(n-i)q_{n}(i)+(i+1)q_{n}(i+1)}{n-2q_{n}(1)}$$

$$p_{A}=\binom{n}{m_{A}}\prod_{e\in T}t_{e}^{m_{A}(e)}$$

$$p_{T,t}(S)=\sum_{\substack{A\in M_{n}^{T}\\ T_{A}=S}}p_{A}.$$

$$p_{T,t}(Bal_{5})=\sum_{\substack{A\in M_{5}^{T}\\ T_{A}=Bal_{5}}}p_{A}.$$

$$p_{T,t}(Bal_{5})=\binom{5}{3,2}t_{2}^{3}t_{3}^{2}+\binom{5}{2,3}t_{2}^{2}t_{3}^{3}$$

$$p_{T}:\Delta_{|E(T)|-1}\to EX_{n}^{\infty}$$

$$q_{4}(2)$$

$$m_{2,2^{n}}=\sum_{i=1}^{n-1}2^{n-i-1}\binom{2^{i}}{2}^{2}$$

$$\frac{m_{2,2^{n}}}{\binom{2^{n}}{4}}=\frac{3(2^{n})-5}{7(2^{n})-21}$$

$$b_{5}(T^{\prime})=\binom{i}{2}\binom{n-i}{3}+\binom{i}{3}\binom{n-i}{2}.$$

$$c_{5}(T)-c_{5}(T^{\prime})=(c_{5}(T_{z})-c_{5}(T_{z}^{\prime}))+n_{0}(c_{4}(T_{z})-c_{4}(T_{z}^{\prime}))$$

Taxonomy

Topics: Bayesian Methods and Mixture Models · Bayesian Modeling and Causal Inference · Markov Chains and Monte Carlo Methods

Full text

Exchangeable and Sampling Consistent Distributions on Rooted Binary Trees

Benjamin Hollering and Seth Sullivant

1. Introduction

Leaf-labelled binary trees, which are commonly called phylogenetic trees, are frequently used to represent the evolutionary relationships between species. In this paper we restrict our attention to rooted binary trees, and the label set of a tree with $n$ leaves will always be $[n]=\{1,2,\ldots,n\}$. We call such trees $[n]$-trees and denote the set of them by $RB_{L}(n)$.

Processes for generating random $[n]$-trees play an important role in phylogenetics. Two common examples are the uniform distribution (where a tree is chosen uniformly at random from among all trees in $RB_{L}(n)$) and the Yule-Harding distribution (a simple Markov branching process). Some other examples of random tree models include Aldous’ $\beta$-splitting model [1], the $\alpha$-splitting model [8], and the coalescent process (which generates trees with edge lengths) [16]. Two features that are common to all these random tree processes, and desirable for any such process, are exchangeability and sampling consistency.

Let $p_{n}$ denote a probability distribution on $RB_{L}(n)$. Exchangeability means that relabeling the leaves of a tree does not change its probability. That is, for all $T\in RB_{L}(n)$ and $\sigma\in S_{n}$, $p_{n}(T)=p_{n}(\sigma T)$. Exchangeability is a natural condition since it does not allow the names of the species to play any special role in the probability distribution. A family of distributions $\{p_{n}\}_{n=2}^{\infty}$ on trees is sampling consistent if for each $n$, the distribution $p_{n}$ on $[n]$-trees can be realized as the marginalization of the distribution $p_{m}$ on $[m]$-trees for every $m>n$. That is, the probability of a $[n]$-tree $T$ under $p_{n}$ can be written as

$$p_{n}(T)=\sum_{\{S\in RB_{L}(m)\mid T=S|_{[n]}\}}p_{m}(S).$$

Sampling consistency is a natural condition for a random tree model because it means that randomly missing species do not affect the underlying distribution on the species that were observed.

The goal of this paper is to study the structure of finitely sampling consistent distributions on rooted binary trees. In particular, we aim to obtain a finite deFinetti-type theorem for these trees in the style of Diaconis’ Theorem 4 in [6]. Our motivation is two-fold. First of all, there has been significant work on understanding the set of exchangeable, sampling consistent distributions on other discrete objects, including rooted trees. A classic result in this theory is deFinetti’s Theorem for infinitely exchangeable sequences of binary random variables which shows that every subsequence of the infinite sequence can be expressed as a mixture of independent and identically distributed sequences. This does not hold for finitely exchangeable sequences but Diaconis later developed a finite form of deFinetti’s theorem. He showed that if a finite exchangeable sequence of binary random variables, $\{X_{i}\}_{i=1}^{n}$ , can be extended to an exchangeable sequence, $\{X_{i}\}_{i=1}^{m}$ where $m>n$ , then the original sequence can be approximated with a mixture of independent and identically distributed sequences with error $O(\frac{1}{m})$ [6]. A substantial amount of work has been done on exchangeable arrays (see [7] for example) as well, which has been used to prove deFinetti theorems for other discrete objects. For instance, Lauritzen, Rinaldo, and Sadeghi recently developed a deFinetti Theorem for exchangeable random networks [12].

As previously mentioned, there has already been considerable work characterizing exchangeable and sampling consistent distributions on trees using weighted real trees as limit objects in [9, 10, 11]. In [11] a characterization of the exchangeable and sampling consistent Markov branching models we discuss in Section 3.1 is obtained. A true deFinetti theorem for trees is conjectured in [10] and proven in Theorem 3 of [9]. The approach taken in these papers is to characterize all infinitely sampling consistent distributions on trees using a tree-limit like object called a weighted real tree. In this paper, we instead take a geometric and combinatorial approach to the study of exchangeable and finitely sampling consistent distributions on binary trees and examine what happens as we take the limit.

A second motivation comes from the combinatorial phylogenetics problem of studying properties of the distribution of the maximum agreement subtree of pairs of random trees. Let $T\in RB_{L}(n)$ and $S\subseteq[n]$ . The restriction tree $T|_{S}$ is the rooted binary tree with leaf label set $S$ obtained by removing all leaves of $T$ not in $S$ and suppressing all vertices of degree 2 except the root. Two trees, $T_{1},T_{2}\in RB_{L}(n)$ , agree on a set $S\subseteq[n]$ if $T_{1}|_{S}=T_{2}|_{S}$ . A maximum agreement set is an agreement set of the largest size for $T_{1}$ and $T_{2}$ . The size of a maximum agreement subtree of these two trees is the cardinality of the largest subset $S$ that $T_{1}$ and $T_{2}$ agree on and is denoted $MAST(T_{1},T_{2})$ . If $S$ is an agreement set with $|S|=MAST(T_{1},T_{2})$ then the resulting tree $T_{1}|_{S}=T_{2}|_{S}$ is a maximum agreement subtree of $T_{1}$ and $T_{2}$ .
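The definitions of restriction tree and $MAST(T_{1},T_{2})$ above can be made concrete with a small brute-force sketch. It is only an illustration under stated assumptions: trees are encoded as nested pairs with integer leaves (an encoding chosen here, not taken from the paper), the helper names `restrict`, `canonical`, and `mast` are hypothetical, and the search tries every subset of leaves, so it is only practical for very small trees.

```python
from itertools import combinations


def restrict(tree, keep):
    """Restriction tree T|_S: drop leaves outside `keep`, suppress degree-2 vertices."""
    if isinstance(tree, int):
        return tree if tree in keep else None
    left, right = (restrict(child, keep) for child in tree)
    if left is None:
        return right  # the vertex has one surviving child, so it is suppressed
    if right is None:
        return left
    return (left, right)


def canonical(tree):
    """Order-independent form: a leaf stays an int, an internal vertex is a frozenset."""
    if tree is None or isinstance(tree, int):
        return tree
    return frozenset(canonical(child) for child in tree)


def leaves(tree):
    """Set of leaf labels of a tree."""
    return {tree} if isinstance(tree, int) else leaves(tree[0]) | leaves(tree[1])


def mast(t1, t2):
    """Size of a maximum agreement set of two trees on the same label set."""
    labels = sorted(leaves(t1))
    for size in range(len(labels), 0, -1):
        for subset in combinations(labels, size):
            keep = set(subset)
            if canonical(restrict(t1, keep)) == canonical(restrict(t2, keep)):
                return size
    return 0


comb4 = (((1, 2), 3), 4)   # a labelled 4-leaf comb
bal4 = ((1, 2), (3, 4))    # a labelled 4-leaf balanced tree
print(mast(comb4, bal4))
```

Comparing canonical forms rather than the raw nested pairs is what makes the left/right order of children irrelevant, as it should be for rooted trees without a planar embedding.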

Understanding the distribution of $MAST(T_{1},T_{2})$ for random tree distributions would help in conducting hypothesis tests that the similarity between the trees is no greater than the similarity between random trees. For example, it was suggested in [5] that $MAST(T_{1},T_{2})$ could be used to test the hypothesis that no cospeciation occurred between a family of host species and a family of parasite species that prey on them. The study of the distribution of $MAST(T_{1},T_{2})$ for random trees $T_{1},T_{2}$ is primarily conducted with the assumption that $T_{1}$ and $T_{2}$ are drawn from an exchangeable, sampling consistent distribution on rooted binary trees. Bryant, Mackenzie, and Steel began the study of the distribution of $MAST(T_{1},T_{2})$ and obtained some first bounds on $\mathbb{E}(MAST(T_{1},T_{2}))$ for random trees $T_{1}$ and $T_{2}$ drawn from the Uniform or Yule-Harding distributions [3]. Later work on the distribution obtained an upper bound on the order of $O(\sqrt{n})$ for $\mathbb{E}(MAST(T_{1},T_{2}))$ when $T_{1}$ and $T_{2}$ are drawn from any exchangeable, sampling consistent distribution [2]. A lower bound on the order of $\Omega(\sqrt{n})$ has been conjectured for all exchangeable, sampling consistent distributions as well but this remains an open problem. Our hope in pursuing this project is that developing a better understanding of the set of all exchangeable sampling consistent distributions might shed light on this conjecture.

In this paper we study the structure of exchangeable, sampling consistent distributions on leaf labelled, rooted binary trees. We introduce a notion of a polytope of exchangeable and finitely sampling consistent distributions. We use it to study the set of exchangeable and sampling consistent distributions on trees and obtain characterizations for trees with a small number of leaves. We show that the set of all exchangeable and sampling consistent distributions on four leaf trees comes from the $\beta$-splitting model that was first introduced by Aldous in [1]. We have not been able to find a similar characterization for exchangeable and sampling consistent distributions on five leaf trees, but we describe some of the vertices of the polytope of exchangeable and finitely sampling consistent distributions. We also introduce a new exchangeable and sampling consistent model on trees, called the multinomial model, and show that every sampling consistent and exchangeable distribution can be realized as a convex combination of limits of sequences of multinomial distributions.

2. Exchangeability and Finite Sampling Consistency

In this section we describe how the set of exchangeable distributions relates to the set of all distributions on leaf labelled, rooted binary trees. We then introduce a notion of finite sampling consistency and discuss how it relates to traditional sampling consistency.

Recall that $RB_{L}(n)$ denotes the set of all leaf labelled, rooted binary trees with label set $[n]$ , which we call $[n]$ -trees, and that $|RB_{L}(n)|=(2n-3)!!$ . The set of all distributions on $RB_{L}(n)$ is the probability simplex $\Delta_{(2n-3)!!-1}\subseteq\mathbb{R}^{(2n-3)!!}$ where the coordinates are indexed by $[n]$ -trees. The symmetric group $S_{n}$ denotes the group of permutations of $[n]$ . For each $\sigma\in S_{n}$ and $T\in RB_{L}(n)$ let $\sigma T$ denote the tree obtained by applying $\sigma$ to the leaf labels.

Definition 2.1.

A distribution $p$ on $RB_{L}(n)$ is exchangeable if for all permutations $\sigma\in S_{n}$ and $[n]$ -trees $T\in RB_{L}(n)$ , $p(T)=p(\sigma T)$ . The set of all exchangeable distributions on $RB_{L}(n)$ is denoted $EX_{n}$ .

As previously mentioned, exchangeability requires that the probability of a $[n]$ -tree under a particular distribution depend only on the shape of the tree. Thus we only need to consider distributions on the set of tree shapes. Let $RB_{U}(n)$ denote the set of unlabelled rooted binary trees, which we may also call trees or tree shapes. This idea is summarized in the next lemma which is the $[n]$ -tree analogue of Lemma 2 in [12].

Lemma 2.2.

The set of exchangeable distributions on $RB_{L}(n)$ , $EX_{n}$ , is a simplex of dimension $|RB_{U}(n)|-1$ with coordinates indexed by tree shapes.

Proof.

First we define a distribution $p_{T}\in EX_{n}$ for each tree shape $T\in RB_{U}(n)$ . To do so, we let $O(T)$ be the set of trees $T^{\prime}\in RB_{L}(n)$ such that $\mathrm{shape}(T^{\prime})=T$ . For any tree $S\in RB_{L}(n)$ we set

$$p_{T}(S)=\begin{cases}\frac{1}{|O(T)|} & \mathrm{shape}(S)=T\\ 0 & \mathrm{shape}(S)\neq T.\end{cases}$$

Then $p_{T}\in EX_{n}$ since it is a probability distribution on trees and all trees of the same shape have the same probability. We claim that $EX_{n}=\mathrm{conv}\left(\{p_{T}:T\in RB_{U}(n)\}\right)$ , where $\mathrm{conv}(A)$ denotes the convex hull of the set $A$ . Since $p_{T}\in EX_{n}$ for all $T\in RB_{U}(n)$ , it is enough to show that any distribution $p\in EX_{n}$ can be written as a convex combination of the $p_{T}$ . If $p\in EX_{n}$ , then the probability of any tree $T^{\prime}\in RB_{L}(n)$ depends only on the shape of $T^{\prime}$ not the leaf labelling so we can write

$$p=\sum_{T\in RB_{U}(n)}p(T)\cdot|O(T)|\cdot p_{T}$$

where $p(T)$ represents the probability of any $[n]$ -tree in $RB_{L}(n)$ with shape $T$ . Since the original $p$ is a probability distribution on all leaf labelled trees the weights in the linear combination are nonnegative and sum to $1$ .

Lastly we note that the vectors $p_{T}$ are affinely independent since there is no overlap of coordinate indices where the entries in $p_{T}$ are nonzero. So $EX_{n}=\mathrm{conv}\left(\{p_{T}:T\in RB_{U}(n)\}\right)$ is a simplex and has coordinates indexed by $RB_{U}(n)$ . ∎
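As a sanity check on the counts used in this section, the following sketch enumerates the tree shapes in $RB_{U}(n)$ and the orbit sizes $|O(T)|$, and confirms that they add up to $|RB_{L}(n)|=(2n-3)!!$. It relies on a standard fact that is not stated in the text: $|O(T)|=n!/2^{s(T)}$, where $s(T)$ is the number of internal vertices whose two child subtrees have the same shape. The nested-tuple encoding (a leaf is `()`) and all function names are assumptions made for this illustration.

```python
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def shapes(n):
    """All rooted binary tree shapes with n leaves as canonical nested tuples."""
    if n == 1:
        return ((),)
    found = set()
    for i in range(1, n // 2 + 1):
        for left in shapes(i):
            for right in shapes(n - i):
                found.add(tuple(sorted((left, right))))
    return tuple(sorted(found))


def num_leaves(shape):
    """Number of leaves of a shape."""
    return 1 if shape == () else num_leaves(shape[0]) + num_leaves(shape[1])


def symmetric_vertices(shape):
    """Number of internal vertices whose two child subtrees have the same shape."""
    if shape == ():
        return 0
    left, right = shape
    return (left == right) + symmetric_vertices(left) + symmetric_vertices(right)


def orbit_size(shape):
    """|O(T)|: number of distinct leaf labellings of the shape."""
    return factorial(num_leaves(shape)) // 2 ** symmetric_vertices(shape)


def odd_double_factorial(n):
    """(2n-3)!! = 1 * 3 * ... * (2n-3)."""
    result = 1
    for k in range(1, 2 * n - 2, 2):
        result *= k
    return result


for n in range(2, 8):
    total = sum(orbit_size(s) for s in shapes(n))
    print(n, len(shapes(n)), total, odd_double_factorial(n))
```

For each $n$, the second printed column is $|RB_{U}(n)|$, so the simplex $EX_{n}$ of Lemma 2.2 has dimension one less than that number.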

Lemma 2.2 allows us to move from studying exchangeable distributions on leaf labelled $[n]$-trees to studying all distributions on unlabelled trees. From now on we focus primarily on understanding the set of sampling consistent distributions within $EX_{n}$. First recall that for $p_{m}\in EX_{m}$ the marginalization or projection map $\pi_{n}$ gives a new distribution $p_{n}^{m}$ on $RB_{L}(n)$ for $n<m$, defined for all $T\in RB_{L}(n)$ by

$$\pi_{n}(p_{m})(T)=p_{n}^{m}(T)=\sum_{\{S\in RB_{L}(m)\mid T=S|_{[n]}\}}p_{m}(S).$$

We will use this marginalization map to define a notion of finite sampling consistency.

Definition 2.3.

A family of distributions $\{p_{k}\}_{k=n}^{m}$ is finitely sampling consistent or $m$ -sampling consistent, if for each $n\leq k<m$ , $p_{k}=\pi_{k}(p_{m})$ . We denote the set of all distributions in $EX_{n}$ that are $m$ -sampling consistent by

$$EX_{n}^{m}=\pi_{n}(EX_{m}).$$

It is immediate that if a distribution in $EX_{n}$ is $m$ -sampling consistent, then for any $k$ , such that $n<k<m$ , the distribution is also $k$ -sampling consistent. This leads to the following:

Lemma 2.4.

For all $m>k>n$ ,

$$EX_{n}^{m}\subseteq EX_{n}^{k}.$$

A distribution in $EX_{n}$ , is sampling consistent if it is part of a $m$ -sampling consistent family of distributions for all $m>n$ . In other words, a distribution is sampling consistent if it is in $EX_{n}^{m}$ for all $m>n$ . Thus we can define the following notation for the set of exchangeable distributions on $RB_{L}(n)$ that are sampling consistent:

$$EX_{n}^{\infty}:=\bigcap_{m=n}^{\infty}EX_{n}^{m}.$$

Lemma 2.5.

Let $p_{T}\in EX_{m}$ be defined as it is in Lemma 2.2, then

$$EX_{n}^{m}=\mathrm{conv}(\{\pi_{n}(p_{T}):T\in RB_{U}(m)\}).$$

Proof.

Clearly $\mathrm{conv}(\{\pi_{n}(p_{T}):T\in RB_{U}(m)\})\subseteq EX_{n}^{m}$, since $\pi_{n}(p_{T})\in EX_{n}^{m}$ for all $T\in RB_{U}(m)$ and $EX_{n}^{m}$ is convex, being the image of the simplex $EX_{m}$ under the linear map $\pi_{n}$. It is therefore enough to show that any distribution $p_{n}^{m}\in EX_{n}^{m}$ can be written as a convex combination of the $\pi_{n}(p_{T})$. If $p_{n}^{m}\in EX_{n}^{m}$, then there exists $p_{m}\in EX_{m}$ such that $\pi_{n}(p_{m})=p_{n}^{m}$. Since $p_{m}\in EX_{m}$, we know from Lemma 2.2 that we can write $p_{m}=\sum_{T\in RB_{U}(m)}p_{m}(T)\cdot|O(T)|\cdot p_{T}$. Then evaluating $\pi_{n}(p_{m})$ at a $[n]$-tree $S\in RB_{L}(n)$ gives

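$$\pi_{n}(p_{m})(S)=\sum_{\{Q\in RB_{L}(m)\mid S=Q|_{[n]}\}}\ \sum_{T\in RB_{U}(m)}p_{m}(T)\cdot|O(T)|\cdot p_{T}(Q).$$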

Changing the order of summation we have

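$$\pi_{n}(p_{m})(S)=\sum_{T\in RB_{U}(m)}p_{m}(T)\cdot|O(T)|\sum_{\{Q\in RB_{L}(m)\mid S=Q|_{[n]}\}}p_{T}(Q)$$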

but $\sum_{\{Q\in RB_{L}(m)|S=Q|_{[n]}\}}p_{T}(Q)=\pi_{n}(p_{T})(S)$ so we get that

$$\pi_{n}(p_{m})(S)=\sum_{T\in RB_{U}(m)}p_{m}(T)\cdot|O(T)|\cdot\pi_{n}(p_{T})(S)$$

which shows that $p_{n}^{m}=\pi_{n}(p_{m})$ can be written as a convex combination of the $\pi_{n}(p_{T})$ . ∎

Example 2.6.

While it will be the case that $EX_{n}^{m}=\mathrm{conv}(\{\pi_{n}(p_{T}):T\in RB_{U}(m)\})$ , not every $\pi_{n}(p_{T})$ will be a vertex of $EX_{n}^{m}$ . Figure 1 illustrates this.

Lemma 2.5 implies that understanding how the marginalization map acts on the vertices of $EX_{m}$ will allow us to compute all of $EX_{n}^{m}$ . The following lemma and corollary will give us a method for calculating the vertices of $EX_{n}^{m}$ by computing subtree densities.

Lemma 2.7.

Let $S\in RB_{L}(n)$ and $T\in RB_{U}(m)$ . Also let $c_{T}(S)=|\{Q\in RB_{L}(m)|S=Q|_{[n]},shape(Q)=T\}|$ . Then $\pi_{n}(p_{T})(S)=\frac{c_{T}(S)}{|O(T)|}$ .

Proof.

By definition of the map $\pi_{n}$

$$\pi_{n}(p_{T})(S)=\sum_{\{Q\in RB_{L}(m)\mid S=Q|_{[n]}\}}p_{T}(Q)$$

but $p_{T}(Q)$ is nonzero if and only if $shape(Q)=T$ , in which case it is $\frac{1}{|O(T)|}$ . So the above sum becomes

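$$\pi_{n}(p_{T})(S)=\sum_{\{Q\in RB_{L}(m)\mid S=Q|_{[n]},\ \mathrm{shape}(Q)=T\}}\frac{1}{|O(T)|}=\frac{c_{T}(S)}{|O(T)|}.$$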

∎

Corollary 2.8.

Let $S^{\prime}\in RB_{U}(n)$ and $T\in RB_{U}(m)$. Then $\pi_{n}(p_{T})(S^{\prime})$, which denotes the sum of $\pi_{n}(p_{T})(S)$ over all $S\in O(S^{\prime})$, is the induced subtree density of $S^{\prime}$ in $T$. That is, $\pi_{n}(p_{T})(S^{\prime})$ is the fraction of the restriction trees of $T$, obtained by marginalizing out $m-n$ of its leaves, that have shape $S^{\prime}$.

Proof.

From the previous lemma, we know that for any $S\in O(S^{\prime})$ , $\pi_{n}(p_{T})(S)=\frac{c_{T}(S)}{|O(T)|}$ where $c_{T}(S)=|\{Q\in RB_{L}(m)|S=Q|_{[n]},shape(Q)=T\}|$ . Then we have

$$\pi_{n}(p_{T})(S^{\prime})=\sum_{S\in O(S^{\prime})}\frac{c_{T}(S)}{|O(T)|}$$

So for each labelling $S$ of $S^{\prime}$, we are counting the fraction of labellings of $T$ that yield $S$ when restricted to $[n]$. Summing over all labellings $S$ of $S^{\prime}$ gives the total fraction of times that the shape $S^{\prime}$ appears as a restriction tree of the shape $T$ when $m-n$ of its leaves are marginalized out. ∎

The following example elucidates what is meant by induced subtree density and shows how we can explicitly calculate this quantity.

Example 2.9.

We show how to find the projection of one vertex of $EX_{5}$ down to $EX_{4}$; the polytope $EX_{4}^{5}$ is then the convex hull of the projections of all of the vertices of $EX_{5}$. Begin with the tree shape $T$ pictured in Figure 2(a). We label the leaves of $T$ for the sake of the calculation, but it should be thought of as an unlabelled tree. We then find the shape of the restriction tree for each of the five $4$-subsets of $[5]$. The two subsets that omit a leaf from the two-leaf side of the root split of $T$ give the shape $Comb_{4}$, and the three subsets that omit a leaf from the three-leaf side give the shape $Bal_{4}$; both shapes are pictured in Figure 2(b). We let the first coordinate of $EX_{4}$ be the probability of obtaining $Comb_{4}$ and the second be the probability of obtaining $Bal_{4}$. As mentioned above, these probabilities are simply the number of times each shape appears as a restriction tree divided by the total number of restriction trees. Thus this vertex of $EX_{5}$ gives the distribution $(2/5,3/5)$ in $EX_{4}$.
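The computation in Example 2.9 can be sketched in a few lines. The code below counts, for a tree shape $T$ with $m$ leaves, how often each $n$-leaf shape occurs among the restrictions of $T$ to $n$-subsets of its leaves, which by Corollary 2.8 gives $\pi_{n}(p_{T})$. The nested-tuple encoding (a leaf is `()`), the function names, and the identification of the five-leaf shape in Figure 2(a) with `BAL5` are assumptions made for this illustration; that shape is the only five-leaf shape whose densities are $(2/5,3/5)$.

```python
from fractions import Fraction
from itertools import combinations

LEAF = ()
CHERRY = (LEAF, LEAF)
COMB3 = (LEAF, CHERRY)
COMB4 = (LEAF, COMB3)
COMB5 = (LEAF, COMB4)
BAL4 = (CHERRY, CHERRY)
BAL5 = (CHERRY, COMB3)  # root split into a cherry and a three-leaf comb


def num_leaves(shape):
    """Number of leaves of a shape."""
    return 1 if shape == () else num_leaves(shape[0]) + num_leaves(shape[1])


def restricted_shape(shape, keep, first=0):
    """Return (restriction to leaves whose depth-first index is in keep, next index)."""
    if shape == ():
        return (() if first in keep else None), first + 1
    left, nxt = restricted_shape(shape[0], keep, first)
    right, nxt = restricted_shape(shape[1], keep, nxt)
    if left is None:
        return right, nxt
    if right is None:
        return left, nxt
    return tuple(sorted((left, right))), nxt  # canonical child order


def subtree_densities(shape, n):
    """Induced subtree densities of all n-leaf shapes in `shape`."""
    m = num_leaves(shape)
    counts = {}
    for subset in combinations(range(m), n):
        small, _ = restricted_shape(shape, set(subset))
        counts[small] = counts.get(small, 0) + 1
    total = sum(counts.values())
    return {s: Fraction(c, total) for s, c in counts.items()}


densities = subtree_densities(BAL5, 4)
print(densities[COMB4], densities[BAL4])
```

Running the same function on every shape in $RB_{U}(m)$ produces the point set whose convex hull is $EX_{n}^{m}$ by Lemma 2.5.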

We have now seen how to compute the vertices of $EX_{n}^{m}$ explicitly but not every distribution $\pi_{n}(p_{T})$ is a vertex of $EX_{n}^{m}$ . However, the comb tree always yields a vertex of $EX_{n}^{m}$ .

Lemma 2.10.

For all $m\geq n$, let $Comb_{m}\in RB_{U}(m)$ be the $m$-leaf comb tree. Then $\pi_{n}(p_{Comb_{m}})=p_{Comb_{n}}$ is a vertex of $EX_{n}^{m}$.

Proof.

The comb tree has only smaller comb trees as restriction trees, so the image of the comb distribution on $m$ leaves under the marginalization map will be the comb distribution on $n$ leaves. Since $p_{Comb_{n}}$ is a vertex of $EX_{n}$ and $EX_{n}^{m}$ is a subset of $EX_{n}$ , then $p_{Comb_{n}}$ is also a vertex of $EX_{n}^{m}$ . ∎

3. Examples of Exchangeable and Sampling Consistent distributions

In this section we discuss some of the well-known exchangeable and sampling consistent families of distributions, particularly the Markov branching models. We also introduce a new family of exchangeable sampling consistent tree distributions, namely the multinomial family.

3.1. Markov Branching Models

An important example of sampling consistent and exchangeable distributions are the families of Markov branching models which can be constructed in the following way as first introduced in [1] by Aldous.

Suppose that for every integer $n\geq 2$, we have a probability distribution $q_{n}=(q_{n}(i):i=1,2,\ldots,n-1)$ on $\{1,2,\ldots,n-1\}$ which satisfies $q_{n}(i)=q_{n}(n-i)$. Using this family of distributions we can define a probability distribution on $RB_{U}(n)$ by taking the probability that $i$ leaves fall on the left of the root-split and $n-i$ leaves fall on the right of the root-split to be $q_{n}(i)$, with each choice of $i$ labels to fall on the left having the same probability. Repeating recursively in each branch yields the probability of a rooted binary tree. Aldous called these models Markov branching models.

Haas et al. classified the sampling consistent Markov branching models on rooted binary trees in [11]. They show that every sampling consistent Markov branching model, defined by the splitting rules $q_{n}$ , $n\geq 2$ , has an integral representation of the form

$$q_{n}(i)=a_{n}^{-1}\left(\binom{n}{i}\int_{0}^{1}x^{i}(1-x)^{n-i}\,\nu(dx)+nc\,1_{i=1}\right)\tag{1}$$

where $c\geq 0$ , $\nu$ is a symmetric measure on $(0,1)$ such that $\int_{0}^{1}x(1-x)\nu(dx)<\infty$ , and $a_{n}$ is a normalization constant. $c1_{i=1}$ accounts for the comb distribution. A subclass of these models are those where the measure $\nu$ in equation (1) has the form $\nu(dx)=f(x)dx$ for a probability density function $f$ on $(0,1)$ that is symmetric on the interval (i.e. $f(x)=f(1-x)$ ) and where $c=0$ . These Markov branching models can be thought of as uniformly choosing $n$ points in the interval $(0,1)$ at random and then splitting the interval with respect to the density $f$ . Repeating the splitting process recursively in each subinterval until each of the original $n$ points is contained in its own subinterval gives a tree shape. This process is pictured in Figure 6 in [1].

One particularly important family of Markov branching distributions is the beta-splitting model. It is a Markov branching model that belongs to the subclass mentioned above where the function $f$ in the above description has the form

$$f(x)=\frac{\Gamma(2\beta+2)}{\Gamma^{2}(\beta+1)}x^{\beta}(1-x)^{\beta}$$

for $-1<\beta<\infty$ . For the beta-splitting model we can calculate the values $q_{n}(i)$ explicitly in terms of $\beta$ . By plugging in the beta-splitting density function $f$ into (1) for $q_{n}(i)$ we get the following formulas:

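$$q_{n}(i)=a_{n}^{-1}\binom{n}{i}\frac{\Gamma(\beta+i+1)\Gamma(\beta+n-i+1)\Gamma(2\beta+2)}{\Gamma(\beta+n+2)\Gamma^{2}(\beta+1)}\tag{2}$$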

for $-1<\beta<\infty$ . Note that (2) gives a valid probability distribution when $-2<\beta\leq-1$ and so it is natural to extend the beta-splitting model to those values of $\beta$ , although the density is not well-defined in that case. As $\beta$ approaches $-2$ the beta-splitting model approaches the distribution which puts all probability on the comb tree, so we also include $\beta=-2$ in the beta splitting model as the comb distribution.

An important note here is that for the beta-splitting model each $q_{n}(i)$ is actually a rational function in $\beta$ . Using properties of the gamma function one can see that the above formula simplifies to

$$q_{n}(i)=\frac{\binom{n}{i}(i+\beta)_{i}(n-i+\beta)_{n-i}}{(n+2\beta+1)_{n}-2(n+\beta)_{n}}$$

Since each $q_{n}(i)$ is a rational function in $\beta$ , we can see that the probability of obtaining a certain tree shape is a rational function in $\beta$ as well because the probability of obtaining that tree shape under the beta-splitting model is simply the product of the probability of all of the splits in the tree.

Example 3.1.

Let $Comb_{4}$ and $Bal_{4}$ be the trees pictured in Figure 2(b). Then the probabilities of obtaining them under the beta-splitting model are

$$p(Comb_{4})=\frac{4\beta+12}{7\beta+18},\qquad p(Bal_{4})=\frac{3\beta+6}{7\beta+18}.$$
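The rational formula for $q_{n}(i)$ can be evaluated exactly, which gives a quick check of the four-leaf probabilities. The sketch below assumes that $(x)_{k}$ denotes the falling factorial $x(x-1)\cdots(x-k+1)$, a reading that makes $q_{2}(1)=1$ and recovers the Yule-Harding ($\beta=0$) and uniform ($\beta=-3/2$) values. It uses $p(Bal_{4})=q_{4}(2)$ and $p(Comb_{4})=q_{4}(1)+q_{4}(3)$, since a three-leaf subtree has only one shape. Direct simplification of the formula gives $q_{4}(2)=\frac{3\beta+6}{7\beta+18}$, which the code compares against. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def falling(x, k):
    """Falling factorial (x)_k = x (x-1) ... (x-k+1), with (x)_0 = 1."""
    result = Fraction(1)
    for j in range(k):
        result *= x - j
    return result


def q_beta(n, i, beta):
    """Beta-splitting rule q_n(i); undefined here at beta = -1 and beta = -2,
    where numerator and denominator of the rational formula both vanish."""
    beta = Fraction(beta)
    numerator = comb(n, i) * falling(i + beta, i) * falling(n - i + beta, n - i)
    denominator = falling(n + 2 * beta + 1, n) - 2 * falling(n + beta, n)
    return numerator / denominator


def four_leaf_probabilities(beta):
    """Return (p(Comb_4), p(Bal_4)) under the beta-splitting model."""
    balanced = q_beta(4, 2, beta)
    comb4 = q_beta(4, 1, beta) + q_beta(4, 3, beta)
    return comb4, balanced


for beta in (Fraction(-3, 2), 0, 1, 10):
    comb4, balanced = four_leaf_probabilities(beta)
    closed_form = (3 * Fraction(beta) + 6) / (7 * Fraction(beta) + 18)
    print(beta, comb4, balanced, balanced == closed_form)
```

Exact fractions are used instead of floats so that the comparison with the closed form is an identity check rather than a tolerance test.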

This model also has a nice characterization among all of the sampling consistent Markov branching models. In [14], McCullagh, Pitman, and Winkel show that the beta-splitting models are the only sampling consistent Markov branching models whose splitting rules admit a particular factorization.

We are interested in examining how the sampling consistent Markov branching models, and in particular the beta-splitting model, fit inside of $EX_{n}$ as a whole. These distributions are infinitely sampling consistent and so lie in $EX_{n}^{\infty}$ as well. A priori, it might seem that to determine the probability of a tree shape with $n$ leaves under a Markov branching model one would need not only the distribution $q_{n}$ but also the distributions $q_{k}$ for $2\leq k\leq n-1$. This is not the case for any sampling consistent Markov branching model, though. Ford showed in Proposition 41 of [8] that if $(q_{k}|2\leq k\leq n)$ are the splitting rules for a distribution in $EX_{n}^{\infty}$, then in fact it must be that

$$q_{n-1}(i)=\frac{(n-i)q_{n}(i)+(i+1)q_{n}(i+1)}{n-2q_{n}(1)}$$

This implies that all that is needed to define a distribution in $EX_{n}^{\infty}$ is the first splitting rule $q_{n}$ which gives the following corollary.

Corollary 3.2.

The dimension of the set of all sampling consistent Markov branching models in $EX_{n}$ is at most $\lceil{\frac{n-1}{2}}\rceil-1$.

Proof.

As explained above, a sampling consistent Markov branching model is completely determined by the distribution $q_{n}=(q_{n}(i):i=1,2,\ldots,n-1)$, which determines all of the distributions $q_{k}=(q_{k}(i):i=1,2,\ldots,k-1)$ where $2\leq k\leq n-1$. Since $q_{n}$ must be symmetric we immediately get that the values $q_{n}(1),q_{n}(2),\ldots,q_{n}(\lceil{\frac{n-1}{2}}\rceil)$ determine all of $q_{n}$. Also, since $q_{n}$ must be a distribution we lose one of these as a free parameter, thus the dimension of the set of sampling consistent Markov branching models is bounded above by $\lceil{\frac{n-1}{2}}\rceil-1$. ∎

Note that when $n=4$ , the space of sampling consistent Markov branching models has dimension $1$ . We will see in Section 4 that the set of beta-splitting models is equal to the set of sampling consistent Markov branching models in this case.
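Ford's recursion quoted before Corollary 3.2 is easy to apply mechanically. The sketch below takes a splitting rule $q_{n}$, stored as a dictionary from $i$ to $q_{n}(i)$ (a representation chosen for this illustration), and produces $q_{n-1}$, then iterates down to $q_{2}$. The function names are hypothetical, and the input is assumed to be the top splitting rule of a sampling consistent Markov branching model; the example uses the Yule-Harding rule, which is uniform on $\{1,\ldots,n-1\}$.

```python
from fractions import Fraction


def marginal_splitting_rule(q_n):
    """Apply the recursion once: from q_n (keys 1..n-1) compute q_{n-1} (keys 1..n-2)."""
    n = len(q_n) + 1
    denominator = n - 2 * q_n[1]
    return {
        i: ((n - i) * q_n[i] + (i + 1) * q_n[i + 1]) / denominator
        for i in range(1, n - 1)
    }


def all_splitting_rules(q_n):
    """Return {k: q_k} for every k from n down to 2."""
    rules = {len(q_n) + 1: dict(q_n)}
    current = dict(q_n)
    while len(current) > 1:
        current = marginal_splitting_rule(current)
        rules[len(current) + 1] = current
    return rules


# Yule-Harding splitting rule on 5 leaves: uniform on {1, 2, 3, 4}.
q5 = {i: Fraction(1, 4) for i in range(1, 5)}
for k, rule in sorted(all_splitting_rules(q5).items()):
    print(k, rule)
```

Because each lower rule is a function of $q_{n}$ alone, the free parameters of the model are exactly those of a symmetric probability vector $q_{n}$, which is the count used in the proof of Corollary 3.2.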

3.2. Multinomial model

The multinomial model associates to each tree shape $T\in RB_{U}(m)$, for any $m\geq 2$, a family of probability distributions on $RB_{L}(n)$ for each $n$. We will often extend the model to allow extended trees with an additional leaf added to the root. We associate to every edge $e$ in $T$ a parameter $t_{e}\geq 0$. This gives us a vector of parameters $t=(t_{e}|e\in E(T))$ of length $2m-1$, and we assume that $\sum_{e}t_{e}=1$, so that these parameters give a probability distribution on the edges of $T$. We will now use this probability distribution to define a set of distributions on $RB_{U}(n)$ for any $n\geq 2$. Note that $n$ and $m$ do not have to be related to each other.

Using the distribution $t$ , we draw a multiset $A$ of edges from the tree $T$ , where edge $e$ occurs with probability $t_{e}$ . There is a natural way to take the tree $\tilde{T}$ and a multiset $A$ of size $n$ on the set of parameters and construct a new tree which we will call $T_{A}\in RB_{U}(n)$ . Each time that an edge $e$ appears in $A$ , we add a new leaf to the edge $e$ , which will give us a new tree with an undetermined number of leaves. We then simply take $T_{A}$ to be the induced subtree on only the leaves that come from $A$ . Hence, the multinomial model on the tree $T$ gives a way to produce random trees with an underlying skeleton that is the tree $T$ . For large $n$ , the resulting random trees look like $T$ with many extra leaves added.

The multinomial probability of observing a particular multiset of edges $A$ is the monomial

$$poly(A)=\binom{n}{m_{A}}\prod_{e\in E(T)}t_{e}^{m_{A}(e)}$$

where $m_{A}(e)$ denotes the number of times that $e$ appears in the multiset $A$ , and $m_{A}$ is the resulting vector.

Letting $M_{n}^{T}$ be the set of all $n$ element multisets of edges of $T$ , we can calculate the probability of observing any particular tree shape $S$ by

$$p_{T,t}(S)=\sum_{A\in M_{n}^{T}:\,T_{A}=S}\binom{n}{m_{A}}\prod_{e\in E(T)}t_{e}^{m_{A}(e)}.$$

Example 3.3.

Consider the tree $T$ from Figure 3(b) with edge parameters $(t_{1},t_{2},t_{3})$ . To calculate the probability of the tree, $Bal_{5}$ , in Figure 3(c) we use the formula

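$$p_{T,t}(Bal_{5})=\sum_{A\in M_{5}^{T}:\,T_{A}=Bal_{5}}\binom{5}{m_{A}}\prod_{e\in E(T)}t_{e}^{m_{A}(e)}.$$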

The only multisets that satisfy this condition are the sets $A_{1}=\{2,2,2,3,3\}$ and $A_{2}=\{2,2,3,3,3\}$ . This is because if $1$ appears in a multiset $A$ any positive number of times, the tree $T_{A}$ will have a single leaf on one side of the root and four leaves on the other side, regardless of what other parameters appear in the set. So $A_{1}$ and $A_{2}$ are the only elements of $M_{5}^{T}$ that we sum over so

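$$p_{T,t}(Bal_{5})=\binom{5}{3,2}t_{2}^{3}t_{3}^{2}+\binom{5}{2,3}t_{2}^{2}t_{3}^{3}=10t_{2}^{3}t_{3}^{2}+10t_{2}^{2}t_{3}^{3}.$$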
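The computation in Example 3.3 can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes that the tree of Figure 3(b) is a cherry with a root edge (edge 1) and two pendant edges (edges 2 and 3), and that $k$ leaves added to the same edge form a comb on $k$ leaves; both assumptions agree with the two multisets listed in the example. The helper names `join`, `comb`, `shape_from_counts` and `multinomial_distribution` are hypothetical.

```python
from math import factorial

LEAF = "*"


def join(a, b):
    """Join two shapes at a new root; sorting the children makes shapes unlabelled."""
    return tuple(sorted((a, b), key=repr))


def comb(k):
    """Comb shape on k >= 1 leaves."""
    shape = LEAF
    for _ in range(k - 1):
        shape = join(LEAF, shape)
    return shape


def shape_from_counts(a1, a2, a3):
    """Shape T_A when the root edge is drawn a1 times and the two pendant
    edges are drawn a2 and a3 times (assumed construction, see lead-in)."""
    sides = [comb(k) for k in (a2, a3) if k > 0]
    if len(sides) == 2:
        shape = join(sides[0], sides[1])
    elif len(sides) == 1:
        shape = sides[0]
    else:
        shape = None
    for _ in range(a1):  # leaves on the root edge sit above everything else
        shape = LEAF if shape is None else join(LEAF, shape)
    return shape


def multinomial_distribution(t, n):
    """Distribution on n-leaf shapes induced by edge weights t = (t1, t2, t3)."""
    t1, t2, t3 = t
    dist = {}
    for a1 in range(n + 1):
        for a2 in range(n + 1 - a1):
            a3 = n - a1 - a2
            coeff = factorial(n) // (factorial(a1) * factorial(a2) * factorial(a3))
            weight = coeff * t1**a1 * t2**a2 * t3**a3
            shape = shape_from_counts(a1, a2, a3)
            dist[shape] = dist.get(shape, 0.0) + weight
    return dist


BAL5 = join(comb(2), comb(3))
dist = multinomial_distribution((0.2, 0.3, 0.5), 5)
print(dist[BAL5], 10 * 0.3**3 * 0.5**2 + 10 * 0.3**2 * 0.5**3)
```

Under these assumptions only three shapes on five leaves receive positive probability, and the weight of $Bal_{5}$ depends on $t_{2}$ and $t_{3}$ alone.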

The multinomial model gives a family of distributions as we let the parameter vector $t$ range over the entire simplex. Equivalently, the model can be described as the image of the simplex under the polynomial map

[TABLE]

where the coordinate corresponding to the tree shape $S\in RB_{U}(n)$ has value $p_{T,t}(S)$ for $t\in\Delta_{2m-2}$ . Since $\Delta_{2m-2}$ is a semialgebraic set and $p_{T}$ is a polynomial map, the multinomial model is also a semialgebraic set.

It also holds that if we take any tree $T\in RB_{U}(m)$ , and any subtree $T^{\prime}\in RB_{U}(m^{\prime})$ of $T$ , then we have that $Im(p_{T^{\prime}})\subseteq Im(p_{T})$ . This is because if the parameters corresponding to edges that appear in $T$ but not in $T^{\prime}$ are set to $0$ in $p_{T}$ , the map will simply become $p_{T^{\prime}}$ . Setting these parameters to $0$ just corresponds to restricting $p_{T}$ to a subset of the simplex and thus we get the image containment.

A last interesting note is that this model is perhaps similar in spirit to the $W$ -random graphs when $W$ is a graphon obtained from a finite graph $G$ as described in [13]. The construction begins with a finite graph $G$ and uses it to define a distribution on graphs with $k$ vertices similarly to how we begin with a tree $T$ and define a distribution on trees with $k$ leaves.

We end this section with Figure 4, which shows both the beta-splitting model and the multinomial model inside $EX_{5}$ . In the next section we will discuss the exchangeable and sampling consistent distributions on four leaf trees and how they relate to the models discussed in this section.

4. Distributions in $EX_{4}^{\infty}$

In this section we classify all of the distributions in $EX_{4}^{\infty}$ . In particular, we show that $EX_{4}^{\infty}$ is equal to the beta-splitting model.

First we note that since there are only two distinct tree shapes with four leaves (see Figure 3(a)), the set of exchangeable distributions is just a 1-dimensional simplex $\Delta_{1}$ in $\mathbb{R}^{2}$ . We take coordinates $(p_{1},p_{2})$ on $\mathbb{R}^{2}$ and let the first coordinate correspond to $Comb_{4}$ and the second coordinate to $Bal_{4}$ . The subset of distributions that are also sampling consistent must be some line segment within the simplex. We know from Lemma 2.10 that the comb distribution, which is $(1,0)$ in these coordinates, is a vertex in $EX_{4}^{\infty}$ . If we can bound the probability of obtaining $Bal_{4}$ then we will have a complete characterization of all distributions in $EX_{4}^{\infty}$ . Theorem 14 in [4] will be the main tool to achieve this.

Theorem 4.1.

[4, Thm 14] The most balanced tree in $RB_{U}(n)$ has the complete symmetric tree on four leaves appear more frequently as a subtree than any other tree in $RB_{U}(n)$ .

By the most balanced tree in $RB_{U}(n)$ , we mean the unique tree shape in $RB_{U}(n)$ with the property that, for every internal vertex of the tree, the numbers of leaves in the left and right subtrees below that vertex differ by at most one.

Theorem 4.2.

The four leaf beta-splitting model equals the set of all exchangeable and sampling consistent distributions on $RB_{U}(4)$ .

Proof.

Note that $EX_{4}^{n}$ only has two vertices since it is a line segment. The comb distribution $(1,0)$ is always a vertex in $EX_{4}^{n}$ , by Lemma 2.10. The other vertex will be the projection of the vertex of $EX_{n}$ that places the most mass on $Bal_{4}$ . The projection of a vertex $p_{T}\in EX_{n}$ , is $(p_{1},p_{2})=\frac{1}{\binom{n}{4}}(m_{1},m_{2})$ where $m_{1}$ is the number of $4$ element subsets $S\subset[n]$ such that $T|_{S}=Comb_{4}$ and $m_{2}$ is the number of $4$ element subsets $S\subset[n]$ such that $T|_{S}=Bal_{4}$ . By Theorem 4.1 we can restrict to the most balanced tree in $RB_{U}(n)$ . We will use $m_{2,n}$ to denote this highest value of $m_{2}$ that we get from the most balanced tree in $RB_{U}(n)$ .
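As an illustration of the projection just described, the counts $m_{1}$ and $m_{2}$ can be obtained by brute force for small $n$. This is a sketch under our own conventions: trees are nested tuples with integer leaf labels, and the function names `balanced`, `restrict` and `projection_to_four_leaves` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb as binom

BAL4 = (("*", "*"), ("*", "*"))


def balanced(leaves):
    """Most balanced tree on a list of leaf labels, as nested tuples."""
    if len(leaves) == 1:
        return leaves[0]
    half = len(leaves) // 2
    return (balanced(leaves[:half]), balanced(leaves[half:]))


def restrict(tree, keep):
    """Unlabelled shape of the subtree induced by the leaf set `keep`."""
    if not isinstance(tree, tuple):
        return "*" if tree in keep else None
    left, right = restrict(tree[0], keep), restrict(tree[1], keep)
    if left is None:
        return right
    if right is None:
        return left
    return tuple(sorted((left, right), key=repr))


def projection_to_four_leaves(tree, n):
    """Return (p1, p2) = (m1, m2) / binom(n, 4) for a tree on leaves 0..n-1."""
    m2 = sum(restrict(tree, set(s)) == BAL4 for s in combinations(range(n), 4))
    total = binom(n, 4)
    return Fraction(total - m2, total), Fraction(m2, total)


print(projection_to_four_leaves(balanced(list(range(8))), 8))
```

Since $Comb_{4}$ and $Bal_{4}$ are the only shapes on four leaves, $m_{1}$ is recovered as $\binom{n}{4}-m_{2}$.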

The beta-splitting model on $RB_{U}(4)$ , on the other hand, is the line segment from $(1,0)$ to $(\frac{4}{7},\frac{3}{7})$ . Indeed, under the beta splitting model, the probability of $Bal_{4}$ is just

$$\frac{3\beta+6}{7\beta+18}.$$

As $\beta\rightarrow\infty$ , this converges to $\tfrac{3}{7}$ . So if we can show that $\lim_{n\rightarrow\infty}\tfrac{m_{2,n}}{\binom{n}{4}}=\tfrac{3}{7}$ then we will be done.
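The limit $\tfrac{3}{7}$ can be sanity-checked numerically. The sketch below assumes the standard form of Aldous' split distribution, $q_{n}(i)\propto\frac{\Gamma(\beta+i+1)\Gamma(\beta+n-i+1)}{\Gamma(i+1)\Gamma(n-i+1)}$ for $\beta>-2$ , and uses the fact that a four leaf tree has shape $Bal_{4}$ exactly when the root splits the leaves two and two. The function names are hypothetical.

```python
from math import exp, lgamma


def beta_split(n, beta):
    """Normalised split probabilities q_n(1), ..., q_n(n-1) for beta > -2."""
    logw = [
        lgamma(beta + i + 1) + lgamma(beta + n - i + 1)
        - lgamma(i + 1) - lgamma(n - i + 1)
        for i in range(1, n)
    ]
    top = max(logw)  # subtract the maximum to avoid overflow for large beta
    weights = [exp(x - top) for x in logw]
    total = sum(weights)
    return [w / total for w in weights]


def prob_bal4(beta):
    """Probability of Bal_4: the root split of four leaves is 2|2."""
    return beta_split(4, beta)[1]


for beta in (-1.5, 0.0, 10.0, 1000.0):
    print(beta, prob_bal4(beta), (3 * beta + 6) / (7 * beta + 18))
```

Under this assumption the two printed columns agree, which is consistent with the probability of $Bal_{4}$ being the rational function $\frac{3\beta+6}{7\beta+18}$ .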

To prove that $\lim_{n\rightarrow\infty}\tfrac{m_{2,n}}{\binom{n}{4}}=\tfrac{3}{7}$ , we can restrict to the subsequence of values $n=2^{k}$ , since Lemma 2.4 implies that $\tfrac{m_{2,n}}{\binom{n}{4}}$ is a monotone decreasing sequence. This subsequence is easier to deal with since $m_{2,2^{n}}$ counts the number of $4$ -subsets, $S\subset[2^{n}]$ of the leaves of the complete symmetric tree $T_{2^{n}}$ in $RB_{U}(2^{n})$ such that $T_{2^{n}}|_{S}=Bal_{4}$ . It is not hard to come up with a simple recurrence for this since $T_{2^{n}}$ has the recursive structure illustrated in Figure 5.

Note that $m_{2,2^{n}}=2m_{2,2^{n-1}}+\binom{2^{n-1}}{2}^{2}$ since the only ways we can choose a subset $S$ such that $T_{2^{n}}|_{S}=Bal_{4}$ are that the leaves in $S$ fall either entirely within the left or right subtrees or that $S$ has two leaves from both the left and right subtrees. The number of ways to choose a subset $S$ that falls entirely on the left or right side is $m_{2,2^{n-1}}$ by definition. The number of ways to choose two leaves from each side is $\binom{2^{n-1}}{2}^{2}$ . This recurrence can be solved to find an explicit formula for $m_{2,2^{n}}$ which is

$$m_{2,2^{n}}=\frac{2^{n}(2^{n}-1)(2^{n}-2)(3\cdot 2^{n}-5)}{168}.$$

Now we can simplify $\frac{m_{2,2^{n}}}{\binom{2^{n}}{4}}$ to get

$$\frac{m_{2,2^{n}}}{\binom{2^{n}}{4}}=\frac{3\cdot 2^{n}-5}{7(2^{n}-3)},$$

which converges to $\frac{3}{7}$ as $n$ tends to infinity. ∎
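The recurrence $m_{2,2^{n}}=2m_{2,2^{n-1}}+\binom{2^{n-1}}{2}^{2}$ is easy to iterate exactly. The sketch below compares it with one closed form that we derived from the recurrence, $\frac{N(N-1)(N-2)(3N-5)}{168}$ with $N=2^{n}$ ; the expression displayed in the original paper may be written differently. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb as binom


def bal4_count(n):
    """m_{2,2^n} computed from the recurrence, starting from a single leaf."""
    count = 0
    for k in range(1, n + 1):
        count = 2 * count + binom(2 ** (k - 1), 2) ** 2
    return count


def bal4_closed_form(n):
    """Candidate closed form N(N-1)(N-2)(3N-5)/168 with N = 2^n."""
    big_n = 2**n
    return big_n * (big_n - 1) * (big_n - 2) * (3 * big_n - 5) // 168


for n in range(2, 8):
    density = Fraction(bal4_count(n), binom(2**n, 4))
    print(n, bal4_count(n), bal4_closed_form(n), float(density))
```

The printed densities decrease towards $\tfrac{3}{7}$ , in line with the monotonicity used in the proof.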

Note that Theorem 4.2 does not generalize to higher dimensions as the set of beta splitting distributions is of strictly smaller dimension than the set of exchangeable sampling consistent distributions. We explore the discrepancy between these sets in more detail in the next sections.

5. Distributions on $EX_{5}^{\infty}$

There are three distinct tree shapes with five leaves so $EX_{5}$ is a $2$ -dimensional simplex in $\mathbb{R}^{3}$ . For the rest of this section we will use $Comb_{5}$ , $Gir_{5}$ , and $Bal_{5}$ to represent the trees pictured in Figure 6. Specifically, let $Comb_{5}$ denote the comb tree on five leaves, $Bal_{5}$ denote the balanced tree on five leaves and $Gir_{5}$ denote the giraffe tree on five leaves. We take coordinates $(p_{1},p_{2},p_{3})$ on $\mathbb{R}^{3}$ where $p_{1},p_{2},p_{3}$ represent the probability of obtaining $Comb_{5}$ , $Gir_{5}$ , and $Bal_{5}$ , respectively.

While we have not been able to give a complete description of the vertices of $EX_{5}^{n}$ for all $n$ , we are able to define some tree structures in $RB_{U}(n)$ that do yield vertices of $EX_{5}^{n}$ . We have already seen that the comb tree $Comb_{m}$ always yields a vertex of $EX_{n}^{m}$ for all $m$ and $n$ . Here we provide some other examples.

Definition 5.1.

For a tree $T\in RB_{U}(m)$ let $comb(T,n)$ be the tree that is obtained by creating a comb tree with $n$ leaves and replacing one of the two leaves at the deepest level with the tree $T$ .

Generally, if $T\in RB_{U}(m)$ then $comb(T,n)$ has $m+n-1$ leaves. For example, $Gir_{5}=comb(Bal_{4},2)$ . Note that it does not matter which of the two deepest leaves is replaced with $T$ since our trees are unlabelled.

Proposition 5.2.

Let $T_{n}=comb(Gir_{5},n-4)$ . Then $\pi_{5}(p_{T_{n}})$ is a vertex in $EX_{5}^{n}$ .

Proof.

First note that $T_{n}$ and $Comb_{n}$ are the only trees with $n$ leaves that do not have $Bal_{5}$ as a subtree. This means that $T_{n}$ and the comb tree fall on the line $p_{3}=0$ in $EX_{5}$ . Thus, the set $\{(p_{1},p_{2},0)\in EX_{5}\}\cap EX_{5}^{n}$ is a face of $EX_{5}^{n}$ for all $n\geq 5$ , since every distribution $p\in EX_{5}$ must satisfy the condition $p_{1}+p_{2}+p_{3}=1$ and thus $p_{1}+p_{2}\leq 1$ . Since $p_{1}+p_{2}=1$ is the same line as $p_{3}=0$ , it defines a face. Now since $\pi_{5}(p_{T_{n}})$ and $\pi_{5}(p_{Comb_{n}})$ are different points and are the only distributions of the form $\pi_{5}(p_{T})$ in this face, they must be vertices of this face and thus vertices of $EX_{5}^{n}$ . ∎

We now introduce another tree structure that will yield a vertex in $EX_{5}^{n}$ .

Definition 5.3.

For two positive integers $m$ and $n$ let $bicomb(m,n)$ denote the tree made by joining a comb tree of size $m$ and a comb tree of size $n$ together at a new root. We call such trees bicomb trees.

For example, $Bal_{5}=bicomb(2,3)$ .
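Definitions 5.1 and 5.3 translate directly into small constructors. The sketch below uses our own encoding (a leaf is the string `"*"` and an internal vertex is a sorted pair of shapes), and the names `comb_with` and `bicomb` are hypothetical stand-ins for $comb(T,n)$ and $bicomb(m,n)$ .

```python
LEAF = "*"


def join(a, b):
    """Join two shapes at a new root; sorting the children makes shapes unlabelled."""
    return tuple(sorted((a, b), key=repr))


def comb(k):
    """Comb shape on k >= 1 leaves."""
    shape = LEAF
    for _ in range(k - 1):
        shape = join(LEAF, shape)
    return shape


def comb_with(tree, n):
    """comb(T, n): a comb on n leaves with one deepest leaf replaced by T."""
    shape = tree
    for _ in range(n - 1):
        shape = join(LEAF, shape)
    return shape


def bicomb(m, n):
    """bicomb(m, n): combs of sizes m and n joined at a new root."""
    return join(comb(m), comb(n))


def num_leaves(shape):
    """Number of leaves of a shape."""
    return 1 if shape == LEAF else sum(num_leaves(child) for child in shape)


BAL4 = bicomb(2, 2)
GIR5 = comb_with(BAL4, 2)
BAL5 = bicomb(2, 3)
print(num_leaves(GIR5), num_leaves(comb_with(GIR5, 7 - 4)))
```

Because children are stored in sorted order, equality of two encoded shapes is equality of unlabelled trees, so `bicomb(2, 3)` and `bicomb(3, 2)` coincide.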

Lemma 5.4.

Let $T_{n}=bicomb(\lfloor\frac{n}{2}\rfloor,\lceil\frac{n}{2}\rceil)$ . Then $\pi_{5}(p_{T_{n}})$ is a vertex of $EX_{5}^{n}$ .

Proof.

First note that for $n\geq 5$ , the only trees in $RB_{U}(n)$ that never contain $Gir_{5}$ as a restriction tree are the comb tree and the bicomb trees. This means that in $EX_{5}^{n}$ , they are the only trees that fall on the edge $p_{2}=0$ . To show that $\pi_{5}(p_{T_{n}})$ is a vertex of $EX_{5}^{n}$ it remains to show that $\pi_{5}(p_{T_{n}})$ is extremal on this edge. We know that the comb tree is one of the extremal points on this edge and so the other extremal point will correspond to the bicomb tree with the highest density of $Bal_{5}$ as a restriction tree. Let $T^{\prime}=bicomb(i,n-i)$ be a bicomb tree for some $1\leq i\leq n-1$ . We let $b_{5}(T^{\prime})$ denote the number of times that $Bal_{5}$ occurs as a restriction tree of $T^{\prime}$ . From the structure of a bicomb tree we have

$$b_{5}(T^{\prime})=\binom{i}{2}\binom{n-i}{3}+\binom{i}{3}\binom{n-i}{2}.$$

This function is maximized when $i=\lfloor\frac{n}{2}\rfloor$ . ∎
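The maximization claim can be checked for small $n$ . The sketch assumes that $Bal_{5}$ arises from $bicomb(i,n-i)$ exactly when two leaves are chosen from one comb and three from the other, so that $b_{5}=\binom{i}{2}\binom{n-i}{3}+\binom{i}{3}\binom{n-i}{2}$ . The function names are hypothetical.

```python
from math import comb as binom


def bal5_count_bicomb(i, n):
    """Number of 5-subsets of bicomb(i, n - i) that restrict to Bal_5."""
    j = n - i
    return binom(i, 2) * binom(j, 3) + binom(i, 3) * binom(j, 2)


def best_splits(n):
    """All i in 1..n-1 maximising the Bal_5 count of bicomb(i, n - i)."""
    values = {i: bal5_count_bicomb(i, n) for i in range(1, n)}
    top = max(values.values())
    return [i for i, value in values.items() if value == top]


for n in range(5, 12):
    print(n, best_splits(n))
```

The count factors as $\frac{n-4}{3}\binom{i}{2}\binom{n-i}{2}$ , which explains why the most even split wins; by symmetry, $\lceil\frac{n}{2}\rceil$ is also a maximizer when $n$ is odd.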

Now we will show that the projection of the most balanced tree in $RB_{U}(n)$ is a vertex of $EX_{5}^{n}$ . To do this, we prove a few lemmas about the number of $Comb_{5}$ trees that can appear as subtrees of a tree. These results follow the basic outline of Lemmas 12 and 13 in [4], and are in some sense an extension of those results to $5$ leaf trees.

For a tree $T\in RB_{U}(n)$ let $c_{5}(T)$ count the number of $5$ -subsets, $S$ , of the leaves of $T$ such that $T|_{S}=Comb_{5}$ . Let $c_{4}(T)$ be defined similarly, but for $4$ leaf comb trees.

Lemma 5.5.

Let $T$ be as it is pictured in Figure 7 and $T^{\prime}$ obtained from $T$ by swapping the positions of $T_{2}$ and $T_{3}$ . For $i=0,1,2,3,4$ , let $n_{i}=\#L(T_{i})$ and without loss of generality choose $n_{1}\geq n_{2}$ and $n_{3}\geq n_{4}$ . If $n_{1}>n_{3}$ and $n_{2}>n_{4}$ then $c_{5}(T)\geq c_{5}(T^{\prime})$ . Furthermore, if $n\geq 7$ , then $c_{5}(T)>c_{5}(T^{\prime})$ .

Proof.

Without loss of generality assume that $n_{1}\geq n_{2}$ and $n_{3}\geq n_{4}$ and let $\Sigma_{z}$ denote the set of leaves of $T$ below the vertex $z$ . Note that by construction, this is the same as the set of leaves below the vertex $z$ in $T^{\prime}$ . If we take a $5$ -subset, $S$ , of the leaves of $T$ and $T^{\prime}$ then it is only possible for $T|_{S}\neq T^{\prime}|_{S}$ if $|S\cap\Sigma_{z}|\geq 4$ . It is straightforward to see that if $S\cap\Sigma_{z}$ has zero, one, two, or three elements, $T|_{S}=T^{\prime}|_{S}$ .

This means

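$$c_{5}(T)-c_{5}(T^{\prime})=(c_{5}(T_{z})-c_{5}(T^{\prime}_{z}))+n_{0}(c_{4}(T_{z})-c_{4}(T^{\prime}_{z})),$$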

where $T_{z}$ and $T^{\prime}_{z}$ denote the subtrees of $T$ and $T^{\prime}$ below $z$ . Note that for any tree $S\in RB_{U}(n)$ , it holds that

$$c_{4}(S)+b_{4}(S)=\binom{n}{4},$$

which gives

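$$c_{5}(T)-c_{5}(T^{\prime})=(c_{5}(T_{z})-c_{5}(T^{\prime}_{z}))+n_{0}(b_{4}(T^{\prime}_{z})-b_{4}(T_{z})),$$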

and $(b_{4}(T^{\prime}_{z})-b_{4}(T_{z}))$ is guaranteed to be positive by Lemma 12 of [4] so the term $n_{0}(b_{4}(T^{\prime}_{z})-b_{4}(T_{z}))$ is nonnegative. It remains to show that $(c_{5}(T_{z})-c_{5}(T^{\prime}_{z}))$ is nonnegative. We can explicitly enumerate these quantities in the following way:

[TABLE]

We can simplify this to get that

[TABLE]

Note that this quantity is greater than $0$ since $n_{1}>n_{3}$ and $n_{2}>n_{3}$ by assumption and $n_{i}\geq 1$ for $i=1,2,3,4$ . Note that if $n\geq 7$ , then we either have that $n_{0}\geq 1$ , or $\sum_{i=1}^{4}n_{i}\geq 7$ which both guarantee that $c_{5}(T)-c_{5}(T^{\prime})>0$ . ∎

This lemma essentially tells us that if the tree has an internal node that is unbalanced, we can find a tree that has $Comb_{5}$ appear less frequently as a restriction tree. We now have another lemma following in the style of [4].

Lemma 5.6.

Let $T$ be as it is pictured in Figure 8 and for $i=0,1,2$ , let $n_{i}=\#L(T_{i})$ and assume $n_{1}\geq n_{2}$ . We also assume that $n_{1}+n_{2}\geq 3$ . Then $c_{5}(T)\geq c_{5}(T^{\prime})$ . Furthermore, if $n\geq 7$ , then $c_{5}(T)>c_{5}(T^{\prime})$ .

Proof.

We will again proceed by showing that $c_{5}(T)-c_{5}(T^{\prime})\geq 0$ . By the same reasoning as that given in the last lemma we know that

[TABLE]

and the nonnegativity of the second term follows in the same manner that was described in the previous lemma. Now we can easily see that

[TABLE]

and so

[TABLE]

It is clear that the right hand side is always nonnegative. Note that if $n\geq 7$ , then either $n_{0}\geq 1$ or $n_{1}\geq 3$ . In both cases this guarantees that $c_{5}(T)-c_{5}(T^{\prime})>0$ .

∎

Combining these two lemmas together we get the following theorem. This theorem will immediately allow us to show that the projection of the most balanced tree in $RB_{U}(n)$ will always be a vertex in $EX_{5}^{n}$ .

Theorem 5.7.

For $n\geq 7$ , the minimum value of $c_{5}(T)$ is attained when every internal node of $T$ is maximally balanced.

Proof.

This proof also follows the strategy of [4]. We assume that $c_{5}$ attains its minimum value in $RB_{U}(n)$ at $T$ but that $T$ is not maximally balanced, and derive a contradiction. We let $z$ be a non-balanced internal node with balanced children $a$ and $b$ . We let $n_{a}$ and $n_{b}$ be the number of leaves of the trees rooted at $a$ and $b$ respectively. Then since $z$ is not balanced we have, without loss of generality, that $n_{a}\geq n_{b}+2$ . If $b$ is a leaf then by Lemma 5.6 we immediately have that $c_{5}(T)$ is not minimum since $n\geq 7$ . So we have that $n_{b}\geq 2$ and thus both $a$ and $b$ are balanced and must be internal nodes.

We now let $v_{1},v_{2}$ be the children of $a$ and $v_{3},v_{4}$ be the children of $b$ and take $n_{i}=\#L(T_{v_{i}})$ for $i=1,2,3,4$ and once again without loss of generality assume that $n_{1}\geq n_{2}$ and $n_{3}\geq n_{4}$ . Since both $a$ and $b$ are balanced it must be that $n_{1}=n_{2}$ or $n_{1}=n_{2}+1$ and $n_{3}=n_{4}$ or $n_{3}=n_{4}+1$ . Then the assumption that $n_{a}\geq n_{b}+2$ immediately gives us that

$$n_{1}+n_{2}\geq n_{3}+n_{4}+2.$$

Then by previous assumptions we get that $n_{1}>n_{3}$ . Now since $c_{5}$ is minimum at $T$ and $n\geq 7$ , we can apply Lemma 5.5 to get that $n_{4}\geq n_{2}$ . Stringing together these inequalities we get that

$$n_{1}>n_{3}\geq n_{4}\geq n_{2}.$$

But since $n_{1}=n_{2}$ or $n_{1}=n_{2}+1$ , the only possibility we have is that

$$n_{2}=n_{3}=n_{4}=n_{1}-1.$$

But then we get that $n_{1}+n_{2}=2n_{1}-1$ and $n_{3}+n_{4}=2n_{1}-2$ which contradicts the inequality $n_{1}+n_{2}\geq n_{3}+n_{4}+2$ . This tells us that any tree with at least $7$ leaves must be maximally balanced around every internal node if it attains the minimum value of $c_{5}$ on $RB_{U}(n)$ . Since there is only one tree that is maximally balanced at every internal node, there is a unique minimizer of $c_{5}$ in $RB_{U}(n)$ for $n\geq 7$ . ∎
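For small $n$ the statement of Theorem 5.7 can be confirmed by exhaustive search. The following sketch enumerates all shapes in $RB_{U}(n)$ and counts $Comb_{5}$ restrictions recursively; it is a brute-force check under our own encoding of shapes, not the argument of the proof, and all function names are hypothetical.

```python
from functools import lru_cache

LEAF = "*"


def join(a, b):
    """Join two shapes at a new root; sorting the children makes shapes unlabelled."""
    return tuple(sorted((a, b), key=repr))


def comb(k):
    """Comb shape on k >= 1 leaves."""
    shape = LEAF
    for _ in range(k - 1):
        shape = join(LEAF, shape)
    return shape


def balanced(n):
    """Maximally balanced shape on n leaves."""
    return LEAF if n == 1 else join(balanced(n // 2), balanced(n - n // 2))


@lru_cache(maxsize=None)
def shapes(n):
    """All rooted binary tree shapes on n leaves."""
    if n == 1:
        return frozenset([LEAF])
    found = set()
    for k in range(1, n // 2 + 1):
        for a in shapes(k):
            for b in shapes(n - k):
                found.add(join(a, b))
    return frozenset(found)


@lru_cache(maxsize=None)
def restrictions(shape, k):
    """Map each restriction shape to the number of k-subsets of leaves inducing it."""
    if k == 0:
        return {None: 1}
    if shape == LEAF:
        return {LEAF: 1} if k == 1 else {}
    left, right = shape
    out = {}
    for i in range(k + 1):  # i chosen leaves on the left, k - i on the right
        for a, count_a in restrictions(left, i).items():
            for b, count_b in restrictions(right, k - i).items():
                induced = b if a is None else a if b is None else join(a, b)
                out[induced] = out.get(induced, 0) + count_a * count_b
    return out


def c5(shape):
    """Number of 5-subsets of the leaves whose restriction is Comb_5."""
    return restrictions(shape, 5).get(comb(5), 0)


def minimisers(n):
    """All shapes on n leaves attaining the minimum value of c5."""
    values = {shape: c5(shape) for shape in shapes(n)}
    low = min(values.values())
    return [shape for shape, value in values.items() if value == low]


for n in (7, 8):
    print(n, len(shapes(n)), minimisers(n) == [balanced(n)])
```

The search is only feasible for small $n$ since the number of shapes grows quickly, but it gives an independent check of the base cases.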

Corollary 5.8.

Let $T_{n}$ be the maximally balanced tree in $RB_{U}(n)$ . Then $\pi_{5}(T_{n})$ is a vertex of $EX_{5}^{n}$ .

Proof.

The Corollary can be verified computationally for $n=6$ . For $n\geq 7$ Theorem 5.7 shows that $T_{n}$ is the unique tree that attains the minimum value of $c_{5}$ among all trees in $RB_{U}(n)$ . Since $p_{1}$ is the coordinate of $Comb_{5}$ , every point of $EX_{5}^{n}$ satisfies $p_{1}\geq c_{5}(T_{n})/\binom{n}{5}$ , and it holds that $\{(p_{1},p_{2},p_{3})\in EX_{5}:p_{1}=c_{5}(T_{n})/\binom{n}{5}\}\cap EX_{5}^{n}=\{\pi_{5}(T_{n})\}$ , thus $\pi_{5}(T_{n})$ is a vertex of $EX_{5}^{n}$ . ∎

We have another Corollary that relates the exchangeable and sampling consistent distributions to the $\beta$ -splitting model.

Corollary 5.9.

The projection of the most balanced tree in $EX_{5}^{n}$ approaches the $\beta=\infty$ point on the beta-splitting model as $n\to\infty$ .

Proof.

It is enough to show that the complete symmetric tree $T_{2^{n}}\in RB_{U}(2^{n})$ satisfies this property. We can just count the number of times that $Gir_{5}$ and $Bal_{5}$ occur as restriction trees when we restrict to a 5-subset of the leaves. We will call these quantities $g_{5,2^{n}}$ and $b_{5,2^{n}}$ respectively. Once again, since $T_{2^{n}}$ has the structure depicted in Figure 5, we can use this structure to write down a simple recurrence for $g_{5,2^{n}}$ and $b_{5,2^{n}}$ and then solve the recurrence. Since we can either choose our subset to be on either the right or left side of the tree or 3 leaves from one side and 2 leaves from the other, $b_{5}$ is simply

$$b_{5,2^{n}}=2b_{5,2^{n-1}}+2\binom{2^{n-1}}{3}\binom{2^{n-1}}{2}.$$

As for $g_{5}$ , we can once again choose our subset to be on either the right or left side of the tree or we can choose to have 1 leaf on a side of the tree and a 4 leaf symmetric tree on the other. This can be done in just $2^{n-1}m_{2,2^{n-1}}$ ways. So $g_{5}(T_{2^{n}})$ is just

$$g_{5,2^{n}}=2g_{5,2^{n-1}}+2\cdot 2^{n-1}m_{2,2^{n-1}}.$$

Both of these recurrences can be solved explicitly using a computer algebra system. We get that

[TABLE]

We can then find the probabilities $p_{2}$ and $p_{3}$ of $Gir_{5}$ and $Bal_{5}$ by simply dividing out by $\binom{2^{n}}{5}$ . This yields

[TABLE]

Clearly as $n\rightarrow\infty$ we have $p_{3}\rightarrow\frac{2}{3}$ and $p_{2}\rightarrow\frac{1}{7}$ .

On the other hand, we recall that the probability of obtaining a tree under the beta-splitting model is just a rational function in $\beta$ that can be explicitly calculated. We can then find the limit of these rational functions to get that the beta-splitting curve approaches the point

$$(p_{1},p_{2},p_{3})=\left(\tfrac{4}{21},\tfrac{1}{7},\tfrac{2}{3}\right)$$

as $\beta\to\infty$ as well and so the projection of $T_{2^{n}}$ in $EX_{5}^{2^{n}}$ is approaching the $\beta=\infty$ point on the curve. ∎
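The limits $p_{2}\rightarrow\tfrac{1}{7}$ and $p_{3}\rightarrow\tfrac{2}{3}$ can be reproduced by iterating the recurrences exactly. In this sketch we assume the recurrences take the form $b_{5,2^{n}}=2b_{5,2^{n-1}}+2\binom{2^{n-1}}{3}\binom{2^{n-1}}{2}$ and $g_{5,2^{n}}=2g_{5,2^{n-1}}+2\cdot 2^{n-1}m_{2,2^{n-1}}$ , where the factor $2$ accounts for the choice of side; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb as binom


def subtree_counts(n):
    """Return (m2, g5, b5) for the complete symmetric tree on 2^n leaves."""
    m2 = g5 = b5 = 0
    for k in range(1, n + 1):
        half = 2 ** (k - 1)
        # g5 uses the Bal_4 count of the half-size tree, so update m2 last.
        g5 = 2 * g5 + 2 * half * m2
        b5 = 2 * b5 + 2 * binom(half, 3) * binom(half, 2)
        m2 = 2 * m2 + binom(half, 2) ** 2
    return m2, g5, b5


def five_leaf_densities(n):
    """Return (p2, p3), the densities of Gir_5 and Bal_5 in T_{2^n}."""
    _, g5, b5 = subtree_counts(n)
    total = binom(2**n, 5)
    return Fraction(g5, total), Fraction(b5, total)


for n in (3, 5, 10, 20):
    p2, p3 = five_leaf_densities(n)
    print(n, float(p2), float(p3))
```

For $n=3$ the counts add up to $\binom{8}{5}=56$ with no $Comb_{5}$ restrictions, which matches the fact that the complete symmetric tree on eight leaves minimizes $c_{5}$ .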

These are all of the tree structures in $RB_{U}(n)$ we have been able to find that always appear as vertices in $EX_{5}^{n}$ . We end this section with Figure 9, which pictures all of the families of exchangeable and sampling consistent distributions that we have discussed and the vertices of $EX_{n}^{m}$ for some small values of $m$ .

6. Distributions on $EX_{n}^{\infty}$

While we are not able to get a description of the vertices of $EX_{n}^{m}$ for general $m$ and $n$ , it is possible to describe $EX_{n}^{\infty}$ using the multinomial model that was introduced in Section 3.2. In particular, this shows that multinomial models converge as an inner limit to $EX_{n}^{\infty}$ .

Theorem 6.1.

Let $\{T_{m}\}_{m=n}^{\infty}$ be a sequence of tree shapes and $p^{(m)}=\pi_{n}(T_{m})$ be the corresponding sequence of distributions. If $p^{(m)}$ converges to some $p\in EX_{n}^{\infty}$ as $m$ goes to infinity, then there exists a sequence of multinomial distributions $\{d^{(m)}\}_{m=n}^{\infty}$ that also converges to $p$ as $m$ goes to infinity.

Proof.

Define $d^{(m)}$ to be the multinomial distribution on the tree $T_{m}$ with the edge parameter vector $(t_{e}|e\in E(T_{m}))$ such that $t_{e}=\frac{1}{m}$ if one of the vertices in $e$ is one of the original $m$ leaves of $T_{m}$ and $t_{e}=0$ otherwise. Note that these nonzero edge parameters are bijectively associated to the leaves of $T_{m}$ and we may call the set of nonzero edge parameters $L(T_{m})$ meaning the leaf set of $T_{m}$ . To show that $d^{(m)}$ also converges to $p$ , it is enough to show that for every tree $T\in RB_{U}(n)$ , $\lim_{m\to\infty}d^{(m)}(T)=\lim_{m\to\infty}p^{(m)}(T)$ . Fix a labelling of $T_{m}$ and let $c_{T_{m}}(T)$ be the number of sets $S\subseteq[m]$ such that $shape(T_{m}|_{S})=T$ . By Corollary 2.8, $p^{(m)}(T)$ is the induced subtree density of $T$ in $T_{m}$ , so $p^{(m)}(T)=\frac{c_{T_{m}}(T)}{\binom{m}{n}}$ . So

[TABLE]

On the other hand, let $M^{(m)}=\{A\in M_{n}^{T_{m}}|{T_{m}}_{A}=T,poly(A)\neq 0\}$ , then

[TABLE]

by definition, and we note that, by requiring that multisets $A\in M^{(m)}$ satisfy $poly(A)\neq 0$ , $M^{(m)}$ only includes multisets whose support is contained in $L(T_{m})$ . Also note that $poly(A)$ is either $0$ or $\binom{n}{m_{A}(t_{e_{1}}),m_{A}(t_{e_{2}}),\ldots m_{A}(t_{e_{2m-1}})}\frac{1}{m^{n}}$ since all the edge parameters are $0$ or $\frac{1}{m}$ . So to understand the quantity $d^{(m)}(T)$ it is enough to know the coefficient of $\frac{1}{m^{n}}$ . Note that any multiset $A$ has a naturally associated integer partition of $n$ , formed by taking the multiplicities of each unique element that appears in it. Call this integer partition the weight of $A$ , denoted $wt(A)$ , and let $M_{\lambda}^{(m)}$ be the set of multisets in $M^{(m)}$ with weight $\lambda$ . Now observe that for $A,B\in M_{\lambda}^{(m)}$ , $poly(A)=poly(B)$ since the value of the multinomial coefficient is totally determined by the weight and the product of the edge parameters is always $\frac{1}{m^{n}}$ . If we let $\binom{n}{\lambda}$ be the value of the multinomial coefficient then the formula for $d^{(m)}(T)$ can be rewritten as

[TABLE]

but we can bound the quantity $|M_{\lambda}^{(m)}|$ . We note that the quantity $|(M_{n}^{T_{m}})_{\lambda}|$ , of all multisets on the edge parameters of $T_{m}$ of size $n$ with weight $\lambda$ , is at most $l(\lambda)!\binom{m}{l(\lambda)}$ where $l(\lambda)$ is the length of the partition $\lambda$ . This is because there are $\binom{m}{l(\lambda)}$ choices for which elements to use in the multiset and at most $l(\lambda)!$ unique multisets for each choice of elements. Since $l(\lambda)!\binom{m}{l(\lambda)}$ is a polynomial in $m$ of degree $l(\lambda)$ , we have that

[TABLE]

since the partition $\lambda=(1,1,\ldots 1)$ is the only partition where $|M_{(1,1,\ldots,1)}^{(m)}|$ is of the order $m^{n}$ , and so is the only term that contributes to the limit. Now we note that the multisets $A\in M_{(1,1,\ldots,1)}^{(m)}$ correspond exactly to choosing subsets of the leaves of $T_{m}$ that yield $T$ upon restriction since the only edges that can be in $A$ are those corresponding to leaves, every leaf can be chosen at most once, and $shape({T_{m}}_{A})=T$ . So $|M_{(1,1,\ldots,1)}^{(m)}|=c_{T_{m}}(T)$ , and so

[TABLE]

and since $p^{(m)}$ converges to $p$ , it must be that $d^{(m)}$ also does. ∎
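The construction in this proof amounts to comparing $n$ draws of leaves without replacement ( $p^{(m)}$ ) with $n$ draws with replacement ( $d^{(m)}$ ). The sketch below computes both exactly for the most balanced tree, assuming that a leaf drawn $k$ times is replaced by a comb on $k$ leaves; the choice of tree and all function names are ours and purely illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, combinations_with_replacement
from math import comb as binom, factorial

LEAF = "*"


def join(a, b):
    """Join two shapes at a new root; sorting the children makes shapes unlabelled."""
    return tuple(sorted((a, b), key=repr))


def comb_shape(k):
    """Comb shape on k >= 1 leaves."""
    shape = LEAF
    for _ in range(k - 1):
        shape = join(LEAF, shape)
    return shape


def balanced(leaves):
    """Most balanced tree on a list of integer leaf labels, as nested tuples."""
    if len(leaves) == 1:
        return leaves[0]
    half = len(leaves) // 2
    return (balanced(leaves[:half]), balanced(leaves[half:]))


def induced_shape(tree, mult):
    """Shape obtained when leaf x is drawn mult[x] times (assumed: k draws give a comb)."""
    if not isinstance(tree, tuple):
        k = mult.get(tree, 0)
        return comb_shape(k) if k else None
    left, right = induced_shape(tree[0], mult), induced_shape(tree[1], mult)
    if left is None:
        return right
    if right is None:
        return left
    return join(left, right)


def subtree_density(tree, m, n):
    """p^(m): induced subtree densities, n leaves drawn without replacement."""
    counts = Counter(
        induced_shape(tree, dict.fromkeys(s, 1)) for s in combinations(range(m), n)
    )
    return {shape: Fraction(c, binom(m, n)) for shape, c in counts.items()}


def leaf_multinomial(tree, m, n):
    """d^(m): weight 1/m on each pendant edge, n draws with replacement."""
    dist = Counter()
    for multiset in combinations_with_replacement(range(m), n):
        mult = Counter(multiset)
        coeff = factorial(n)
        for k in mult.values():
            coeff //= factorial(k)
        dist[induced_shape(tree, mult)] += Fraction(coeff, m**n)
    return dict(dist)


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q)) / 2


for m in (8, 16):
    tree = balanced(list(range(m)))
    gap = total_variation(subtree_density(tree, m, 4), leaf_multinomial(tree, m, 4))
    print(m, float(gap))
```

Conditioned on all draws being distinct, sampling with replacement coincides with sampling without replacement, so the gap is at most $1-\frac{m(m-1)\cdots(m-n+1)}{m^{n}}$ , which is of order $\frac{1}{m}$ for fixed $n$ .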

Corollary 6.2.

Suppose that $p\in EX_{n}^{m}$ for some $m>n$ . Then for any tree $S\in RB_{U}(n)$ , $p(S)$ can be approximated with a distribution $d\in EX_{n}^{\infty}$ with error $\frac{C}{m}$ , where $C$ is a constant with respect to $m$ that does not depend on the tree $S$ .

Proof.

Note that if $p\in EX_{n}^{m}$ , then we have for every $S\in RB_{U}(n)$ ,

[TABLE]

where the above combination is convex by Lemma 2.5. Then let $d^{T}$ be defined as the multinomial distribution $d^{T}$ on $T$ just as $d^{(m)}$ is defined for $T_{m}$ in the previous theorem. Then recall from the proof of the previous theorem that

[TABLE]

where $M_{\lambda}^{T}=\{A\in M_{n}^{T}|T_{A}=S,~{}poly(A)\neq 0,~{}wt(A)=\lambda\}$ . Also recall from the proof of the previous theorem that $|M_{(1,1,\ldots,1)}^{T}|=c_{T}(S)$ . Combining these facts with the definition of $\pi_{n}(p_{T})$ and the triangle inequality gives

[TABLE]

and we now bound each term on the right hand side of this inequality.

To bound the first term in equation (4), note that $c_{T}(S)$ is a nonnegative quantity and is bounded above by $\binom{m}{n}$ . This gives the inequality

[TABLE]

where $C_{1}\in\mathbb{R}$ is a constant. Note that this constant does not depend on the trees $T$ and $S$ .

To bound the second term we again recall from the proof of the previous theorem that $|M_{\lambda}^{T}|\leq l(\lambda)!\binom{m}{l(\lambda)}$ for each partition $\lambda$ of $n$ . Then we have that

[TABLE]

but since $\lambda\neq(1,1,\ldots,1)$ , it must be that $l(\lambda)\leq n-1$ so $l(\lambda)!\binom{m}{l(\lambda)}\leq m^{n-1}$ for all the remaining partitions $\lambda$ . Applying this fact to the right hand side of equation (6) gives the bound

[TABLE]

where $C_{2}\in\mathbb{R}$ is a constant that also does not depend on the trees $T$ and $S$ . Applying the bounds for each term to equation (4) and setting $C=C_{1}+C_{2}$ gives

[TABLE]

and again we note that $C$ is independent of the trees $T$ and $S$ since $C_{1}$ and $C_{2}$ are. We are now ready to construct a distribution $d\in EX_{n}^{\infty}$ that gives the desired result. From the discussion of the multinomial model, we have that each distribution $d^{T}\in EX_{n}^{\infty}$ and so from the convexity of $EX_{n}^{\infty}$ we get

[TABLE]

We can now use the expression for $p$ we began with and the bound obtained in equation (8) to get that

[TABLE]

∎

Theorem 6.1 gives that the limit of any convergent sequence $(v_{m})_{m\geq 1}$ where $v_{m}\in V(EX_{n}^{m})$ can also be realized as the limit of points coming from multinomial models. Corollary 6.2 shows that if we have a distribution in $EX_{n}$ that can be extended to part of a finitely sampling consistent family, then it can be approximated with an infinitely sampling consistent distribution. With Theorem 6.1 and the following theorem, we will show that $EX_{n}^{\infty}$ is actually the convex hull of all limits of convergent sequences of vertices, and thus the convex hull of limits of distributions drawn from the multinomial model. To do this we need a basic proposition from convex analysis, whose proof is included for completeness.

Proposition 6.3.

Let $(P_{m})_{m\geq 1}$ be a sequence of polytopes in $\mathbb{R}^{n}$ such that for all $m\geq 1$ , $P_{m+1}\subseteq P_{m}$ . Let

[TABLE]

where the bar denotes the closure in the Euclidean topology. Then $P=\cap_{m=1}^{\infty}P_{m}$ .

Proof.

It is straightforward to see that $P\subseteq\cap_{m=1}^{\infty}P_{m}$ . To show that the sets are equal suppose that there is $p\in(\cap_{m=1}^{\infty}P_{m})\setminus P$ . Since $P$ is closed and convex, the Basic Separation Theorem of convex analysis implies there must exist an affine functional $\ell$ with $\ell(p)<0$ and $\ell(w)>0$ for all $w\in P$ . We also have that since $p\in\cap_{m=1}^{\infty}P_{m}$ , for each $m\geq 1$ , $p$ can be written as

[TABLE]

where the $v_{j}^{(m)}$ are the vertices of $P_{m}$ . Then because $\ell(p)<0$ it must be that for each $m$ , there exists at least one vertex $v_{i_{m}}^{(m)}$ of $P_{m}$ such that $\ell(v_{i_{m}}^{(m)})<0$ . Since all the points $v_{j}^{(m)}$ lie in $P_{1}$ which is a compact set, there exists a convergent subsequence $(v_{i_{m_{k}}}^{(m_{k})})_{k\geq 1}$ with limit $v\in P$ , thus $\ell(v)>0$ . But it also holds that

[TABLE]

which is a contradiction. ∎

Corollary 6.4.

Let $d_{T_{m}}^{(m)}$ denote the specific multinomial model construction on the tree $T_{m}\in RB_{U}(m)$ described in Theorem 6.1. Then

[TABLE]

Proof.

Recall that $EX_{n}^{\infty}=\cap_{m=n}^{\infty}EX_{n}^{m}$ , thus by Proposition 6.3,

[TABLE]

since the vertices of $EX_{n}^{m}$ correspond to a subset of the points $\pi_{n}(T_{m})$ . Applying Theorem 6.1 to the sequence $(\pi_{n}(T_{m}))_{m\geq 1}$ gives the result. ∎

Corollary 6.4 shows that every exchangeable and infinitely sampling consistent distribution is either a convex combination of limits of multinomial distributions or a limit point of points in that set. Understanding the structure of the multinomial models may shed greater light on the structure of $EX_{n}^{\infty}$ as a whole. We view Corollary 6.2 and Corollary 6.4 as the rooted binary tree analogue to Theorems 3 and 4 in [6]; in essence, they are finite forms of a de Finetti-type theorem for rooted binary trees. As previously mentioned, the work done in [10] and [9] establishes a more typical de Finetti theorem in the sense that it shows every infinitely sampling consistent sequence of distributions can be obtained by sampling from a limit object using techniques from probability theory.

We also note that the requirement that the induced subtree densities converge is quite similar to the idea of graph convergence that appears in [13] and that many of the ideas in the theory of graph limits may also be applied to trees. The very well developed theory of graph limits contains many equivalent versions of the limiting object (see Theorem 11.52 in [13]). The work done in [10] and [9] makes the connection between the limiting object, a random real tree, and an infinitely sampling consistent model. It is still unknown if this can be connected to ideas such as tree parameters (the induced subtree density for instance) and to metrics on finite trees as has been done in the theory of graph limits. It seems that many of these equivalences hold but different techniques will be required.

Acknowledgments

Benjamin Hollering and Seth Sullivant were partially supported by the US National Science Foundation (DMS 1615660). Thanks to Dávid Papp for a helpful conversation regarding Proposition 6.3.

Bibliography

[1] David Aldous. Probability distributions on cladograms. In Random discrete structures (Minneapolis, MN, 1993), volume 76 of IMA Vol. Math. Appl., pages 1–18. Springer, New York, 1996.
[2] Daniel Irving Bernstein, Lam Si Tung Ho, Colby Long, Mike Steel, Katherine St. John, and Seth Sullivant. Bounds on the expected size of the maximum agreement subtree. SIAM J. Discrete Math., 29(4):2065–2074, 2015.
[3] David Bryant, Andy McKenzie, and Mike Steel. The size of a maximum agreement subtree for random binary trees. In Bioconsensus (Piscataway, NJ, 2000/2001), volume 61 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci., pages 55–65. Amer. Math. Soc., Providence, RI, 2003.
[4] T. M. Coronado, A. Mir, F. Rosselló, and G. Valiente. A balance index for phylogenetic trees based on quartets. arXiv e-prints, March 2018.
[5] D. M. de Vienne, T. Giraud, and O. C. Martin. A congruence index for testing topological similarity between trees. Bioinformatics, 23:3119–3124, 2007.
[6] Persi Diaconis. Finite forms of de Finetti's theorem on exchangeability. Synthese, 36(2):271–281, Oct 1977.
[7] Persi Diaconis and Svante Janson. Graph limits and exchangeable random graphs. Rend. Mat. Appl. (7), 28(1):33–61, 2008.
[8] Daniel J. Ford. Probabilities on cladograms: Introduction to the alpha model. ProQuest LLC, Ann Arbor, MI, 2006. Thesis (Ph.D.)–Stanford University.
