Mutations on a Random Binary Tree with Measured Boundary
Jean-Jil Duchamps, Amaury Lambert

Affiliations (both authors): LPMA, UMR 7599, UPMC Univ Paris 06; CIRB, UMR 7241, Collège de France.
Abstract
Consider a random real tree whose leaf set, or boundary, is endowed with a finite mass measure. Each element of the tree is further given a type, or allele, inherited from the most recent atom of a random point measure (infinitely-many-allele model) on the skeleton of the tree. The partition of the boundary into distinct alleles is the so-called allelic partition.
In this paper, we are interested in the infinite trees generated by supercritical, possibly time-inhomogeneous, binary branching processes, and in their boundary, which is the set of particles ‘co-existing at infinity’. We prove that any such tree can be mapped to a random, compact ultrametric tree called a coalescent point process, endowed with a ‘uniform’ measure on its boundary which is the limit as $t \to \infty$ of the properly rescaled counting measure of the population at time $t$.
We prove that the clonal (i.e., carrying the same allele as the root) part of the boundary is a regenerative set that we characterize. We then study the allelic partition of the boundary through the measures of its blocks. We also study the dynamics of the clonal subtree, which is a Markovian increasing tree process as mutations are removed.
Keywords and phrases: coalescent point process; branching process; random point measure; allelic partition; regenerative set; tree-valued process.
MSC2000 subject classifications: primary 05C05, 60J80; secondary 54E45; 60G51; 60G55; 60G57; 60K15; 92D10.
1 Introduction
In this paper, we give a new flavor of an old problem of mathematical population genetics which is to characterize the so-called allelic partition of a population. To address this problem, one needs to specify a model for the genealogy (i.e., a random tree) and a model for the mutational events (i.e., a point process on the tree). Two typical assumptions that we will adopt here are: the infinite-allele assumption, where each mutation event confers a new type, called allele, to its carrier; and the neutrality of mutations, in the sense that co-existing individuals are exchangeable, regardless of the alleles they carry. Here, our goal is to study the allelic partition of the boundary of some random real trees that can be seen as the limits of properly rescaled binary branching processes.
In a discrete tree, a natural object describing the allelic partition without labeling alleles is the allele frequency spectrum $(A_k)_{k \ge 1}$, where $A_k$ is the number of alleles carried by exactly $k$ co-existing individuals in the population. In the present paper, we start from a time-inhomogeneous, supercritical binary branching process with finite population $N(t)$ at any time $t$, and we are interested in the allelic partition of individuals ‘co-existing at infinity’ ($t \to \infty$), that is, the allelic partition at the tree boundary. To define the analogue of the frequency spectrum, we need to equip the tree boundary with a measure $\ell$, which we do as follows. Roughly speaking, if $N_u(t)$ is the number of individuals co-existing at time $t$ in the subtree consisting of descendants of the same fixed individual $u$, the measure $\ell$ of the boundary of this subtree is proportional to $\lim_{t \to \infty} N_u(t)/N(t)$. It is shown in Section 5 that the tree boundary of any supercritical branching process endowed with the (properly rescaled) tree metric and the measure $\ell$ has the same law as a random real tree, called coalescent point process (CPP), generated from a Poisson point process and equipped with the so-called comb metric [21] and the Lebesgue measure. Taking this result for granted, we will focus in Sections 2, 3 and 4 on coalescent point processes with mutations.
In the literature, various models of random trees and their associated allelic partitions have been considered. The most renowned result in this context is Ewens’ Sampling Formula [13], a formula that describes explicitly the distribution of the allele frequency spectrum in a sample of $n$ co-existing individuals taken from a stationary population with genealogy given by the Moran model with population size $N$ and mutations occurring at birth with probability $\theta/N$. When time is rescaled by $N$ and $N \to \infty$, this model converges to the Kingman coalescent [18] with Poissonian mutations occurring at rate $\theta$ along the branches of the coalescent tree. In the same vein, a wealth of recent papers has dealt with the allelic partition of a sample taken from a $\Lambda$-coalescent or a $\Xi$-coalescent with Poissonian mutations, e.g., [5, 14, 15, 4].
In parallel, several authors have studied the allelic partition in the context of branching processes, starting with [16] and the monograph [24]; see [10] and the references therein. In a more recent series of papers [19, 8, 9, 11], the second author and his co-authors have studied the allelic partition at a fixed time of so-called ‘splitting trees’, which are discrete branching trees where individuals live i.i.d. lifetimes and give birth at constant rate. In particular, they obtained the almost sure convergence of the normalized frequency spectrum as $t \to \infty$ [8] as well as the convergence in distribution of the (properly rescaled) sizes of the most abundant alleles [9]. The limiting spectrum of these trees is to be contrasted with the spectrum of their limit, which is the subject of the present study, as explained earlier.
Another subject of interest is the allelic partition of the entire progeny of a (sub)critical branching process, as studied in particular in [7]. The scaling limit of critical branching trees with mutations is a Brownian tree with Poissonian mutations on its skeleton. Cutting such a tree at the mutation points gives rise to a forest of trees whose distribution is investigated in the last section of [7], and relates to cuts of Aldous’ CRT in [3] or the Poisson snake process [1]. The couple of previously cited works not only deal with the limits of allelic partitions for the whole discrete tree, but also tackle the limiting object directly. This is also the goal of the present work, but with quite different aims.
First, we construct in Section 2 an ultrametric tree with boundary measured by a ‘Lebesgue measure’ $\ell$, from a Poisson point process with infinite intensity $\nu$, on which we superimpose Poissonian neutral mutations with intensity measure $\mu$. Section 2 ends with Proposition 2.12, which states that the total number of mutations in any subtree is either finite a.s. or infinite a.s. according to an explicit criterion involving $\nu$ and $\mu$.
The structure of the allelic partition at the boundary is studied in detail in Section 3. Theorem 3.3 ensures that the subset of the boundary carrying no mutations (or clonal set) is a (killed) regenerative set with explicit Laplace exponent in terms of and and measure given in Corollary 3.8. The mean intensity of the allele frequency spectrum at the boundary is defined by , where the sum is taken over all allelic clusters at the boundary. It is explicitly expressed in Proposition 3.11. An a.s. convergence result as the radius of the tree goes to infinity is given in Proposition 3.14 for the properly rescaled number of alleles with measure larger than , which is the analogue of in the discrete setting.
Section 4 is dedicated to the study of the dynamics of the clonal (mutation-free) subtree when mutations are added or removed through a natural coupling of mutations in the case when . It is straightforward that this process is Markovian as mutations are added. As mutations are removed, the growth process of clonal trees also is Markovian, and its semigroup and generator are provided in Theorem 4.2.
Section 5 is devoted to the links between measured coalescent point processes and measured pure-birth trees which motivate the present study. Lemma 5.5 gives a representation of every CPP with measured boundary, in terms of a rescaled pure-birth process with boundary measured by the rescaled counting measures at fixed times. Conversely, Theorem 5.6 gives a representation of any such pure-birth process in terms of a CPP with intensity measure , as in the case of the Brownian tree.
2 Preliminaries and Construction
2.1 Discrete Trees, Real Trees
Let us recall some definitions of discrete and real trees, which will be used to define the tree given by a so-called coalescent point process.
In graph theory, a tree is an acyclic connected graph. We call discrete trees such graphs that are labeled according to Ulam–Harris–Neveu’s notation by labels in the set $\mathcal{U}$ of finite sequences of non-negative integers:
$$\mathcal{U} := \bigcup_{n \ge 0} \mathbb{Z}_+^n,$$
with the convention $\mathbb{Z}_+^0 = \{\emptyset\}$.
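Definition 2.1.
A rooted discrete tree is a subset $T$ of $\mathcal{U}$ such that
- $\emptyset \in T$, and $\emptyset$ is called the root of $T$.
- For $u \in \mathcal{U}$ and $j \in \mathbb{Z}_+$, if $uj \in T$ then $u \in T$.
- For $u \in \mathcal{U}$ and $j \in \mathbb{Z}_+$ such that $uj \in T$, for $0 \le i \le j$, we have $ui \in T$, and $ui$ is called a child of $u$.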
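For $n \ge 0$, the restriction of $T$ to the first $n$ generations is defined by:
$$T_{|n} := \{u \in T : |u| \le n\},$$
where $|u|$ denotes the length of a finite sequence $u$. For $u, v \in T$, if there is $w \in \mathcal{U}$ such that $v = uw$, then $u$ is said to be an ancestor of $v$, noted $u \preceq v$. Generally, let $u \wedge v$ denote the most recent common ancestor of $u$ and $v$, that is the longest word $w \in T$ such that $w \preceq u$ and $w \preceq v$. The edges of $T$ as a graph join the parents $u$ and their children $ui$.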
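The labeling conventions above can be illustrated with a small sketch. It assumes that labels are represented as Python tuples of non-negative integers (the empty tuple standing for the root), that children are numbered from 0, and that "ancestor" is meant in the weak sense (a vertex is an ancestor of itself). The function names are hypothetical and serve only for illustration.

```python
from typing import Set, Tuple

Label = Tuple[int, ...]  # a finite word of non-negative integers; () is the root


def is_rooted_discrete_tree(tree: Set[Label]) -> bool:
    """Check the three conditions of a rooted discrete tree (assumed convention)."""
    if () not in tree:
        return False
    for u in tree:
        if not u:
            continue  # the root has no parent to check
        parent, j = u[:-1], u[-1]
        # the parent of every non-root vertex must belong to the tree
        if parent not in tree:
            return False
        # all elder siblings parent+(0,), ..., parent+(j-1,) must belong to the tree
        if any(parent + (i,) not in tree for i in range(j)):
            return False
    return True


def is_ancestor(u: Label, v: Label) -> bool:
    """u is a (weak) ancestor of v when u is a prefix of v."""
    return v[:len(u)] == u


def mrca(u: Label, v: Label) -> Label:
    """Most recent common ancestor: the longest common prefix of u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return u[:n]


def restrict(tree: Set[Label], n: int) -> Set[Label]:
    """Restriction of the tree to its first n generations."""
    return {u for u in tree if len(u) <= n}
```

For instance, with the tree `{(), (0,), (1,), (0, 0), (0, 1)}`, the most recent common ancestor of `(0, 0)` and `(0, 1)` is `(0,)`, and the restriction to the first generation keeps only the root and its two children.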
For a discrete tree , we define the boundary of as
[TABLE]
and we equip with the -field generated by the family , where
[TABLE]
Remark 2.2.
With a fixed discrete tree , a finite measure on is characterized by the values . Reciprocally if the number of children of is finite for each , by Carathéodory’s extension theorem, any finitely additive map extends uniquely into a finite measure on .
By assigning a positive length to every edge of a discrete tree, one gets a so-called real tree. Real trees are defined more generally as follows, see e.g. [12].
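Definition 2.3.
A metric space $(\mathbb{T}, d)$ is a real tree if for all $x, y \in \mathbb{T}$,
- There is a unique isometry $\varphi_{x,y} : [0, d(x,y)] \to \mathbb{T}$ such that $\varphi_{x,y}(0) = x$ and $\varphi_{x,y}(d(x,y)) = y$,
- All continuous injective paths from $x$ to $y$ have the same range, equal to $\varphi_{x,y}([0, d(x,y)])$.
This unique path from $x$ to $y$ is written $[\![x, y]\!]$. The degree of a point $x \in \mathbb{T}$ is defined as the number of connected components of $\mathbb{T} \setminus \{x\}$, so that we may define:
- The leaves of $\mathbb{T}$ are the points with degree $1$.
- The internal nodes of $\mathbb{T}$ are the points with degree $2$.
- The branching points of $\mathbb{T}$ are the points with degree larger than $2$.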
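One can root a real tree by distinguishing a point $\rho \in \mathbb{T}$, called the root.
From this definition, one can see that for a rooted real tree $(\mathbb{T}, d, \rho)$, for all $x, y \in \mathbb{T}$, there exists a unique point $a \in \mathbb{T}$ such that $[\![\rho, x]\!] \cap [\![\rho, y]\!] = [\![\rho, a]\!]$. We call $a$ the most recent common ancestor of $x$ and $y$, noted $x \wedge y$. There is also an intrinsic order relation in a rooted tree: if $x \wedge y = x$, that is if $x \in [\![\rho, y]\!]$, then $x$ is called an ancestor of $y$, noted $x \preceq y$.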
We will call a rooted real tree a simple tree if it can be defined from a discrete tree by assigning a length to each edge. From now on, we will restrict our attention to simple trees.
Definition 2.4.
A simple (real) tree is given by , where is a rooted discrete tree, and and are maps from to satisfying
[TABLE]
[TABLE]
Here and are called the birth time and death time of and is the life length of .
We will sometimes consider simple trees equipped with a measure on their boundary .
We call a reversed simple tree a triple where is a simple tree. We may sometimes omit the term “reversed” when the context is clear enough.
The restriction of to the first generations is the simple tree defined by
[TABLE]
One can check that a simple tree defines a unique real rooted tree defined as the completion of , with
[TABLE]
In particular, we have .
In this paper, we construct random simple real trees with marks along their branches. We see these trees as genealogical/phylogenetic trees and the marks as mutations that appear in the course of evolution. We will assume that each new mutation confers a new type, called allele, to its bearer (infinitely-many alleles model). Our goal is to study the properties of the clonal subtree (individuals who do not bear any mutations, black subtree in Figure 1) and of the allelic partition (the partition into bearers of distinct alleles of the population at some fixed time).
2.2 Comb Function
2.2.1 Definition
We now introduce ultrametric trees, using a construction with comb functions following Lambert and Uribe Bravo [21].
Definition 2.5.
Let and . Let also such that
[TABLE]
The pair will be called a comb function. For any real number , we define the ultrametric tree of height associated with as the real rooted tree which is the completion of , where is the skeleton of the tree, and Sk, and are defined by
[TABLE]
The set is called the origin branch of the tree.
For , we call the lineage of the subset of the tree defined as the closure of the set
[TABLE]
For one can define as the closure of the origin branch.
Remark 2.6.
One can check that is a distance which makes a real tree, and so its completion also is a real tree. Furthermore, the fact that is finite for all ensures that it is a simple tree, since the branching points in Sk are the points with . For a visual representation of the tree associated with a comb function, see Figure 2, where the skeleton is drawn in vertical segments and the dashed horizontal segments represent branching points.
Proposition 2.7.
With the same notation as in Definition 2.5, for a fixed comb function and a real number , writing for the associated real tree, the following holds. For each , there is a unique leaf such that
[TABLE]
Furthermore, the map is measurable with respect to the Borel sets of and .
Proof.
For , is defined as the closure of the origin branch . Since , the map
[TABLE]
is an isometry, and since is defined as the completion of the skeleton Sk, there is a unique isometry which extends . Therefore we define , which satisfies since is an isometry. Also is a leaf of because it is in . Indeed, since is the completion of Sk which is connected, is necessarily also connected, which means that has degree .
Now for a fixed , write for the (finite or infinite) sequence with values in
[TABLE]
defined inductively (as long as they can be defined) by and
[TABLE]
- •
If the sequence is well defined for all , then since is a comb function, we necessarily have that as .
- •
On the other hand, the sequence is finite if and only if it is defined up to an index such that either or is zero on the interval . In that case, we still define for convenience , for all .
Now it can be checked that we have
[TABLE]
and that is defined as the closure of the set
[TABLE]
Also, by definition of the sequence , the distance satisfies, for
,
[TABLE]
Therefore the following map is an isometry (and it is well defined because ).
[TABLE]
As in the case , this isometry can be extended to and we define . It is a leaf of satisfying for the same reasons as for [math].
It remains to prove that is measurable. It is enough to show that it is right-continuous, because in that case the pre-image of an open set is necessarily a countable union of right-open intervals, which is a Borel set. Now for , by taking limits along the lineages and , it is easily checked that the distance between and can be written
[TABLE]
and since is a comb function, necessarily we have
[TABLE]
Hence is right-continuous, therefore measurable. ∎
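To make the comb metric on the boundary concrete, here is a minimal sketch for a comb with finitely many teeth. It assumes the convention that the distance between the leaves indexed by $s < t$ equals twice the maximum of the comb function over $(s, t]$, which is consistent with the right-continuity argument of the proof but is stated here as an assumption, since the displayed formula is not available in this copy. The names `leaf_distance` and `is_ultrametric` are hypothetical. With only finitely many teeth, two leaves that are not separated by any tooth are at distance zero, so this is only a pseudo-distance on the index interval.

```python
from itertools import permutations
from typing import Dict, Sequence


def leaf_distance(teeth: Dict[float, float], s: float, t: float) -> float:
    """Distance between the leaves indexed by s and t.

    `teeth` maps a position on the half-line to the (positive) value of the
    comb function there; the comb function is zero at all other positions.
    Assumed convention: twice the maximum of the comb function over (min, max].
    """
    lo, hi = min(s, t), max(s, t)
    return 2.0 * max((h for pos, h in teeth.items() if lo < pos <= hi), default=0.0)


def is_ultrametric(teeth: Dict[float, float], points: Sequence[float], tol: float = 1e-12) -> bool:
    """Check d(a, c) <= max(d(a, b), d(b, c)) over all ordered triples of points."""
    return all(
        leaf_distance(teeth, a, c)
        <= max(leaf_distance(teeth, a, b), leaf_distance(teeth, b, c)) + tol
        for a, b, c in permutations(points, 3)
    )
```

With teeth `{1.0: 0.5, 2.0: 2.0, 3.0: 1.0}`, the leaves at positions 0.5 and 1.5 are separated only by the tooth of height 0.5, so their distance is 1.0, while the leaves at 0.5 and 3.5 are separated by the tooth of height 2.0 and are at distance 4.0.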
It follows from Proposition 2.7 that the Lebesgue measure on the real interval can be transported by the map to a measure on the tree , or more precisely on its boundary, that is the set of its leaves.
Definition 2.8.
With the same notation as in Definition 2.5 and Proposition 2.7, for any fixed comb function and , writing for the associated real tree, we define the measure on the boundary of as the measure
[TABLE]
which concentrates on the leaves of the tree. From now on, we always consider the tree associated with a comb function as a rooted real tree equipped with the measure on its boundary.
2.2.2 The Coalescent Point Process
Here we will consider the measured tree associated to a random comb function. Let be a positive measure on such that for all , we have
[TABLE]
and be the support of the Poisson point process on with intensity . Then we can define as the function whose graph is .
[TABLE]
Now fix such that and set
[TABLE]
Definition 2.9.
The ultrametric random tree associated to and is called coalescent point process (CPP) of intensity and height , denoted by CPP(). It is equipped with the random measure , concentrated on the leaves, which is the push-forward of the Lebesgue measure on by the map .
Note that a coalescent point process is not directly related to coalescent theory, a canonical example of which is Kingman’s coalescent [18], although there exist links between the two: it is shown in [20] that a CPP appears as a scaling limit of the genealogy of individuals having a very recent common ancestor in the Kingman coalescent.
Formally, a CPP is a random variable valued in the space of finitely measured compact metric spaces endowed with the Gromov-Hausdorff-Prokhorov distance defined in [2] as an extension of the more classical Gromov-Hausdorff distance. Actually, it is easy to check that all the random quantities we handle are measurable, since we are dealing with a construction from a Poisson point process.
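The construction of a CPP from a Poisson point process can be sketched numerically, under stated assumptions. The example below assumes the specific intensity $\nu(dx) = dx/x^2$ (the case usually associated with the Brownian tree) and keeps only the atoms of height at least a cutoff `eps`, since the full point process has infinitely many small atoms. Under this assumption the atoms higher than `eps` arrive at rate $1/\varepsilon$ along the half-line, and each height $H$ satisfies $P(H > y) = \varepsilon / y$ for $y \ge \varepsilon$. The function name and the truncation are illustrative choices, not part of the paper.

```python
import random
from typing import List, Tuple


def sample_truncated_cpp(
    z: float, eps: float, rng: random.Random
) -> Tuple[List[Tuple[float, float]], float]:
    """Sample the atoms (t, x) with eps <= x <= z located before the first atom higher than z.

    Assumes the intensity dt x (dx / x^2), truncated to heights >= eps.
    Returns (teeth, T) where T is the position of the first atom higher than z.
    """
    if not 0.0 < eps < z:
        raise ValueError("need 0 < eps < z")
    teeth: List[Tuple[float, float]] = []
    t = 0.0
    while True:
        # atoms of height >= eps form a Poisson process of rate 1/eps along the half-line
        t += rng.expovariate(1.0 / eps)
        # inverse-transform sampling: P(x > y) = eps / y for y >= eps
        x = eps / (1.0 - rng.random())
        if x > z:
            return teeth, t  # the tree of height z stops at this position
        teeth.append((t, x))
```

Under the same assumption, the total boundary mass `T` returned by the sketch is exponentially distributed with rate $1/z$, whatever the cutoff `eps`, which gives a simple sanity check when averaging over many runs.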
2.3 Mutations on a CPP
Here we set up how mutations appear on the random genealogy associated with a CPP of intensity . Let be a positive measure on . We make the following assumptions:
[TABLE]
We will now define the CPP of intensity and height marked with rate .
Recall that the CPP is constructed from the support of a Poisson point process with intensity on and has a root . Define independently for each point of the Poisson point process of intensity on the interval . Each atom of is a mark on the branch at height . The family therefore defines a point process on the skeleton of the CPP tree:
[TABLE]
By definition, conditional on Sk, is a Poisson point process on Sk whose intensity is such that for all non-negative real numbers and , we have:
[TABLE]
Definition 2.10.
Let be measures satisfying assumption (H). A coalescent point process with intensity , mutation rate and height , denoted CPP(), is defined as the random CPP() given by , equipped with the point process on its skeleton.
- (i)
The clonal subtree of the rooted real tree equipped with mutations is defined as the subset of formed by the points :
[TABLE]
Equipped with the distance induced by , this is also a real tree.
- (ii)
Given the (ultrametric) rooted real tree equipped with mutations and the application from the real interval to whose range is included in the leaves of , we can define the clonal boundary (or clonal population) :
[TABLE]
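The definitions of the marked tree and of the clonal boundary can be illustrated on a comb with finitely many teeth. The sketch below assumes that heights are measured from the leaves (height 0) up to the root (height `z`), that the mutation measure is `theta` times the Lebesgue measure, and that a leaf is clonal when its lineage carries no mutation. The lineage is decomposed as in the proof of Proposition 2.7: it climbs the nearest branch to the left, then jumps to successively higher branches further left, and ends on the origin branch. All function names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Tooth = Tuple[float, float]  # (position, height)


def lineage(teeth: List[Tooth], s: float, z: float) -> List[Tuple[float, float, float]]:
    """Segments (branch position, low, high) forming the lineage of the leaf indexed by s.

    `teeth` must be sorted by position; the origin branch is stored at position 0.0.
    """
    segments = []
    height = 0.0
    for t, x in reversed([tx for tx in teeth if tx[0] <= s]):
        if x > height:  # next branch to the left that is higher than the current height
            segments.append((t, height, x))
            height = x
    segments.append((0.0, height, z))  # remaining part, on the origin branch
    return segments


def sample_mutations(
    teeth: List[Tooth], z: float, theta: float, rng: random.Random
) -> Dict[float, List[float]]:
    """Independent Poisson marks at rate theta on every branch (assumed constant rate)."""
    marks: Dict[float, List[float]] = {}
    for t, x in [(0.0, z)] + list(teeth):
        heights, h = [], rng.expovariate(theta)
        while h < x:
            heights.append(h)
            h += rng.expovariate(theta)
        marks[t] = heights
    return marks


def is_clonal(teeth: List[Tooth], marks: Dict[float, List[float]], s: float, z: float) -> bool:
    """A leaf is clonal when no mark falls on any segment of its lineage."""
    return not any(
        low < h <= high
        for t, low, high in lineage(teeth, s, z)
        for h in marks.get(t, [])
    )
```

For example, with teeth `[(1.0, 0.5), (2.0, 2.0), (3.0, 1.0)]` and `z = 3.0`, the lineage of the leaf at 3.5 uses branch 3.0 up to height 1.0, then branch 2.0 up to height 2.0, then the origin branch up to the root. A mark at height 1.0 on the origin branch therefore leaves this leaf clonal, but not the leaf at 0.5, whose lineage is the whole origin branch.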
Remark 2.11.
This set is studied in a paper by Philippe Marchal [22] for a CPP with and mutations at branching points with probability . In that case the sets have the same distribution as the range of a -stable subordinator. In the present case of Poissonian mutations, is not stable any longer but we will see in Section 3 that it remains a regenerative set.
Total number of mutations.
Since $\mu$ is a locally finite measure on $[0, \infty)$, the number of mutations on a fixed lineage of the CPP($\nu, \mu, z$) is a Poisson random variable with parameter $\mu([0, z])$, and so is a.s. finite. However, it is possible that in a clade (here defined as the union of all lineages descending from a fixed point), there are infinitely many mutations with probability $1$. For instance, if $\mu$ is the Lebesgue measure and if $\nu$ is such that
[TABLE]
we know from the properties of Poisson point processes that the total length of any clade is a.s. infinite. In this case, the number of mutations in any clade is also a.s. infinite so that each point in the skeleton of the tree has a.s. at least one descending lineage with infinitely many mutations. Such a lineage can be displayed by choosing iteratively at each branching point a sub-clade with infinitely many mutations.
One can ask under which conditions this phenomenon occurs. Conditional on the tree of height , the total number of mutations follows a Poisson distribution with parameter
[TABLE]
where $T(z)$ is the first time $t$ such that there is a point of $\mathcal{N}$ at position $t$ with height larger than $z$. Indeed, the origin branch is of height $z$ and the heights of the other branches are the heights of points of $\mathcal{N}$. This number of mutations is finite a.s. on the event that the above parameter is finite, and infinite a.s. on its complement. But by the properties of Poisson point processes, two cases are distinguished: this event has either probability $0$ or probability $1$.
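As a small illustration of this conditional Poisson parameter, the sketch below computes it for a comb with finitely many teeth, assuming that mutations fall at a constant rate `theta` per unit length (that is, the mutation measure is `theta` times the Lebesgue measure). Under this assumption the parameter is simply `theta` times the total branch length. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def mutation_parameter(teeth: Iterable[Tuple[float, float]], z: float, theta: float) -> float:
    """Poisson parameter of the total number of mutations, given the (truncated) tree.

    The origin branch has length z and every other branch has the length of
    its tooth; mutations are assumed to occur at constant rate theta.
    """
    return theta * (z + sum(x for _, x in teeth))
```

With teeth of heights 0.5 and 2.0, `z = 3.0` and `theta = 2.0`, the total length is 5.5 and the parameter is 11.0. In a truncated simulation, lowering the cutoff on tooth heights adds more branches, and the dichotomy discussed here concerns whether this sum stays bounded as the cutoff goes to zero.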
Proposition 2.12.
There is the following dichotomy:
[TABLE]
In the former case, the total number of mutations has mean
[TABLE]
Proof.
Conditional on , the set is the support of a Poisson point process on with intensity . Therefore, from basic properties of Poisson point processes, conditional on , is finite a.s. if and only if
[TABLE]
and since is finite a.s. and is increasing, this condition is equivalent to the condition of the proposition. Now let us write for the total number of mutations. The conditional distribution of given is a Poisson distribution with mean . Therefore we deduce
[TABLE]
which concludes the proof. ∎
3 Allelic Partition at the Boundary
In this section, we will identify the clonal boundary in a mutation-equipped CPP, that is the set of leaves of the tree which do not carry mutations, and characterize the reduced subtree generated by this set.
3.1 Regenerative Set of the Clonal Lineages, Clonal CPP
Denote by a CPP() where satisfy assumptions . A leaf of is said clonal if it carries the same allele as the root. Recall the canonical map from the real interval to the leaves of (see Proposition 2.7). The clonal boundary (see Definition 2.10) of is then the set defined as the pre-image of the clonal leaves by the map .
We define the event
[TABLE]
that there is no mutation on the origin branch of . Note that this event has a positive probability equal to . By definition, the point process of mutations on the origin branch is independent of . Therefore conditioning on amounts to considering the tree equipped with the mutations on its skeleton which are given only by the point processes . We now define a random set , whose distribution depends only on and not on , which will allow the characterization of the clonal boundaries conditional on the event .
Definition 3.1.
Recall the notations and . For each fixed , let be the (possibly finite) sequence of points of such that
[TABLE]
with the convention , and where the sequence is finite if there is a such that . We define the following random point measure on :
[TABLE]
Now we define the random set as:
[TABLE]
Remark 3.2.
Recall that for a comb function and a real number , in the proof of Proposition 2.7, we defined a sequence in the same way as in the previous definition and we remarked that the lineage of is the closure of the set
[TABLE]
It follows that in the case of the tree equipped with the mutations on its skeleton, we have the equality between events
[TABLE]
Therefore, on the event , the clonal boundary of the tree coincides with the restriction of to the interval , which explains why we study the set .
The subtree of spanned by the clonal boundary is called the reduced clonal subtree and defined as
[TABLE]
Note that it is a Borel subset of because it is the closure of
[TABLE]
where is the finite set . The set is proven to be a regenerative set (see Appendix A.3 for the results used in this paper and the references concerning subordinators and regenerative sets), and the reduced clonal subtree is shown to have the law of a CPP.
Theorem 3.3.
The law of and of the associated reduced clonal subtree can be characterized as follows.
- (i)
Under the assumptions (H) and with the preceding notation the random set is regenerative. It can be described as the range of a subordinator whose Laplace exponent is given by:
[TABLE]
- (ii)
The reduced clonal subtree, that is the subtree spanned by the set , has the distribution of a CPP with intensity , where is the positive measure on determined by the following equation. Letting and , we have, for all ,
[TABLE]
Remark 3.4.
The last formula of the theorem is an extension of Proposition 3.1 in [19], where the case when is a finite measure and is treated. Here, we allow to have infinite mass and to take a more general form (provided (H) is satisfied).
Regenerative set.
Here, we prove the first part of the theorem concerning .
Proof of Theorem 3.3, (i).
Let be the natural filtration of the marked CPP defined by:
[TABLE]
To show first that is -progressively measurable, we show that for a fixed , the set
[TABLE]
is in . Basic properties of Poisson point processes ensure there exists an -measurable sequence of random variables giving the coordinates of the mutations in . Let be such a sequence, for instance ranked such that is decreasing as in Figure 4. We also define the following -measurable random variables:
[TABLE]
Now we have
[TABLE]
which proves that the random set is -progressively measurable, and almost-surely left-closed.
Let us now show the regeneration property of . Define
[TABLE]
the maximal height of atoms of between and . We will note for simplicity. Remark that
[TABLE]
Let be a -stopping time, and suppose that almost surely, , and is not isolated to the right. From elementary properties of Poisson point processes and the fact that the random variables are i.i.d, we know that the tree strictly to the right of is independent of and has the same distribution as the initial tree. Now since almost surely, we have, for all ,
[TABLE]
because , in other words there are no mutations on the lineage of that is also part of the lineage of . As a consequence,
[TABLE]
which implies that has the same distribution as and is independent of .
Therefore it is proven that has the regenerative property, so one can compute its Laplace exponent. Here we are in the simple case where has a positive Lebesgue measure, and we have in particular, for all ,
[TABLE]
The passage from the second to the third line is done integrating by parts thanks to the assumption that is continuous and that has an infinite mass. The last displayed expression is therefore the density with respect to the Lebesgue measure of the renewal measure of (see Remark A.9). This is sufficient to characterize our regenerative set, and the expression given in the Proposition is found by computing the Laplace transform of this measure:
[TABLE]
which concludes the proof of (i). ∎
Remark 3.5.
It is important to note that the particular case of a CPP with intensity $\nu(dx) = dx/x^2$ has the distribution of a (root-centered) sphere of the so-called Brownian CRT (Continuum Random Tree), the real tree whose contour is a Brownian excursion. This is shown for example by Popovic in [23], where the term ‘Continuum genealogical point process’ is used to denote what is called here a coalescent point process. The measure $\nu$ is the push-forward of the Brownian excursion measure by the application which maps an excursion to its depth. In general, the sphere of radius, say, $r$ of a totally ordered tree is an ultrametric space whose topology is characterized by the pairwise distances between ‘consecutive’ points at distance $r$ from the root. When the order of the tree is the order associated to a contour process, these distances are the depths of the ‘consecutive’ excursions of the contour process away from $r$, see e.g. Lambert and Uribe Bravo [21].
If in addition to $\nu(dx) = dx/x^2$, we assume that $\mu(dx) = \theta\,dx$, which amounts to letting Poissonian mutations occur at constant rate $\theta$ on the skeleton of the CRT, we have
[TABLE]
In particular, for all , we can compute:
[TABLE]
This implies the equality in distribution . Nevertheless is not a so-called ‘stable’ regenerative set, contrary to the sets in [22].
Reduced clonal subtree.
To show that the reduced clonal subtree is a CPP, let us exhibit the Poisson point process that generates it. Let be the subordinator with drift whose range is and let be the following point process:
[TABLE]
where . This point process generates the reduced clonal subtree, because is (up to a factor ) the tree distance between the consecutive leaves and in . To complete the proof of the theorem, it is sufficient to show that conditional on the death time of the subordinator , is a Poisson point process on with intensity .
Proof of Theorem 3.3, (ii).
This is due to the regenerative property of the process. For fixed , is a -stopping time which is almost surely in on the event . This implies that conditional on , the marked CPP strictly to the right of is equal in distribution to the original marked CPP and is independent of . In particular:
[TABLE]
This implies that has the same distribution as and is independent of . For fixed , let be the sequence of atoms of such that , ranked with increasing . Then is a -stopping time and the sequence is i.i.d., with for convenience. It is sufficient to observe that is an exponential random variable to show that has an intensity of the form :
[TABLE]
It remains to characterize the measure by computing . Note that the following computations are correct thanks to the assumption that has no atom, so that is continuous. To simplify the notation, let . Then we can compute:
[TABLE]
Letting , we have
[TABLE]
Now , hence
[TABLE]
which concludes the proof. ∎
Remark 3.6.
Equality (2) becomes, letting ,
[TABLE]
Remark 3.7.
In Remark 3.5, we explained that when the contour of a random tree is a strong Markov process as in the case of Brownian motion, the root-centered sphere of radius of this tree is a CPP. In addition, the intensity measure of this CPP is the measure of the excursion depth under the excursion measure of the contour process (away from . Let denote the excursion measure of the process away from [math], with a Brownian motion with drift , and let denote the depth of the excursion. In the case and , we have
[TABLE]
This is consistent with Proposition 4 in [1], which shows that putting Poissonian random cuts with rate along the branches of a standard Brownian CRT yields a tree whose contour process is stopped at the first return at [math], where is the normalized Brownian excursion.
3.2 Measure of the Clonal Population
Recall that for a CPP(), conditional on (no mutation on the origin branch), the Lebesgue measure is equal to the measure of the set of clonal leaves .
Corollary 3.8.
Let be two measures satisfying assumptions (H).
(i) With the notation of Theorem 3.3, the random variable follows an exponential distribution with mean .
(ii) In a CPP(), conditional on , the measure of the set of clonal leaves is an exponential random variable of mean .
Proof.
Given a subordinator with drift and range , it is known (a quick proof of this can be found in [6]) that
[TABLE]
Now the killing time of the subordinator is an exponential random variable of parameter , where is the Laplace exponent of . We already know from Remark 3.6 the mean of that variable:
[TABLE]
With a fixed height , one is interested in the law of . By the properties of Poisson point processes, stopping the CPP at amounts to changing the intensity measure of the CPP for , with
[TABLE]
Then if , we have
[TABLE]
and because of the characterization of given in Theorem 3.3, we also have . Therefore , and we can conclude that is an exponential random variable of mean . ∎
Probability of clonal leaves.
Here, we consider a CPP() and aim at computing the probability of existence of clonal leaves in the tree.
Proposition 3.9.
In a CPP(), under the assumptions (H) and with the notation of Theorem 3.3, there is a mutation-free lineage with probability
[TABLE]
Remark 3.10.
Using a description of CPP trees in terms of birth-death trees (see Section 5), the previous result could alternatively be deduced from the expression of the survival probability of a birth-death tree up to a fixed time (see Proposition A.1 in the appendix).
Proof.
Suppose the CPP() is given by the usual construction with the Poisson point processes and . We use the regenerative property of the process with respect to the natural filtration of the marked CPP defined by:
[TABLE]
Let be the first clone on the real half-line.
[TABLE]
with the convention and with the usual notation. Then is a -stopping time, and conditional on , the law of the tree on the right of is the same as that of the original tree conditioned on having no mutation on the origin branch. Let denote the event of existence of a mutation-free lineage. Recall that denotes the set of clonal leaves and that denotes the event that there is no mutation on the origin branch. Then we have
[TABLE]
where the last equality is due to Corollary 3.8 (ii). Furthermore,
[TABLE]
Therefore, the probability that there exists a clone of the origin in the present population is
[TABLE]
which concludes the proof. ∎
3.3 Application to the Allele Frequency Spectrum
3.3.1 Intensity of the Spectrum
From now on we fix two measures satisfying assumptions (H), and we further assume for simplicity that for all . We denote by a CPP().
Under the infinitely-many alleles model, recall that each mutation gives rise to a new type called allele, so that the population on the boundary of the tree can be partitioned into carriers of the same allele, called allelic partition. The key idea of this section is that expressions obtained for the clonal population of the tree allow us to gain information on quantities related to the whole allelic partition. We call a mutation if and denote by the subtree descending from . If is a functional of real trees (say simple, marked, equipped with a measure on the leaves), one might be interested in the quantity
[TABLE]
or in its expectation
[TABLE]
For each mutation , we define the set of the leaves carrying as their last mutation
[TABLE]
We define the random point measure putting mass on the measures of the different allelic clusters
[TABLE]
The intensity of the allele frequency spectrum is the mean measure of this point measure, that is the measure on such that for every Borel set of ,
[TABLE]
The analog for this measure when the number of individuals in the population is finite is the mean measure of the number of alleles carried by exactly individuals (notation in [19] and [8]). The goal here is then to identify , by noticing that for a Borel set ,
[TABLE]
with , where is an ultrametric tree with point mutations and measure supported by its leaves, and denotes the set of its clonal leaves.
Proposition 3.11.
In a CPP(), under the assumptions (H) and with the notation of Theorem 3.3, the intensity of the allele frequency spectrum has a density with respect to the Lebesgue measure:
[TABLE]
Remark 3.12.
This expression is to be compared with Corollary 4.3 in [8] (the term with discrete becoming here with continuous ).
Remark 3.13.
Integrating this expression, we get the expectation of the number of different alleles in the population:
[TABLE]
Note that is the expectation of the total mass of the measure in a CPP(). It is then natural to normalize by this quantity and then let . In (H) we assumed that , and since is an increasing, positive function of , we have clearly when . Therefore we have
[TABLE]
This provides us with a limiting spectrum intensity, written simply :
[TABLE]
Note that in the Brownian case , we get a simple expression .
Proof of Proposition 3.11.
We aim at computing , for a measurable non-negative function of a simple real tree with point mutations equipped with a measure on its leaves. Suppose the mutations on the tree are numbered by increasing distances from the root. Here we use the fact that a CPP can be seen as the genealogy of a birth-death process (see Section 5 for the development of this argument), a Markovian branching process whose time parameter is the distance from the root. This description implies that, for all , conditional on the height of mutation , the subtree growing from has the law of . Set
[TABLE]
Denoting the height of the -th mutation of , we get
[TABLE]
Now this expression is simple to compute knowing and the intensity of the point process giving mutation heights. Indeed, by elementary properties of Poisson point processes
[TABLE]
Now consider, for a fixed , the function given by , where is a generic ultrametric tree with point mutations and measure supported by its leaves, and denotes the set of its clonal leaves. This allows us to compute the expectation of the number of mutations carried by a population of leaves of measure greater than . Since the law of the measure of clonal leaves is known for a CPP (see Corollary 3.8), we deduce
[TABLE]
where again denotes the event of existence of clonal leaves in and is the set defined in Definition 3.1. Thus we have
[TABLE]
Differentiating the last quantity yields the expression in the Proposition. ∎
3.3.2 Convergence Results for Small Families
Recall the construction of a CPP from a Poisson point process in Section 2.2, and the point processes of mutations . Since a CPP() is given by the points of with first component smaller than , this construction yields a coupling of , where for each , is a CPP(). Recall the notation from the previous subsection. Then, similarly to Theorem 3.1 in [19], we have the following almost sure convergence.
Proposition 3.14.
Under the preceding assumptions, and further assuming , for any , we have the convergence:
[TABLE]
Remark 3.15.
Recall that is the number of alleles carried by a population of leaves of measure larger than in the tree , and is the total size of the population of . The result is a strong law of large numbers: it shows that the number of small families (with a fixed size) grows linearly with the total measure of the tree, at a constant speed given by the measure defined by (4) as the limiting allele frequency spectrum intensity.
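The Poisson construction of a CPP recalled at the beginning of this subsection can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes, purely for illustration, the intensity ν(dx) = dx/x² (so that ν((h, ∞)) = 1/h), and it only keeps atoms higher than a cutoff `eps`, since the number of smaller atoms is infinite. The function name `sample_cpp_heights` and its parameters are hypothetical.

```python
import random

def sample_cpp_heights(z, eps, rng):
    """Atoms higher than eps of a CPP stopped at the first height above z.

    Assumption (illustration only): nu(dx) = dx / x**2, so nu((h, inf)) = 1 / h.
    Returns (positions, heights, total_length), where total_length is the
    position of the first atom whose height exceeds z.
    """
    rate = 1.0 / eps                      # nu((eps, inf)): rate of atoms above eps
    positions, heights = [], []
    t = 0.0
    while True:
        t += rng.expovariate(rate)        # gap to the next atom higher than eps
        h = eps / (1.0 - rng.random())    # P(H > x | H > eps) = eps / x
        if h > z:
            return positions, heights, t  # the tree is cut at this atom
        positions.append(t)
        heights.append(h)
```

Under this assumed intensity, the total length returned is exponentially distributed with mean z, and the number of retained atoms grows linearly with the total length as more trees are sampled, which is the kind of behaviour the law of large numbers above formalizes.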
Proof.
We will use the law of large numbers several times. Let us first introduce some notation. For , define as the increasing sequence of first components of the atoms of with second component larger than , that is and for any
[TABLE]
For , let , that is the unique number such that
[TABLE]
Notice that the assumptions for all and imply that and as , for a fixed . Because the times are i.i.d. exponential random variables with mean and since we have
[TABLE]
it is clear by the strong law of large numbers that
[TABLE]
Also, write for the sequence of subtrees of height within that are separated by the branches higher than . That is, is the ultrametric tree generated by the points of with first component between and . From basic properties of Poisson point processes, they are i.i.d. and their distribution is that of .
Now, write for the height of an ultrametric tree (i.e., the distance between the root and any of its leaves), and take any non-negative, measurable function of simple trees, such that
[TABLE]
Recall the definition of . Since satisfies the condition introduced above, we can write
[TABLE]
Therefore, again by the strong law of large numbers, we have the following convergence
[TABLE]
Combining the two convergence results, it follows that
[TABLE]
Let us apply this to the function . This function does not satisfy the condition introduced above for any , so we cannot apply (6) directly because (5) does not hold. However, we can artificially truncate by defining the restriction :
[TABLE]
which does satisfy the condition introduced above. Now since , we have the inequality between random variables
[TABLE]
and by taking limits,
[TABLE]
But we have and as a consequence of Proposition 3.11, we have
[TABLE]
which is by definition. Therefore, we now have the inequality
[TABLE]
The converse inequality stems from a simple remark. There are at most mutations of height greater than giving rise to an allele carried by some leaves of . This is simply because a population of individuals can exhibit at most different alleles. Therefore, we have
[TABLE]
which gives by taking limits
[TABLE]
We can finally conclude
[TABLE]
which is the announced result. ∎
4 The Clonal Tree Process
In this section we consider the clonal subtree of a random tree with distribution CPP(), where are measures satisfying assumptions (H) and . We further assume , that is we ignore the case when is a finite tree almost surely. We will focus on the case when .
4.1 Clonal Tree Process
There is a natural coupling in of the Poisson processes of mutations, in such a way that the sets of mutations are increasing in for the inclusion. Let denote a Poisson point process with Lebesgue intensity on , and define for ,
[TABLE]
Then is a Poisson point process on with intensity , and the sequence of supports of increases with . Let us use this idea to couple mutations with different intensities on the random tree . Recall the construction of a CPP with a Poisson point process in Section 2. For each point of , let be a Poisson point process on with Lebesgue intensity. For fixed , we get the original construction with when considering
[TABLE]
Therefore a natural coupling of mutations of different intensities is defined on the random tree . Denote the clonal subtree of height at mutation level , that is the subtree of defined by
[TABLE]
It is natural to seek to describe the decreasing process of clonal subtrees . As increases, it is clearly a Markov process since the distribution of given is the law of the clonal tree obtained after adding mutations at a rate along the branches of . We will now study the Markovian evolution of the time-reversed process, as decreases. Its transitions are relatively simple to describe using grafts of trees.
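The coupling by thinning described at the start of this subsection can be sketched in a few lines. This is an illustrative sketch under a simplifying assumption: mutations fall on a single segment of length `length` rather than on the branches of a tree, and levels are capped at `theta_max`. The names `sample_marked_points` and `mutations_at_level` are hypothetical.

```python
import random

def sample_marked_points(length, theta_max, rng):
    """Poisson points with Lebesgue intensity on [0, length] x [0, theta_max]."""
    points = []
    x = 0.0
    while True:
        x += rng.expovariate(theta_max)   # first coordinates arrive at rate theta_max
        if x > length:
            return points
        points.append((x, rng.uniform(0.0, theta_max)))  # independent uniform level

def mutations_at_level(points, theta):
    """Keep the points whose level is at most theta: a rate-theta Poisson process."""
    return [x for (x, u) in points if u <= theta]
```

Because every level uses the same underlying points, the mutation sets are nested: each mutation present at level θ is still present at any larger level, which is what makes the clonal subtrees decrease as the mutation level increases.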
4.2 Grafts of Real Trees
Given two real rooted trees , and a graft point , one can define the real rooted tree that is the graft of the root of on at point by
[TABLE]
with the new distance defined as follows. For any ,
[TABLE]
and
[TABLE]
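For orientation, a standard way of writing such a graft is sketched below. The notation is an assumption made for this sketch and need not match the authors' displays: the two rooted real trees are written (T₁, d₁, ρ₁) and (T₂, d₂, ρ₂), the graft point is g ∈ T₁, and the symbol ⊕ is a placeholder.

```latex
% Assumed notation: rooted real trees (T_1, d_1, \rho_1) and (T_2, d_2, \rho_2),
% graft point g \in T_1; the symbol \oplus_g is a placeholder for the graft.
\[
  T_1 \oplus_g T_2 := T_1 \sqcup \bigl(T_2 \setminus \{\rho_2\}\bigr),
\]
\[
  d(x,y) :=
  \begin{cases}
    d_1(x,y) & \text{if } x, y \in T_1,\\
    d_2(x,y) & \text{if } x, y \in T_2 \setminus \{\rho_2\},\\
    d_1(x,g) + d_2(\rho_2,y) & \text{if } x \in T_1,\ y \in T_2 \setminus \{\rho_2\}.
  \end{cases}
\]
```

In this form the root of T₂ is identified with g, distances inside each tree are unchanged, and a path between the two trees must pass through the graft point.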
For real simple trees, this graft has a nice representation when the graft point is a leaf of the first tree.
Definition 4.1.
For a simple tree , define the buds of as the set of leaves of that live a finite time
[TABLE]
For two simple trees with , and for , we define the graft of on on the bud , denoted by:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
It is then clear that . See Figure 5 for an example.
4.3 Evolution of the Clonal Tree Process
We study the increasing clonal tree process as we remove mutations (decreasing ). We therefore reverse time by denoting , and defining . Denote the distribution of with values in the set of reversed (i.e., with time flowing from to 0) simple binary trees. See Figure 6 for a sketch of the tree growth process. The increasing process is nicely described in terms of grafts.
Theorem 4.2.
(i) The process is a time-inhomogeneous Markov process, whose transitions conditional on can be characterized as follows.
- The buds of are the leaves of height . Independently of the others, each bud is given an exponential clock of parameter .
- At time , a tree is grafted on the bud , following the distribution , and each newly created bud is given an independent exponential clock of parameter .

(ii) The infinitesimal generator evaluated at a function of simple trees which depends only on a finite number of generations (i.e. such that the property holds) can be written as follows
[TABLE]
where is the random tree drawn under the probability measure .

(iii) Write for the first time the clonal tree process reaches the boundary, that is, the first time there is a leaf with (where is the distance in the real tree ):
[TABLE]
Then the distribution of is given by
[TABLE]
where as previously , and
[TABLE]
that is with .
We first state a result of independent interest, which ensures that CPP trees are reversed pure-birth trees (see Section 5 for details on birth-death trees and their links with CPPs). We refer the reader to Subsection A.2, where a more general result is proved.
Lemma 4.3.
Let and be diffuse measures on , satisfying assumptions (H) and . Fix such that and let . Then for , a CPP is the genealogy of a reversed (i.e. with time flowing from to [math]) pure-birth process with birth intensity defined as the Laplace-Stieltjes measure associated with the nondecreasing function , started from .
Proof of Theorem 4.2.
From Lemma 4.3, we can express the CPP in terms of a pure-birth tree, with time flowing from to [math] (but measured from 0 to ) and birth intensity . Let denote the complete binary tree
[TABLE]
Then we can define recursively by setting , and for , with :
[TABLE]
with the convention , and where are i.i.d. Poisson point processes on with intensity . This defines the random reversed simple tree as the genealogy of a pure-birth process with birth intensity , with time flowing from to [math]. In other words, by the definition of , is the reversed simple tree with distribution CPP().
Now we define independently of , a family of i.i.d. Poisson point processes on with Lebesgue intensity. Writing for and ,
[TABLE]
we define a coupling of point processes with intensity on the branches of .
Now let us define the process by , with
[TABLE]
By definition, one can check that is the clonal simple tree associated with the tree and the point process of mutations . Therefore has the same distribution as . We define the filtration as the natural filtration of the process , which we may rewrite:
[TABLE]
From our definitions, for , we have:
[TABLE]
and since and are independent Poisson point processes, it is known that conditional on , we have: and are independent Poisson point processes, with intensity Lebesgue for and for , on their respective domains.
We can further notice that on the event , conditional on , the families of point processes
[TABLE]
are independent families of independent Poisson point process with intensity Lebesgue for and for , on their respective domains.
Also, since and are independent and with diffuse intensities, we have the a.s. equalities between events
[TABLE]
Moreover, since is a Poisson point process with Lebesgue intensity on , it is known that on this event, conditional on , the point process restricted to has the conditional distribution of:
[TABLE]
where is a uniform random variable on and is an independent Poisson point process on with Lebesgue intensity. Hence on the event , the distribution of
[TABLE]
is given by
[TABLE]
And so if is a bud of , the first time such that is lower than satisfies that has an exponential distribution with parameter .
We may now prove the first point (i) of the theorem. Fix , and write for the distinct buds of . We define, for and :
[TABLE]
This definition formulates that for , is the unique simple tree such that for another simple tree in which is a bud, with . Note that when writing , may be different from , even for arbitrarily close to , since other grafts may have occurred (possibly infinitely many grafts if has infinitely many buds).
Since are the buds of , the sets are disjoint. Thus, from our construction, the following families of random variables are independent conditional on :
[TABLE]
Furthermore, we know how to describe their distributions conditional on because of the previous observations. It follows that the trees are independent conditional on and the distribution of can be described by:
There is a random variable such that:
- is exponentially distributed with parameter .
- For , we have so is the empty tree (or rather contains only one point, the root).
- Conditionally on , the process is distributed as our construction of the process , with the initial condition .
This concludes the proof of (i).
For (ii), write for the set of simple binary trees and suppose we have a bounded measurable map and a number such that
[TABLE]
Consider a fixed tree . There is a finite number of buds in the first generations , therefore for a fixed , conditional on , the process is a continuous time Markov chain. It follows from (i) that this Markov chain jumps after an exponential time with parameter to a new state where one of the buds, uniformly chosen, grows into a new tree. That is, denoting the infinitesimal generator of the process ,
[TABLE]
where is the random tree drawn under the probability measure .
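The jump mechanism used in the proof of (ii) rests on a classical fact about competing exponential clocks: among finitely many independent clocks with a common rate, the first to ring is uniformly distributed among them, and the ringing time is exponential with the summed rate. A minimal sketch follows; the constant `rate` is a hypothetical stand-in for the clock parameter of the theorem, and `first_ringing_bud` is an illustrative name.

```python
import random

def first_ringing_bud(n_buds, rate, rng):
    """Return (time, index) of the first of n_buds independent Exp(rate) clocks."""
    clocks = [rng.expovariate(rate) for _ in range(n_buds)]
    index = min(range(n_buds), key=clocks.__getitem__)  # bud whose clock rings first
    return clocks[index], index
```

With four buds and rate 0.5, for instance, the empirical mean of the first ringing time should be close to 1/(4 × 0.5) = 0.5, and each bud should be selected about a quarter of the time.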
For (iii), note that the existence of a leaf in the clonal subtree at a distance from the root coincides a.s. with the existence of a clonal leaf in , where is the original CPP() with mutation measure . Then the formula in (iii) follows from Proposition 3.9, which gives the probability that there is a clonal leaf in a CPP. ∎
The branching random walk of the buds.
Forgetting the structure of the tree and considering only the heights of the buds, the process becomes a rather simple branching random walk. Write for the point measure on giving the heights of the buds in . Then is a branching Markov process where each particle stays at its height during its lifetime (an exponential time of parameter ), then splits at its death time according to the distribution of . Similarly to the preceding paragraph, one can describe the infinitesimal generator of this process as follows. For a map that is zero in a neighborhood of [math] and a Radon point measure on (i.e. such that ), write for the sum
[TABLE]
Then the infinitesimal generator at time of the time-inhomogeneous process , evaluated at , is given by
[TABLE]
5 Link between CPP and Birth-Death Trees
5.1 Birth-Death Processes
An additional well-known example of a random tree is given by the genealogy of a birth-death process, which will appear as an alternative description of our CPP trees. Here, a birth-death process is a time-inhomogeneous, continuous-time Markovian branching process living in with jumps in . In a general context, we will define the genealogy of a birth-death process as a random simple tree, which we may equip with a canonical limiting measure on the set of its infinite lineages.
Let be a real interval, with . Suppose there are two measures on , and , respectively called the birth intensity measure and death intensity measure, or simply birth rate and death rate, which satisfy for all
[TABLE]
In other words, and are diffuse Radon measures on .
Informally, the population starts with one individual at time , and each individual alive at time may give birth to a new individual at rate , and die at rate .
Definition 5.1.
Let be a real interval, with , and and measures on satisfying (7). Independently for each , we define and two independent point processes, such that (resp. ) is a Poisson point process on with intensity (resp. ).
The genealogy of a birth-death process started from is the random binary simple tree defined recursively by:
1. , with .
2. For each , we set , and . Then there are three different possibilities:
- if , then we set , and ,
- if , then we set , and ,
- if , then we set , and .
Birth-death processes have been known for a long time. They have been studied thoroughly as early as 1948 [17]. In the case of pure-birth processes with infinite descendance, we introduce a canonical measure on the boundary of the tree.
Definition 5.2.
Under the assumption and , the tree is said to be the genealogy of a pure-birth process. It may then be equipped with a measure on its boundary defined by
[TABLE]
where is defined as in Definition 2.4, and is the number of descendants of at time :
[TABLE]
Remark 5.3.
The limits in the definition are well-defined because for each , conditional on , the process is a non-negative martingale. Also, the fact that the map is additive combined with Remark 2.2 justifies that the measure is well defined.
Finally, let us introduce random mutations on a birth-death tree as a random discrete set of points.
Definition 5.4.
Let be a diffuse Radon measure on , and let denote the counting measure on . A birth-death tree may be equipped with a set of neutral mutations at rate by defining, independently of the preceding construction, a Poisson point process on with intensity , and then defining:
[TABLE]
This point process is then a discrete subset of the skeleton of the real tree (defined as in (1)) associated with .
Example. The Yule tree is the genealogy of a pure-birth process with and a birth rate equal to the Lebesgue measure, which means that the lengths of the branches separating two branching points are i.i.d. exponential random variables with parameter . Every pure-birth tree with can be time-changed into a Yule tree, with the time-change (see Proposition A.3).
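The Yule example can be checked by simulation. The sketch below assumes a unit birth rate and tracks only the population size, not the tree itself; `yule_population` is an illustrative name. It relies on the classical fact that the size at time t of a unit-rate Yule process started from one individual is geometric with success probability e^(-t), hence has mean e^t.

```python
import math
import random

def yule_population(t, rng):
    """Number of individuals at time t in a Yule process with unit birth rate."""
    n, clock = 1, 0.0
    while True:
        clock += rng.expovariate(n)   # with n individuals, the next split occurs at rate n
        if clock > t:
            return n
        n += 1
```

Rescaling the population size by e^(-t) and letting t grow gives, in the limit, an exponential random variable with mean 1, which is the mechanism behind the boundary measure of Definition 5.2 in this special case.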
5.2 Link between CPP and Supercritical Birth-Death Trees
We first provide a refined version of Lemma 4.3 which is proved in Subsection A.2.
Lemma 5.5.
Under the assumptions of Lemma 4.3, the CPP with boundary measured by is the genealogy of a reversed pure-birth process with birth intensity started from , with boundary measured by .
Let be a real interval, with , and let and be diffuse Radon measures on , i.e. measures satisfying (7). Consider a birth-death process started from with birth rate and death rate . Let us define
[TABLE]
In a birth-death process with , we say that an individual alive at time has an infinite progeny if for any time . It is known (see [17]) that the process is supercritical (i.e., the event has positive probability) if and only if , and that the probability of non-extinction for a process started at time is then . Also, if the birth-death process with rates is supercritical, then conditional on non-extinction, the subtree of individuals with infinite progeny is a pure-birth tree with birth rate .
Now we assume Poissonian neutral mutations are set on the genealogy of a supercritical birth-death process, according to a rate , where is a diffuse Radon measure on . We also assume so that conditional on non-extinction. Conditional on non-extinction, the subtree of individuals with infinite progeny is a measured simple tree equipped with mutations , where:
- is a random simple binary tree constructed (see Definition 5.2) from a pure-birth process with birth rate .
- With a Poisson point process on with intensity , the mutations on the branches of are defined as the set
[TABLE]
One may study this measured tree with mutations as the limit in time of the genealogy of the birth-death process with neutral mutations. We show that this measured tree with mutations is in fact a time-changed CPP tree.
Theorem 5.6.
Let be a real interval, with , and let and be diffuse Radon measures on , with . Let be a random measured simple tree representing the genealogy of a pure-birth process with rate started from , equipped with mutations at rate . Let be the time-change defined by
[TABLE]
Then the time-changed tree (see Proposition A.3) has the distribution of a CPP.
Proof.
Thanks to Lemma 5.5, we only need to exhibit a correct time change to prove the Theorem. We know that a time-changed birth-death tree is still a birth-death tree: this is explicitly stated in Proposition A.3 in the appendix. This implies here that the time-changed tree is a (reversed) pure-birth process with birth rate , started from , and equipped with mutations with rate . Let us first check that . Since is diffuse, is continuous decreasing, so for all , we have , where is the right-continuous inverse of . Therefore we have, for all :
[TABLE]
Now notice that for ,
[TABLE]
so according to Lemma 5.5, a CPP is a pure-birth process with birth rate , started from and equipped with mutations at rate . Therefore its distribution is identical to the distribution of . ∎
Acknowledgements.
The authors thank the Center for Interdisciplinary Research in Biology (Collège de France) for funding.
Appendix A Appendix
A.1 Birth-Death Processes
Proposition A.1.
Let be a real interval, with , and and diffuse Radon measures on (i.e. satisfying (7)). Let denote the distribution of the genealogy of a birth-death process started with one individual at time , and let be the number of individuals alive at time . For and , we have:
[TABLE]
and in particular,
[TABLE]
Remark A.2.
Note that the previous proposition shows that conditional on being non-zero, is a geometric random variable, which is a known fact about birth-death processes (see for instance [17]). We still provide a proof in our case where the birth and death intensity measures are not necessarily absolutely continuous with respect to Lebesgue.
Proof.
With a fixed time horizon and a fixed real number , write for ,
[TABLE]
We use a different description of the birth-death process than the one used in Section 5, and consider a population where individuals die at rate , and during their lifetime, produce a new individual at rate . Notice that for any , the number of individuals alive at time has the same distribution in both models.
Thus we write for the death time of the first individual, and for the possible birth time of her -th child. With our description, has the distribution of the first atom of a Poisson point process on with intensity and conditional on , the set is a Poisson point process on with intensity . Also, write for the number of alive descendants of the -th child at time . Since we have , we have
[TABLE]
where we define by convention if . Now conditional on and , are independent, with equal to the distribution of under . Hence
[TABLE]
where we use the convention if . Now conditional on , are the atoms of a Poisson point process with intensity on , so we have
[TABLE]
which implies by differentiation
[TABLE]
which in turn may be rewritten
[TABLE]
Note that with , we have , so that
[TABLE]
and since , we have by integration on :
[TABLE]
that is
[TABLE]
This characterizes the distribution of under for all . In particular, letting , we get
[TABLE]
which concludes the proof. ∎
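In the special case of constant rates, the survival probability can be checked by simulation. The sketch below is only an illustration under that assumption: the birth rate `b` and death rate `d` are constants (whereas the proposition allows general diffuse intensity measures), and it uses the classical closed form (b − d) / (b − d·e^(−(b−d)t)) for the probability of being alive at time t when b ≠ d. The function names are hypothetical.

```python
import math
import random

def birth_death_population(t, b, d, rng):
    """Population size at time t, starting from one individual (constant rates)."""
    n, clock = 1, 0.0
    while n > 0:
        clock += rng.expovariate(n * (b + d))   # next event among n individuals
        if clock > t:
            break
        n += 1 if rng.random() < b / (b + d) else -1   # birth or death
    return n

def survival_probability(t, b, d):
    """Classical survival probability at time t for constant rates b != d."""
    r = b - d
    return r / (b - d * math.exp(-r * t))
```

Letting t tend to infinity in the closed form gives 1 − d/b when b > d, the usual non-extinction probability of a supercritical linear birth-death process.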
Proposition A.3 (Time-changed birth-death processes).
Let be a real interval, with , and , , and diffuse Radon measures on (i.e. satisfying (7)). Let be an increasing function, and define , and . We assume that satisfies
[TABLE]
Let be the genealogy of a birth-death process, started at and equipped with Poissonian mutations with rate , as in Definition 5.4. We define the time-changed simple tree:
[TABLE]
If and (the push-forwards of and by ) still have no atoms, then has the distribution of the genealogy of a birth-death process, started at and equipped with Poissonian mutations with rate .
Also, if and , then and , and the measures and on , defined for and for , are the same.
Proof.
Suppose is constructed as in Definition 5.1 with independent Poisson point processes and with respective intensities and , for each . This implies that the random sets defined by
[TABLE]
are independent Poisson point processes on the interval with respective intensities and . Note that by assumption, for , for all , we have , so we a.s. have and . Now since is independent of and , we also have a.s.
[TABLE]
By definition, we have and , so . Then, if , with , and , the following assertions hold.
- Since we have (8), we know that a.s. for all , we have . This ensures that .
- For the same reason, we have .
- Because is independent of and because and are diffuse by assumption, we have almost surely. Therefore, we have:
  - , which implies , and ,
  - , which implies , and ,
  - , which implies , and .
Thus is defined as a birth-death process, started at .
For the neutral mutations, we assume there is, as in Definition 5.4, a Poisson point process on with intensity , and such that:
[TABLE]
Now is a Poisson point process on with intensity , so
[TABLE]
is the definition of random neutral mutations at rate on the tree .
It remains to prove that in the case and , the measures and are the same. By definition, we have for ,
[TABLE]
where is the number of descendants of in the time-changed tree at time . But we have a.s. for all , , and also , so finally
[TABLE]
which ends the proof. ∎
Proposition A.4 (Characterization of pure-birth processes).
Let be a real interval, with , and a diffuse Radon measure on , such that .
There is a unique family of distributions on simple trees equipped with a measure on , such that for all
(i) and -almost surely.
(ii) .
(iii) Under , is an exponential r.v. with mean .
(iv) Under , define for , , , the measure on such that for all and finally . Then the conditional distribution of the pair of trees given is , i.e. they are independent with the same distribution .
Furthermore, for all , is the distribution of the genealogy of a pure-birth process with birth rate started with one individual at time , equipped with the measure on introduced in Definition 5.2.
Proof.
Let be the law of the genealogy of a pure-birth process started from . We will first show that the family satisfies the assertions (i)-(iv) of the proposition.
(i) By definition . Also, the fact that for all , , implies that for each Poisson point process with intensity on , there are infinitely many points in . This implies that each individual in the process will eventually split into two, so that -almost surely.
(ii) Under , is distributed as the first point of a Poisson point process on with intensity . Therefore,
[TABLE]
(iii) By Proposition A.1, writing for the expectation under , we have for ,
[TABLE]
Replacing by and letting , we have by dominated convergence:
[TABLE]
which implies that is an exponential random variable with mean .
(iv) Let us define a family of independent Poisson point processes on with intensity . Let us write for the deterministic function such that for all , is the simple tree constructed as in Definition 5.1, which follows the distribution . By assumption, the two families and are independent, and by construction, we have
[TABLE]
where and are defined as in the statement of the Proposition. Therefore, under , the conditional distribution of given is .
Now, let us show that if a family satisfies the assertions (i)-(iv) of the Proposition, it satisfies also the following one. Let be the complete binary tree with generations
[TABLE]
and let be the distribution of , where has distribution . Now we view as a probability measure on the space . Then we have:
1. -almost surely.
2. For all and , conditional on and independently of the variables , the distribution of is given by:
[TABLE]
3. For all , conditional on and independently of the rest, is defined as an exponential random variable with mean .
4. For all , .
5. For all , .
Indeed, assertion 1 is directly deduced from (i), 5 is trivial because is additive, and 2, 3 and 4 are proved by induction on using (iv). One can check that 2 stems from (ii) and (iv), 3 from (iii) and (iv), and 4 from (i) and (iv).
Now it is clear that these five assertions uniquely determine for and . Also, a measured simple tree for which is entirely described by . This implies that is uniquely determined by its marginal distribution .
Finally, we have shown that the family , where is the law of the genealogy of a pure-birth process started from , satisfies assertions (i)-(iv). In addition, we have shown that there is at most one family of simple tree distributions satisfying assertions (i)-(iv). Therefore, such a family exists and is unique, which concludes the proof.
∎
A.2 Proof of Lemmas 4.3 and 5.5
Let us write for the distribution of a CPP. Let be a Poisson point process with intensity as in our construction of CPP trees. Recall that and define
[TABLE]
Define also as the comb function tree given by with distribution denoted . Write for the distribution of the pair .
In Proposition A.4, we characterized the distributions of pure-birth processes. As a result, to conclude the present proof, it is sufficient to show that the family satisfies the following conditions:
(i) We have and -almost surely.
(ii) We have .
(iii) Under , is an exponential r.v. with mean .
(iv) Under , define for , , , the measure on such that for all and finally . Then the conditional distribution of the pair of trees given is , i.e. they are independent with the same distribution .
Let us now prove each assertion.
(i) Since we have a.s. for any :
[TABLE]
Also, since is diffuse, we have a.s. for all that Those two conditions imply that is a complete binary tree.
(ii) – (iii) The first branching point of the tree is Also the total mass of the tree is , which is an exponential random variable with mean . We can easily compute the distribution of under , since conditional on , is a Poisson point process on with intensity . Therefore, for :
[TABLE]
(iv) It remains to prove the branching property for the family .
Under , conditional on , let and be independent random variables of identical distribution . We concatenate and , adding a point of height between the two sets:
[TABLE]
We claim that the following equality in distribution holds:
[TABLE]
which formulates the branching property for the family .
From basic properties of Poisson point processes, we know that conditional on , the highest atom of is , with having a uniform distribution on and independent of , such that
[TABLE]
The joint distribution of is therefore given by:
[TABLE]
In other words, the random variable has a density , and conditional on , follows a Gamma distribution with parameter . As is uniform on and independent of , one can check that has the same distribution as , where conditional on , the variables and are independent with the same exponential distribution with parameter . This concludes the proof of (9) since conditional on (resp. ), (resp. ) is a Poisson point process on (resp. on ) with intensity .
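The distributional identity invoked in the last step is a standard fact about Gamma and uniform variables. A short derivation is sketched below in notation assumed here purely for illustration: $S$ stands for the Gamma variable (taken to have shape 2 and rate $\theta$), $U$ for the independent uniform variable on $(0,1)$, and $(X, Y)$ for the two pieces obtained by splitting $S$ at the fraction $U$. These names are not taken from the paper.

```latex
% Assumed notation: S ~ Gamma(shape 2, rate \theta), U ~ Uniform(0,1), S and U independent.
\begin{align*}
f_{S,U}(s,u) &= \theta^2 s\, e^{-\theta s}, \qquad s > 0,\ u \in (0,1),\\
(X, Y) &:= \bigl(US,\ (1-U)S\bigr), \qquad\text{so that}\quad s = x + y,\quad u = \frac{x}{x+y},\\
\left|\frac{\partial(s,u)}{\partial(x,y)}\right| &= \frac{1}{x+y},\\
f_{X,Y}(x,y) &= \theta^2 (x+y)\, e^{-\theta (x+y)} \cdot \frac{1}{x+y}
             = \theta e^{-\theta x} \cdot \theta e^{-\theta y}, \qquad x, y > 0.
\end{align*}
```

The joint density factorizes, so under these assumptions $X$ and $Y$ are independent exponential variables with the same rate $\theta$, which is the form of the claim used in the proof.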
A.3 Subordinators and Regenerative Sets
We use some classical results about regenerative sets and subordinators, whose proofs can be found in the first two sections of Bertoin’s Saint-Flour lecture notes [6].
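Definition A.5.
A subordinator is a right-continuous, increasing Markov process $(\sigma_t)_{t \ge 0}$ started from $0$ with values in $[0, \infty]$, where $\infty$ is an absorbing state, such that for all $s, t \ge 0$, conditional on $\{\sigma_t < \infty\}$, we have
$$\sigma_{t+s} - \sigma_t \overset{(d)}{=} \sigma_s.$$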
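Theorem A.6.
The distribution of a subordinator $\sigma$ is characterized by its Laplace exponent, defined as the increasing function $\phi : [0, \infty) \to [0, \infty)$ such that for all $\lambda, t \ge 0$,
$$\mathbb{E}\left[e^{-\lambda \sigma_t}\right] = e^{-t \phi(\lambda)},$$
with the convention $e^{-\lambda \cdot \infty} = 0$ for all $\lambda \ge 0$. The Laplace exponent can be written in the form
$$\phi(\lambda) = k + d\lambda + \int_{(0,\infty)} \left(1 - e^{-\lambda x}\right) \Pi(dx),$$
where $k$ is called the killing rate, $d$ the drift coefficient and $\Pi$ the Lévy measure of the subordinator. Necessarily, we have $k, d \ge 0$ and $\Pi$ satisfies
$$\int_{(0,\infty)} (1 \wedge x)\, \Pi(dx) < \infty.$$
Letting $\zeta := \inf\{t \ge 0 : \sigma_t = \infty\}$ be the lifetime of the subordinator, $\zeta$ follows an exponential distribution with parameter $k$ (if $k = 0$, then $\zeta = \infty$ a.s.). Also, we have almost surely for all $0 \le t < \zeta$,
$$\sigma_t = dt + \sum_{s \le t} \Delta\sigma_s,$$
and the set of jumps $\{(s, \Delta\sigma_s) : s < \zeta,\ \Delta\sigma_s > 0\}$ is a Poisson point process with intensity $ds \otimes \Pi(dx)$.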
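As an illustration of the Laplace-exponent characterization, the following minimal sketch simulates one particular subordinator and compares a Monte Carlo estimate of the Laplace transform of its value at time $t$ with the exponential formula. Everything specific here is an assumption made for the example and not taken from the paper: the notation ($\phi$ for the Laplace exponent, $k$ for the killing rate, $d$ for the drift), the choice of Lévy measure $c\,e^{-x}\,dx$ (a compound Poisson process with jump rate $c$ and standard exponential jumps), and the function names.

```python
import math
import random


def laplace_exponent(lam, k, d, c):
    """Laplace exponent for the assumed Levy measure c * exp(-x) dx.

    The integral of (1 - exp(-lam * x)) against c * exp(-x) dx equals
    c * lam / (1 + lam), so phi(lam) = k + d * lam + c * lam / (1 + lam).
    """
    return k + d * lam + c * lam / (1.0 + lam)


def sample_sigma(t, k, d, c, rng):
    """Sample the subordinator at time t; math.inf means it was killed by time t."""
    # Killing: the lifetime is exponential with rate k (no killing when k == 0).
    if k > 0 and rng.expovariate(k) <= t:
        return math.inf
    total = d * t  # deterministic drift part
    # Jump times form a Poisson process with rate c (c must be positive);
    # each jump size is a standard exponential variable.
    clock = rng.expovariate(c)
    while clock <= t:
        total += rng.expovariate(1.0)
        clock += rng.expovariate(c)
    return total


def empirical_laplace(lam, t, k, d, c, n, seed=0):
    """Monte Carlo estimate of E[exp(-lam * sigma_t)], with exp(-lam * inf) := 0."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        s = sample_sigma(t, k, d, c, rng)
        acc += 0.0 if math.isinf(s) else math.exp(-lam * s)
    return acc / n


if __name__ == "__main__":
    lam, t, k, d, c = 1.0, 1.0, 0.5, 0.3, 2.0
    print("Monte Carlo :", empirical_laplace(lam, t, k, d, c, n=100_000, seed=1))
    print("exp(-t*phi) :", math.exp(-t * laplace_exponent(lam, k, d, c)))
```

The two printed values should agree up to Monte Carlo error. The killed paths contribute zero to the average, which is how the convention for the absorbing state enters the computation.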
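The renewal measure of a subordinator $\sigma$ with lifetime $\zeta$ is defined as the measure $U$ on $[0, \infty)$ such that for any non-negative measurable function $f$,
$$\int_{[0,\infty)} f(x)\, U(dx) = \mathbb{E}\left[\int_0^{\zeta} f(\sigma_t)\, dt\right].$$
This renewal measure characterizes the distribution of $\sigma$ since its Laplace transform is the inverse of the Laplace exponent $\phi$:
$$\int_{[0,\infty)} e^{-\lambda x}\, U(dx) = \frac{1}{\phi(\lambda)}.$$
Remark also that setting $L_x := \inf\{t \ge 0 : \sigma_t > x\}$, the right-continuous inverse of $\sigma$, we have
$$U([0, x]) = \mathbb{E}[L_x].$$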
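The relation between the renewal measure and the Laplace exponent follows in one line from the definitions. The derivation below is a sketch in notation assumed here for illustration ($\sigma$ for the subordinator, $\zeta$ for its lifetime, $\phi$ for its Laplace exponent, $U$ for the renewal measure), for $\lambda > 0$ with $\phi(\lambda) > 0$.

```latex
% Assumed notation: \sigma subordinator, \zeta lifetime, \phi Laplace exponent, U renewal measure.
\begin{align*}
\int_{[0,\infty)} e^{-\lambda x}\, U(dx)
  &= \mathbb{E}\left[\int_0^{\zeta} e^{-\lambda \sigma_t}\, dt\right]
     && \text{(definition of $U$ with $f(x) = e^{-\lambda x}$)}\\
  &= \int_0^{\infty} \mathbb{E}\left[e^{-\lambda \sigma_t}\right] dt
     && \text{(Tonelli, with $e^{-\lambda \cdot \infty} = 0$ for $t \ge \zeta$)}\\
  &= \int_0^{\infty} e^{-t \phi(\lambda)}\, dt
   = \frac{1}{\phi(\lambda)}.
\end{align*}
```

Since a measure on $[0, \infty)$ is determined by its Laplace transform, this is why knowing $U$ is equivalent to knowing $\phi$, and hence the law of the subordinator.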
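Definition A.7.
Given a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ equipped with a complete, right-continuous filtration $(\mathcal{F}_t)_{t \ge 0}$, a regenerative set $\mathcal{R}$ is a random closed set containing $0$ for which the following properties hold:
- Progressive measurability. For all $t \ge 0$, the set $\{(s, \omega) \in [0, t] \times \Omega : s \in \mathcal{R}(\omega)\}$ is in $\mathcal{B}([0, t]) \otimes \mathcal{F}_t$.
- Regeneration property. For $T$ a $(\mathcal{F}_t)$-stopping time such that a.s. on $\{T < \infty\}$, $T \in \mathcal{R}$ and $T$ is not right-isolated in $\mathcal{R}$, we have:
$$\mathcal{L}\big((\mathcal{R} - T) \cap [0, \infty) \,\big|\, \mathcal{F}_T,\ T < \infty\big) = \mathcal{L}(\mathcal{R}),$$
where $\mathcal{R} - T$ is defined formally as the set $\{t - T : t \in \mathcal{R}\}$.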
We define the range of a subordinator $\sigma$ as the closed set $\mathcal{R} := \overline{\{\sigma_t : t \ge 0,\ \sigma_t < \infty\}}$, and see that all regenerative sets can be expressed in this form.
Theorem A.8.
The range of a subordinator is a regenerative set. Conversely, if $\mathcal{R}$ is a regenerative set without isolated points, there exists a subordinator whose range is $\mathcal{R}$ almost surely.
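Remark A.9.
In the case where $\mathrm{Leb}(\mathcal{R}) > 0$ a.s., one can define such a subordinator as
$$\sigma_t := \inf\{x \ge 0 : \mathrm{Leb}(\mathcal{R} \cap [0, x]) > t\}.$$
Then $\sigma$ is the unique subordinator with drift $1$ and range $\mathcal{R}$, and its renewal measure is $U(dx) = \mathbb{P}(x \in \mathcal{R})\, dx$. Notice that $\mathrm{Leb}(\mathcal{R}) = \inf\{t \ge 0 : \sigma_t = \infty\}$ by definition. Therefore $\mathrm{Leb}(\mathcal{R})$ is an exponential random variable with parameter $k$, the killing rate of $\sigma$.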
References
- [1] R. Abraham and L. Serlet. Poisson snake and fragmentation. Electron. J. Probab., 7, paper no. 17, 2002. doi: 10.1214/EJP.v7-116.
- [2] R. Abraham, J.-F. Delmas, and P. Hoscheit. A note on the Gromov-Hausdorff-Prokhorov distance between (locally) compact metric measure spaces. Electron. J. Probab., 18, paper no. 14, 2013. doi: 10.1214/EJP.v18-2116.
- [3] D. Aldous and J. Pitman. The standard additive coalescent. Ann. Probab., 26(4):1703–1726, Oct. 1998. doi: 10.1214/aop/1022855879.
- [4] A.-L. Basdevant and C. Goldschmidt. Asymptotics of the allele frequency spectrum associated with the Bolthausen-Sznitman coalescent. Electron. J. Probab., 13:486–512, paper no. 17, 2008. doi: 10.1214/EJP.v13-494.
- [5] J. Berestycki, N. Berestycki, and V. Limic. Asymptotic sampling formulae for Λ-coalescents. Ann. Inst. H. Poincaré Probab. Statist., 50(3):715–731, Aug. 2014. doi: 10.1214/13-AIHP546.
- [6] J. Bertoin. Subordinators: examples and applications. In Lectures on probability theory and statistics: École d’Été de Probabilités de Saint-Flour XXVII - 1997, pages 1–91. Springer, 1997. doi: 10.1007/978-3-540-48115-7_1.
- [7] J. Bertoin. The structure of the allelic partition of the total population for Galton–Watson processes with neutral mutations. Ann. Probab., 37(4):1502–1523, July 2009. doi: 10.1214/08-AOP441.
- [8] N. Champagnat and A. Lambert. Splitting trees with neutral Poissonian mutations I: Small families. Stoch. Proc. Appl., 122(3):1003–1033, 2012. doi: 10.1016/j.spa.2011.11.002.
