The minimal observable clade size of exchangeable coalescents
Fabian Freund, Arno Siri-Jégousse

TL;DR
This paper investigates the asymptotic behavior and moments of the minimal observable clade size in exchangeable coalescent models with mutations, providing insights relevant for genetic data analysis and model selection.
Contribution
It introduces asymptotic results and recursive formulas for the moments of the minimal observable clade size in $\Lambda$-coalescents with mutations. This quantity is an upper bound for the minimal clade size, which itself cannot be observed in real data.
Findings
Asymptotic behavior of $O_n$ as $n \to \infty$ derived.
Recursive formulas for all moments of $O_n$ established.
$O_n$ provides an upper bound for the minimal clade size.
Abstract
For $n$-$\Lambda$-coalescents with mutation, we analyse the size $O_n$ of the partition block of $i \in \{1,\ldots,n\}$ at the time where the first mutation appears on the tree that affects $i$ and is shared with any other $j \in \{1,\ldots,n\}$. We provide asymptotics of $O_n$ for $n \to \infty$ and a recursion for all moments of $O_n$ for finite $n$. This variable gives an upper bound for the minimal clade size [2], which is not observable in real data. In applications to genetics, it has been shown to be useful to lower classification errors in genealogical model selection [10].
The minimal observable clade size of exchangeable coalescents
Fabian Freund
Crop Plant Biodiversity and Breeding Informatics Group (350b), Institute of Plant Breeding, Seed Science and Population Genetics, University of Hohenheim, Fruwirthstrasse 21, 70599 Stuttgart, Germany
and
Arno Siri-Jégousse
Departamento de Probabilidad y Estadística, IIMAS, Universidad Nacional Autónoma de México, Mexico City, Mexico.
Key words and phrases:
clade size, $n$-$\Lambda$-coalescent, recursion
2010 Mathematics Subject Classification:
Primary 60C05; Secondary 92D20, 60F15, 60G09
1. Introduction
The potential for adaptation of organisms to diverse environments is based on their genetic diversity. Moreover, the specific historic pattern of adaptation and demography leaves distinct marks in the genetic diversity of a sample taken from a population of said organisms. When observing the genetic diversity of a single non-recombining part of the genome, the diversity can be described by the inheritance pattern of the mutations on the genealogical tree of the sample, usually given by a Poisson point process on the genealogy. Modelling the genealogy is thus an important aspect of modelling genetic diversity. Usually, the exact genealogy is not known and cannot be reconstructed perfectly from observed genetic data (an example is provided later). Thus, genealogy models are usually defined as random variables on the set of possible genealogical trees.
Here, we are concerned with the genealogical tree of $n$ alleles, i.e. the genetic information of a sample of size $n$ from a genomic region. Coalescent theory provides a rich class of genealogical tree models for a sample of alleles as well as elegant tools and a convenient setting for statistical inference. In particular, Kingman’s coalescent [13] and the larger family of coalescents with multiple collisions [18, 20] were widely studied in the past decades. This class of Markov processes on the set of partitions of $[n] = \{1,\ldots,n\}$ is characterized by a finite measure $\Lambda$ on $[0,1]$, justifying their name of $n$-$\Lambda$-coalescents. If a $n$-$\Lambda$-coalescent has $b$ blocks, any given $k$-tuple of them will merge at rate
$$\lambda_{b,k} = \int_0^1 x^{k-2}(1-x)^{b-k}\,\Lambda(dx), \qquad 2 \le k \le b,$$
and the rate for the next coalescence event is
$$\lambda_b = \sum_{k=2}^{b} \binom{b}{k} \lambda_{b,k}.$$
The starting partition is $\{\{1\},\ldots,\{n\}\}$. The genealogical tree is recovered from the partition-valued process by first starting $n$ branches from the leaves $1,\ldots,n$. Then, any merger of partition blocks corresponds to a joining of branches in a node (ancestor), where a single new branch starts. Partition blocks correspond to branches in the genealogical tree; the time a partition block is not merged gives the length of this branch. We refer to [11] for a survey.
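The merger rates can be evaluated in closed form for the standard examples. The following is a minimal sketch, assuming $\Lambda$ is the Beta($2-\alpha$,$\alpha$) distribution (so that $\lambda_{b,k} = B(k-\alpha, b-k+\alpha)/B(2-\alpha,\alpha)$ with $B$ the Beta function) or $\Lambda = \delta_0$ for Kingman's coalescent. The function names are illustrative and not taken from the paper.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_coalescent_rate(b, k, alpha):
    """lambda_{b,k} for Lambda = Beta(2 - alpha, alpha), with 0 < alpha < 2.

    alpha = 1 gives the Bolthausen-Sznitman coalescent (Lambda uniform on [0, 1]).
    """
    return exp(log_beta(k - alpha, b - k + alpha) - log_beta(2 - alpha, alpha))


def kingman_rate(b, k):
    """lambda_{b,k} for Lambda = delta_0: only pairs of blocks merge."""
    return 1.0 if k == 2 else 0.0


def total_rate(b, rate):
    """Total rate of the next coalescence event when b blocks are present."""
    return sum(comb(b, k) * rate(b, k) for k in range(2, b + 1))


if __name__ == "__main__":
    bs = lambda b, k: beta_coalescent_rate(b, k, 1.0)
    for b in (2, 3, 4, 10):
        print(b, total_rate(b, kingman_rate), total_rate(b, bs))
```

For Kingman's coalescent the total rate is $\binom{b}{2}$, and for the Bolthausen-Sznitman coalescent it is $b-1$, which gives a quick sanity check of the implementation.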
Genealogical trees of the alleles in a genetic region come with an interpretation of relatedness and genetic similarity: An allele $i$ is more closely related to allele $j$ than to allele $k$ if the common ancestor of $i$ and $j$ appears more recently than the common ancestor of $i$ and $k$, while the path lengths between leaves (alleles) measure the time available to accumulate mutations that decrease genetic similarity. Several statistics aim to capture these aspects and their biological meaning. For instance, the minimal clade size of an allele $i$ gives the number of closest relatives of $i$, see [2]. Another example is the length of the external branch of $i$, i.e. the waiting time for the first merger of $i$, which gives a measure of the genetic uniqueness of an individual [19]. The mathematical properties of the minimal clade size [2, 9, 23, 8], the length of an external branch [2, 3, 5], as well as the family (partition block) sizes at this time [2, 24], have been analysed recently. In these works, asymptotic and exact behaviours are obtained for various examples of exchangeable coalescents.
However, these statistics cannot be observed directly from the genetic data. We will illustrate this for the minimal clade. By a clade we denote the set of all alleles that share a specific ancestor, and the minimal clade of $i$ is the smallest clade including $i$ and at least one other allele. Assume the infinite-sites model of mutation, in which each mutation causes a change at a different position in the genomic region. Further assume that we know the ancestral state at the genomic region, i.e. we can identify mutations as changes compared to the ancestral state. Any clade can only be observed if there is at least one mutation that all its members share. This mutation is inherited from the common ancestor, and thus has to be placed on the branch that connects this ancestor further towards the root of the genealogy (the most recent common ancestor of the whole sample). Thus, we can only observe a clade if there is a mutation on the branch directly above the ancestor defining this clade.
Instead of looking at the minimal clade of an allele $i$, one could consider the smallest clade which includes $i$ that can be observed from the data. We considered the sizes of these clades for all alleles sampled, the minimal observable clade sizes, in [10]. There we could show that they provide an additional set of statistics that facilitates the inference of a well-fitting genealogy model when coupled with standard statistics of genetic diversity such as the site frequency spectrum.
In this article, we study some mathematical properties of the minimal observable clade size for an individual $i$. Its asymptotic behaviour for any $n$-$\Lambda$-coalescent for $n \to \infty$ as well as a recursion for all moments for finite $n$ are established. For the Bolthausen-Sznitman coalescent, which provides a somewhat universal genealogical model for populations under strong selection, see e.g. [17], [4], [22], we can show that the minimal observable clade size is asymptotically Beta-distributed.
2. A formal definition of the minimal observable clade size
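Let $[n] := \{1,\ldots,n\}$ for $n \in \mathbb{N}$. For any $n$-$\Lambda$-coalescent $(\Pi^{(n)}_t)_{t \ge 0}$ and a sampled allele $i \in [n]$, define
- $C_i(t)$ as the partition block $i$ is in at time $t$ (a size-biased pick of a block of the $n$-coalescent at time $t$)
- $K$ as the total number of jumps of the $n$-$\Lambda$-coalescent, $K_i$ as the total number of jumps (the block of) $i$ participates in
- $J_1 < \ldots < J_{K_i}$ as the successive indices of jumps in the $n$-$\Lambda$-coalescent in which the block of $i$ is involved
- $C_i[k]$ as the partition block $i$ is in at the time of its $k$th jump $J_k$, $k \in [K_i]$. $C_i[0] := \{i\}$, $C_i[1]$ is the minimal clade of $i$, $C_i[K_i] = [n]$.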
Given the $n$-$\Lambda$-coalescent tree, we set mutations on its branches via a homogeneous Poisson point process with rate $\theta$. Mutations are interpreted under the infinite sites model: each mutation hits a site not hit by any other mutation, producing a new type. The new type is called derived type in contrast to the ancestral type of the most recent ancestor of the sample. Mutations on external branches affect only one individual; we will call these private mutations (they can also be referred to as singleton mutations). All other mutations are called non-private mutations. Since we are interested in the mutations carried by individual $i$, we have to record the mutations from time $0$ back to the time of the most recent common ancestor of the sample (the root of the genealogy) on the path of $i$. Let $T_n(i)$ be the waiting time until the first (youngest) mutation on the path of $i$ that is non-private, i.e. does not fall on the external branch which ends in $i$ (which has length $E_n(i)$). If we continue the path of $i$ after reaching the most recent common ancestor as a single ancestral line indefinitely (which we will do from now on), we have
$$T_n(i) = E_n(i) + X$$
for an independent exponential random variable $X$ with rate $\theta$. Let
$$L_n(i) := \max\{k \in [K_i] : \text{the } k\text{th jump of } i \text{ happens before time } T_n(i)\},$$
i.e. the $L_n(i)$th jump that $i$ participates in is the last jump of it before $T_n(i)$. The minimal observable clade of $i$ is then given by
$$\mathcal{O}_n(i) := C_i[L_n(i)] = C_i(T_n(i)).$$
The definitions are equivalent since all non-private mutations of $i$ are inherited from the youngest ancestor of $i$ that bears at least one mutation on the branch connecting it to the next older ancestor. If $i$ has no non-private mutations, $\mathcal{O}_n(i) = [n]$ almost surely, since in this case $T_n(i)$ is larger than the time back to the most recent common ancestor.
The statistic we are interested in is the size of the minimal observable clade of an allele $i$,
$$O_n(i) := |\mathcal{O}_n(i)|.$$
See Figure 1 for an example. Due to exchangeability, the distribution of $O_n(i)$ does not depend on $i$; we can even choose $i$ randomly without changing the distribution. For ease of notation, we fix the allele we are interested in to allele 1 and abbreviate $O_n := O_n(1)$.
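The construction above can be turned into a small simulation. The following is a minimal sketch for Kingman's $n$-coalescent only, assuming mutations hit each branch at rate $\theta$ and using competing exponential clocks (pair mergers at total rate $\binom{b}{2}$ versus mutations on the lineage of allele 1). The function name and the block-size bookkeeping are illustrative choices, not taken from the paper.

```python
import random


def simulate_O_n(n, theta, rng):
    """Draw one value of O_n for allele 1 under Kingman's n-coalescent.

    sizes[0] always holds the size of the block containing allele 1.
    """
    sizes = [1] * n
    while len(sizes) > 1:
        b = len(sizes)
        coal_rate = b * (b - 1) / 2
        # Next relevant event: a mutation on the lineage of allele 1 (rate theta)
        # or a merger of some pair of blocks (rate b choose 2).
        if rng.random() < theta / (theta + coal_rate):
            if sizes[0] > 1:
                # First non-private mutation: its carriers form the observable clade.
                return sizes[0]
            continue  # private mutation on the external branch, ignored
        i, j = sorted(rng.sample(range(b), 2))
        sizes[i] += sizes[j]  # since i < j, the block of allele 1 stays at index 0
        sizes.pop(j)
    # The path of allele 1 is continued beyond the root, so the first
    # non-private mutation is then shared by the whole sample.
    return n


if __name__ == "__main__":
    rng = random.Random(2024)
    draws = [simulate_O_n(100, 1.0, rng) for _ in range(5000)]
    print(sum(draws) / (100 * len(draws)))  # empirical mean of O_n / n
```

Tracking all block sizes keeps the sketch simple; it is not optimised for large $n$.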
Remark 2.1.
Since the partition block including $i$ can only increase in size over time, the minimal clade $C_i[1]$ is a subset of the minimal observable clade $\mathcal{O}_n(i)$ for $i \in [n]$. Thus, the size of the minimal clade of $i$ satisfies $|C_i[1]| \le O_n(i)$. See Figure 1 for an example.
3. Asymptotics of the observable clade size
Asymptotically for sample size $n \to \infty$, the probabilistic structure of $O_n$ simplifies considerably. First, we focus on coalescents without dust, a class which includes Beta($2-\alpha$,$\alpha$)-$n$-coalescents for $\alpha \in (1,2)$ [21], Kingman’s $n$-coalescent ($\Lambda = \delta_0$) and the Bolthausen-Sznitman $n$-coalescent ($\Lambda$ uniform on $[0,1]$). A $n$-$\Lambda$-coalescent has dust if and only if $\int_0^1 x^{-1}\Lambda(dx) < \infty$, see [18].
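Theorem 3.1.
Let $O_n$ be defined for $n$-$\Lambda$-coalescents such that $\int_0^1 x^{-1}\Lambda(dx) = \infty$ (without dust) and with mutation rate $\theta$. We have
$$\frac{O_n}{n} \to O \quad \text{almost surely}$$
for $n \to \infty$ with $O > 0$ a.s.. $O$ is distributed as the size of the block of individual 1 (alternatively a size-biased pick of a block size) at a random time $T \sim \mathrm{Exp}(\theta)$. The distribution of $O$ is uniquely determined by its moments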
[TABLE]
where is the total rate of the -coalescent in a state with blocks and is a rational function of , defined as in [18, Prop. 29]. In particular,
[TABLE]
Proof.
Let $E_n$ be the waiting time for the first collision of individual 1 in the $n$-coalescent. By the consistency of the $n$-coalescents, we have $E_{n+1} \le E_n$. By a slight adaptation of [18, Prop. 26], we see that $E_n \to 0$ in probability for $n \to \infty$. Since $E_n$ is monotonically decreasing, this convergence also holds almost surely.
All mutations of individual 1 on any $n$-coalescent lie on the path of leaf 1 to the root of the coalescent tree and are consistent for different values of $n$ (any mutation on the path to the root in the $n$-coalescent is also a mutation on this path in every $m$-coalescent with $m > n$), since for $n < m$, the $n$-coalescent (seen as a tree) is the subtree of the $m$-coalescent which is spanned by the leaves $[n]$, including mutations. Thus, we can represent the mutations of individual 1 on the $n$-coalescent by one common homogeneous Poisson process for all $n$, independent of the coalescents, on $[0,\infty)$ with rate $\theta$. This gives a new representation for $T_n$, the waiting time for the first non-private mutation of individual 1: it is the smallest Poisson point $P$ with $P > E_n$. Let $T$ be the smallest Poisson point overall. Since $E_n \to 0$ a.s. for $n \to \infty$, for any realisation of the coalescent there exists a $n_0$ s.t. $E_n < T$ a.s. for all $n \ge n_0$. This shows that $T_n = T$ a.s. for $n \ge n_0$, which implies
$$\frac{O_n}{n} \to f_1(T) =: O \quad \text{a.s. for } n \to \infty, \qquad (5)$$
where $f_1(T)$ is the (asymptotic) frequency of the block individual 1 is in at time $T$ in the $\Lambda$-coalescent (with values in the partitions of $\mathbb{N}$, see [18]). The existence of the limit follows from Kingman’s correspondence, since the coalescent stopped at the random time $T$ (independent of the coalescent) gives an exchangeable partition of $\mathbb{N}$. Since we have a coalescent without dust, we have no singleton blocks a.s. at any time $t > 0$ and a (potentially infinite) number of blocks with a.s. positive frequencies, again due to Kingman’s correspondence. This shows $O > 0$ a.s.. Due to exchangeability, the distribution of $O$ is the same as if we would make a size-biased pick from all blocks present.
Denoting by , consider the moments
[TABLE]
From [18, Eq. (50) and Prop. 29] we see that due to a connection to the exchangeable partition function of the coalescent at time . Thus, we have
[TABLE]
which is Eq. (4). Using the explicit values (essentially from [18, Eq. (39),(40)]) and yields the first two moments. Since $O$ takes values in $[0,1]$, its distribution is uniquely determined by its moments. ∎
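The moments of the limit can also be evaluated numerically by an elementary route. The following is a minimal sketch, not the representation via [18, Prop. 29] used in the theorem: it assumes that for a coalescent without dust, $E[O^k]$ equals the probability that individuals $1,\ldots,k+1$ share a block at the independent $\mathrm{Exp}(\theta)$ time $T$ (a consequence of Kingman's correspondence). That probability is $E[e^{-\theta \tau_{k+1}}]$, where $\tau_{k+1}$ is the time to the most recent common ancestor of $k+1$ lineages. A first-step analysis of the block-counting chain then gives a finite recursion. All function names are illustrative.

```python
from math import comb, factorial


def kingman_rate(b, k):
    """lambda_{b,k} for Kingman's coalescent (Lambda = delta_0)."""
    return 1.0 if k == 2 else 0.0


def bs_rate(b, k):
    """lambda_{b,k} for the Bolthausen-Sznitman coalescent (Lambda uniform)."""
    return factorial(k - 2) * factorial(b - k) / factorial(b - 1)


def limit_moment(k, theta, rate):
    """E[O^k] computed as E[exp(-theta * tau_{k+1})] by first-step analysis.

    phi[b] is the Laplace transform at theta of the time until b lineages
    have merged into a single block.
    """
    phi = {1: 1.0}
    for b in range(2, k + 2):
        merger_rates = [(j, comb(b, j) * rate(b, j)) for j in range(2, b + 1)]
        total = sum(r for _, r in merger_rates)
        # A merger of j blocks leaves b - j + 1 blocks.
        phi[b] = sum(r * phi[b - j + 1] for j, r in merger_rates) / (total + theta)
    return phi[k + 1]


if __name__ == "__main__":
    for name, rate in (("Kingman", kingman_rate), ("Bolthausen-Sznitman", bs_rate)):
        print(name, [limit_moment(k, 1.0, rate) for k in (1, 2, 3)])
```

Under this assumption the first moment reduces to $\lambda_2/(\lambda_2+\theta)$, which for Kingman's and the Bolthausen-Sznitman coalescent equals $1/(1+\theta)$.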
For the special case of the Bolthausen-Sznitman coalescent, the law of $O$ can be identified.
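Theorem 3.2.
For the Bolthausen-Sznitman $n$-coalescent,
$$\frac{O_n}{n} \to O \quad \text{almost surely}$$
for $n \to \infty$ where, conditionally on $T \sim \mathrm{Exp}(\theta)$, $O \sim \mathrm{Beta}(1-e^{-T}, e^{-T})$.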
Proof.
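[18, Corollary 16] shows that for the Bolthausen-Sznitman coalescent, the frequency $f_1$ of the block of individual 1 jumps at independent standard exponential times with ranked jump sizes given by a Poisson-Dirichlet distribution with parameters $(0,1)$. The set of jump times is independent of the set of jump sizes. Comparing this with Eq. (5), we see that to compute $O$, we need to sum the sizes of all jumps of $f_1$ that happen before or at $T$. Consider the jump times $(\tau_j)_{j \in \mathbb{N}}$ of $f_1$ ordered according to the rank of their jump sizes $(J_j)_{j \in \mathbb{N}}$. Define $m_j := 1_{\{\tau_j \le T\}}$. Hence, conditionally on $T$, we have $P(m_j = 1) = 1 - e^{-T}$. We can now express
$$O = \sum_{j \ge 1} m_j J_j. \qquad (7)$$
In other words, $O$ can be seen as summing up a random thinning of a standard Poisson-Dirichlet distributed random variable.
Conditionally on $T$, the random variable $O$ is Beta distributed. To see this, we will use the construction of the Poisson-Dirichlet distribution with parameters $(0,a)$ from [14], which is also summarised in [1, Section 4.11]. Consider the points $(X_j)_{j \in \mathbb{N}}$ of a Poisson point process on $(0,\infty)$ with mean measure $a x^{-1} e^{-x}\,dx$. Then, the size-ordered and normalized points $(X_{(j)}/S)_{j \in \mathbb{N}}$ with $S := \sum_{j} X_j$ have the Poisson-Dirichlet distribution with parameters $(0,a)$ and are independent of
$$S \sim \Gamma(a,1), \qquad (8)$$
where $\Gamma(a,1)$ is the Gamma distribution with shape parameter $a$ and rate $1$.
We choose $a = 1$ and make the correspondence between the ranked and normalised points $X_{(j)}/S$ and the jump sizes $J_j$. To express Eq. (7), we give each point $X_{(j)}$ a mark $m_j \in \{0,1\}$. Given $T$, marks are independent from the points and from one another. We set the probability to be marked with 1 to $1 - e^{-T}$ for all $j$. $(X_{(j)}, m_j)_{j \in \mathbb{N}}$ is a marked Poisson process. The Colouring Theorem [15, Section 5.1] shows that all points with marks 1 form a Poisson point process $\Phi_1$ with mean measure $(1-e^{-T}) x^{-1} e^{-x}\,dx$, while all points with mark 0 form an independent Poisson point process $\Phi_0$ with mean measure $e^{-T} x^{-1} e^{-x}\,dx$.
We can now alternatively express (7) as
$$O = \frac{S_1}{S_1 + S_0},$$
where $S_1 := \sum_{x \in \Phi_1} x$ and $S_0 := \sum_{x \in \Phi_0} x$ are independent due to the independence of $\Phi_1$ and $\Phi_0$. Since the mean measures of $\Phi_1$ and $\Phi_0$ are of the form $a x^{-1} e^{-x}\,dx$ with $a$ equal to $1 - e^{-T}$ and $e^{-T}$, Eq. (8) yields $S_1 \sim \Gamma(1-e^{-T},1)$ and $S_0 \sim \Gamma(e^{-T},1)$. Thus, conditionally on $T$, $O$ is Beta-distributed with parameters $1-e^{-T}$ and $e^{-T}$. ∎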
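The Gamma-ratio representation at the end of the proof lends itself to direct sampling. The following is a minimal sketch, assuming $T \sim \mathrm{Exp}(\theta)$ and, given $T$, independent $S_1 \sim \Gamma(1-e^{-T},1)$ and $S_0 \sim \Gamma(e^{-T},1)$, so that $S_1/(S_1+S_0)$ is Beta$(1-e^{-T}, e^{-T})$-distributed. The function name is illustrative. The sample mean can be compared with $1/(1+\theta)$, the first moment obtained from pairwise coalescence at rate $\lambda_2 = 1$.

```python
import math
import random


def sample_limit_bs(theta, rng):
    """One draw of the limit O for the Bolthausen-Sznitman coalescent."""
    t = rng.expovariate(theta)   # time of the first mutation, T ~ Exp(theta)
    p = -math.expm1(-t)          # 1 - exp(-T): probability that a jump is marked
    q = math.exp(-t)
    if p <= 0.0:                 # guard against T rounding to 0
        return 0.0
    if q <= 0.0:                 # guard against exp(-T) underflowing to 0
        return 1.0
    s1 = rng.gammavariate(p, 1.0)  # total mass of the marked points
    s0 = rng.gammavariate(q, 1.0)  # total mass of the unmarked points
    total = s1 + s0
    return s1 / total if total > 0.0 else 0.0


if __name__ == "__main__":
    rng = random.Random(11)
    theta = 1.0
    xs = [sample_limit_bs(theta, rng) for _ in range(100_000)]
    print(sum(xs) / len(xs), 1.0 / (1.0 + theta))
```

Since the Beta parameters depend on the random time $T$, the unconditional law of the draws is a mixture of Beta distributions.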
Remark 3.3.
Theorem 3.1 can be generalised for some time-changed $n$-$\Lambda$-coalescents without dust, which appear when modelling genealogies in Cannings models with moderate fluctuations in population size, see [12], [16], [25] and [7]. Let $(\Pi_{G(t)})_{t \ge 0}$ be a time-changed $\Lambda$-coalescent, where $G(t) = \int_0^t g(s)\,ds$ with continuous $g > 0$, which includes some time changes proposed for $\Lambda$-coalescents in the references above. Observe that $G$ is continuous, monotone and invertible with differentiable inverse. The time-changed $\Lambda$-coalescent is still exchangeable. The almost sure convergence of $O_n/n$ for the time-changed $\Lambda$-coalescent works analogously as in Theorem 3.1. The time of the first merger of individual 1 is $G^{-1}(E_n)$, where $E_n$ is this time in the original $n$-$\Lambda$-coalescent, and thus also converges to 0 almost surely. The waiting time for the first mutation on the path of 1 to the root is an $\mathrm{Exp}(\theta)$-distributed random variable $T$, but on the time-changed path of 1. Thus, the limit of $O_n/n$ for the time-changed $\Lambda$-coalescent is the frequency of the block containing 1 at time $T$ in $(\Pi_{G(t)})_{t \ge 0}$. This can also be expressed as $f_1(G(T))$, where $f_1$ is said frequency in the $\Lambda$-coalescent $(\Pi_t)_{t \ge 0}$. The distribution of $G(T)$ is given by
$$P(G(T) \le t) = P(T \le G^{-1}(t)) = 1 - e^{-\theta G^{-1}(t)}$$
for $t$ in the range of $G$, which has density $\theta\,(G^{-1})'(t)\,e^{-\theta G^{-1}(t)}$. Analogously to (6) we can thus express, in terms of the quantities from Theorem 3.1, the $k$th moment of $O$ for the $\Lambda$-coalescent with exponential growth as
[TABLE]
As an example, consider Kingman’s -coalescent with exponential growth with rate . From [12], we see that and thus . This leads to moments
[TABLE]
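Now, consider coalescents with dust which stay infinite. An example for this are Dirac $n$-coalescents with $\Lambda = \delta_p$, $p \in (0,1)$ [6]. For $\Lambda$-coalescents with dust, staying infinite is equivalent to $\Lambda(\{1\}) = 0$.
Theorem 3.4.
Let $O_n$ be defined for $n$-$\Lambda$-coalescents with $\int_0^1 x^{-1}\Lambda(dx) < \infty$ (with dust), $\Lambda(\{1\}) = 0$ and with mutation rate $\theta$. We have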
[TABLE]
for with a.s.. We have with .
Proof.
From [8, Thm. 1], we see that the asymptotic frequency of the block of 1 forms an increasing jump-hold process with with values in , positive jumps and with i.i.d. waiting times between jumps. It fulfills , where is the value of at its th jump. We record between which indices of jumps , of the waiting time for the first non-private mutation falls. From the proof of [8, Cor. 1], we know that there exists so that equals the time of the first jump of for all almost surely. Similarly to the proof of Theorem 3.1, we just have to trace back the first mutation after this first jump whose time of appearance does not depend on . This implies that falls between the same for all . Thus we have a.s., where is the state of at the th jump. We only need to find the distribution of . For , is the waiting time for the first jump of plus an independent random variable . Using that the waiting times between the jumps of are i.i.d., we have , where is defined by and thus on . This yields on . We compute
[TABLE]
∎
4. Recursions for the moments
To obtain recursive formulae for the moments of $O_n$ for a $n$-$\Lambda$-coalescent, we first need to introduce $B_n$, the size of the block of 1 at the exponential clock of rate $\theta$ in the $n$-coalescent.
Theorem 4.1.
Let . The th moments of and satisfy the following recursions: and
[TABLE]
and
[TABLE]
where , and
Proof.
Our proofs rely on tracking the number of blocks involved in the first jump of the --coalescent, with some additional condition(s). Let us first prove (9). Let be the waiting time for the first coalescence in the -coalescent which is -distributed. Let be the event that the first block merged is part of the block of 1 stopped at .
[TABLE]
Now let us turn to the proof of (10). Let be the event that 1 does participate in the first coalescence event. Also let be the event that the first block merged is part of the observed clade.
[TABLE]
Expanding, we obtain the result. ∎
Remark 4.2.
For Kingman’s -coalescent, (9) and (10) considerably simplify. In particular the two first moments of are
[TABLE]
and
[TABLE]
and the two first moments of are
[TABLE]
and
[TABLE]
FF was funded by DFG grant FR 3633/2-1 through Priority Program 1590: Probabilistic Structures in Evolution.
References
- [1] Richard Arratia, Andrew D. Barbour, and Simon Tavaré. Logarithmic combinatorial structures: A probabilistic approach. European Mathematical Society (EMS), Zürich, 2003.
- [2] Michael G.B. Blum and Olivier François. Minimal clade size and external branch length under the neutral coalescent. Adv. in Appl. Probab., 37(3):647–662, 2005.
- [3] Amke Caliebe, Ralph Neininger, Michael Krawczak, and Uwe Rösler. On the length distribution of external branches in coalescence trees: Genetic diversity within species. Theor. Pop. Biol., 72(2):245–252, 2007.
- [4] Michael M. Desai, Aleksandra M. Walczak, and Daniel S. Fisher. Genetic diversity and the structure of genealogies in rapidly adapting populations. Genetics, 193(2):565–585, 2013.
- [5] Jean-Stéphane Dhersin, Fabian Freund, Arno Siri-Jégousse, and Linglong Yuan. On the length of an external branch in the Beta-coalescent. Stochastic Process. Appl., 123(5):1691–1715, 2013.
- [6] Bjarki Eldon and John Wakeley. Coalescent processes when the distribution of offspring number among individuals is highly skewed. Genetics, 172(4):2621–2633, 2006.
- [7] Fabian Freund. Cannings models, populations size changes and multiple-merger coalescents. Preprint on arXiv, 2019.
- [8] Fabian Freund and Martin Möhle. On the size of the block of 1 for $\Xi$-coalescents with dust. Modern Stoch. Theory Appl., 4(4):407–425, 2017.
