Coalescent theory of the ψ directionality index

Egor Lappo; Noah A Rosenberg

PMC · DOI:10.1093/g3journal/jkaf202·September 8, 2025

Coalescent theory of the ψ directionality index

Egor Lappo, Noah A Rosenberg

PDF

Open Access

TL;DR

This paper provides a theoretical foundation for the ψ directionality index, explaining its behavior under different demographic scenarios and improving its use in studying range expansions.

Contribution

The paper derives statistical properties of ψ using coalescent theory and identifies parameter combinations that affect its behavior.

Findings

01

The expectation E[Ψ] can be the same for different demographic scenarios.

02

The variance V[Ψ] increases with time since the bottleneck and bottleneck severity.

03

Ancestral population size affects ψ only through the number of loci used in computation.

Abstract

The ψ directionality index was introduced by Peter and Slatkin (Evolution 67: 3274–3289, 2013) to infer the direction of range expansions from single-nucleotide polymorphism variation. Computed from the joint site frequency spectrum for two populations, ψ uses shared genetic variants to measure the difference in the amount of genetic drift experienced by the populations, associating excess drift with greater distance from the origin of the range expansion. Although ψ has been successfully applied in natural populations, its statistical properties have not been well understood. In this article, we define Ψ as a random variable originating from a coalescent process in a two-population demography. For samples consisting of a pair of diploid genomes, one from each of two populations, we derive expressions for moments E[Ψk] for standard parameterizations of bottlenecks during a founder…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Drosophila melanogaster

Figures6

Click any figure to enlarge with its caption.

Fig. 1 — Four demographic scenarios. a) Population split: an ancestral population C splits into two populations A and B at time t. b) Exponential growth: after the split, populations A and B grow at rates rA and rB, respectively. c) Bottleneck: after the split, population A goes through a bottleneck of population size Nb and duration tb. d) Instantaneous bottleneck: after the split, population A goes through a burst of coalescences of strength s. Times t and t−tb are measured in generations back from the present.

Fig. 2 — The six tree topologies possible for samples of two lineages each from two populations A and B. Trees are labeled by Greek letters. Branches relevant to the calculation of E[Ψ]—branches that are ancestral to lineages from both populations—are labeled by bβ, bγ, etc.

Fig. 3 — Distinguishing locations of the cherry coalescence for topologies β and γ. The ancestral population is C and the descendant populations are A (left) and B (right).

Fig. 4 — Values of E[Ψ] and V[Ψ] under the founder effect demography of Fig. 1d. The population sizes are constant, with NA=400, NB=600, and NC=1,000. The values are computed using Equation (32a) for E[Ψ] and Equation (32c) for V[Ψ]. a) E[Ψ] for varying s, t=10. b) E[Ψ] for varying s, t=80. c) E[Ψ] for varying s, t=120. d) V[Ψ] for varying s, t=10. e) V[Ψ] for varying s, t=80. f) V[Ψ] for varying s, t=120.

Fig. 5 — Theoretical values of E[Ψ] (Equation (32a)) for varying values of NA and NB in the founder effect demography of Fig. 1d. Parameters s and t are fixed, with s=20 and t=50. The black line shows parameter sets (NA,NB) for which E[Ψ]=0.

Fig. 6 — Theoretical values of E[Ψ] (Equation (32a)) for varying values of t and NB in the founder effect demography of Fig. 1d. Parameters s and NA are fixed, with s=20 and NA=500. The black line shows parameter sets (t,NB) for which E[Ψ]=0.

Tables1

Table 1.. Inferred demographic parameters reported for European and African D. melanogaster populations in the previous studies that used X-chromosomal loci, and corresponding model-predicted values of E[Ψ] and V[Ψ].

Article	t	$N_{AFR}$	European bottleneck	$E [Ψ] (EUR, AFR)$	$V [Ψ] (EUR, AFR)$
Li and Stephan (2006)	158,000	8,603,000	$N_{EUR, b} = 2,200$ for $t_{b} = 3,400$ generations, then $N_{EUR} = 1,075,000$	0.3950	0.5372
Laurent et al. (2011)	168,490	3,589,770	Exponential growth from $N_{EUR, t} = 16,550$ to $N_{EUR, 0} = 1,224,378$	0.5165	0.4970
Duchen et al. (2013)	194,984	4,975,360	Exponential growth from $N_{EUR, t} = 16,982$ to $N_{EUR, 0} = 3,122,470$	0.4912	0.5092

Equations67

Funding1

—National Institutes of Health10.13039/100000002

Keywords

bottleneckcoalescent theorydirectionality indexfounder effectrange expansion

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsChemical Thermodynamics and Molecular Structure · Computational Drug Discovery Methods

Full text

Introduction

Inference of the demographic history of populations—including their population-size changes and relationships with other populations—is a major objective of statistical population genetics (e.g. Marchi et al. 2021). The combination of statistical methods based on the coalescent theory with extensive genetic data has enabled researchers to investigate diverse features of demographic histories (e.g. Pool et al. 2010).

One of the most fundamental ways in which genetic data can be summarized for statistical analysis is by the site frequency spectrum (SFS), which counts the numbers of sites—typically single-nucleotide polymorphisms (SNPs)—that are present in different multiplicities in a sample (e.g. Wakeley and Hey 1997; Achaz 2009). Comparisons of the empirical SFS in a population to predictions of a coalescent model can detect phenomena, such as bottlenecks, expansions, or selective sweeps (e.g. Ferretti et al. 2010; Ronen et al. 2013). The SFS has received extensive theoretical treatment under many demographic scenarios and has often been applied for inference in real populations (e.g. Nielsen et al. 2005; Thornton and Andolfatto 2006).

In data from multiple populations, a joint SFS can be defined that records SNP allele frequencies in each population (Gutenkunst et al. 2009). A joint SFS enables inference of processes, such as admixture, migration, and differences in selection between populations (Caicedo et al. 2007; Nielsen et al. 2009; Excoffier et al. 2013; Zhan et al. 2014; Arguello et al. 2019; Liu and Fu 2020). In the setting of population pairs, Peter and Slatkin (2013) proposed a statistic, the ψ directionality index, which is computed from the joint SFS for the two populations. This index was designed for characterizing the process of range expansion, in which a population sequentially settles locations increasingly distant from its origin (e.g. Ramachandran et al. 2005; Excoffier and Ray 2008; Excoffier et al. 2009).

In a range expansion, the leading edge of the expansion experiences stronger genetic drift relative to the point of origin (e.g. Hallatschek and Nelson 2008; Slatkin and Excoffier 2012; Peter and Slatkin 2015; Peischl and Excoffier 2016). In the genetic history of individuals at the edge of the expansion, the range expansion process can manifest as a sequence of population size bottlenecks, as increasingly distant geographic locations are settled (e.g. DeGiorgio et al. 2009; Deshpande et al. 2009; DeGiorgio et al. 2011).

For two populations that are part of the expansion, the ψ directionality index seeks to identify the direction of the expansion. The approach relies on the fact that if a given derived allele is shared between the two populations, then its frequency is expected to be higher in the derived population at the edge of the range expansion than in the source population (e.g. Edmonds et al. 2004; Klopfstein et al. 2006; Excoffier and Ray 2008; Schlichta et al. 2022). Alleles at low frequency in the source population are likely to be lost during the expansion and therefore would not be shared. The derived population has a smaller founding population size than the source population, so that alleles—if they are not entirely absent—tend to possess greater frequencies. The ψ index considers the population differences of allele frequencies specifically in the shared genetic variation between the two populations.

Among pairwise quantities that can be computed as summary statistics useful for interpreting population-genetic data (e.g. $[eqn]$ ), the ψ index stands out as a signed quantity. For two populations A and B, the order of the populations matters, with $[eqn]$ . Therefore, whereas $[eqn]$ is often seen as a genetic measure of distance, ψ is akin to a vector directed from one population to another (see also Peter and Slatkin 2013, Figs. 5 and 7).

The ψ index was first defined by Peter and Slatkin (2013), who developed a method that integrates information about pairwise ψ with geographic distances between sampling locations to identify coordinates of the expansion origin. They then applied it to simulated scenarios including isolation-by-distance and range expansion on a grid of populations, as well as to complex configurations involving migration barriers.

Peter and Slatkin (2015) then studied theoretical properties of ψ in a discrete time-expansion model. The model consisted of a linearly arranged set of demes with equal population size, with a single deme $[eqn]$ settled initially and the rest of the demes empty. At an integer timepoint t, a new deme $[eqn]$ is settled by individuals from the previous deme $[eqn]$ . The quantity of interest was $[eqn]$ at time t between the origin deme and the leading edge of the expansion. Peter and Slatkin (2015) showed that in the model, the expected value of ψ between the source and the leading edge of the expansion—which has experienced a sequence of founder events—depends on the relative founder sizes of settlement events (the fraction of individuals selected from deme $[eqn]$ to settle $[eqn]$ ) and the number of founder events, equal to t in their scaling of time. Peter and Slatkin (2015) used ψ to identify the expansion origin for natural populations of Arabidopsis thaliana, which they presumed to have expanded spatially in a manner compatible with a linear arrangement of demes.

Several recent uses of ψ have since sought to examine scenarios where, instead of an expansion over a linear spatial dimension, an expansion involves pairwise computations for a small number of discrete demes, as few as two. For example, Zhan et al. (2014) examined the expansion of monarch butterflies from North America to South America, the Pacific, and Europe, computing ψ between a source population in North America and a destination population elsewhere. Puckett and Munshi-South (2019) examined the expansion of brown rats from Eastern Asia to the Middle East, the Middle East to Europe, and Europe to North America, computing ψ between pairs of populations in two different geographic regions. Ioannidis et al. (2021) similarly used pairwise values of ψ between pairs of human populations of different Pacific islands to understand sequences of events in the human settlement of the region.

In this article, building from the interest in using ψ for expansions involving small numbers of discrete populations rather than many demes along a spatial continuum, we examine the ψ statistic theoretically in the simplest discrete-deme structured population: a pair of populations. We define Ψ as a random variable arising from the coalescent process and derive expressions for moments of Ψ under the coalescent. We focus on the scenario in which a single diploid individual is sampled in each of a pair of populations. Next, we consider specific commonly used parameterizations of range expansions in the setting of population pairs, explicitly incorporating exponential growth, bottlenecks, and instantaneous bottlenecks (Fig. 1). We then explore theoretical predictions for the expectation $[eqn]$ and variance $[eqn]$ , interpreting them in terms of the reliability of inferences and the identifiability of demographic scenarios. We use the central limit theorem to analyze the sample variance of the ψ index computed from many SNPs. Finally, we show how our results can be used in the evaluation of empirical inferences of demographic parameters for real populations.

Four demographic scenarios. a) Population split: an ancestral population C splits into two populations A and B at time t. b) Exponential growth: after the split, populations A and B grow at rates rA and rB, respectively. c) Bottleneck: after the split, population A goes through a bottleneck of population size Nb and duration tb. d) Instantaneous bottleneck: after the split, population A goes through a burst of coalescences of strength s. Times t and t−tb are measured in generations back from the present.

Coalescent-based definition of the ψ index

ψ for a pair of genomes

The directionality index ψ is a two-population statistic computed from allele frequencies for a set of biallelic SNPs. For the rest of this article, we assume that the derived and ancestral alleles are known for each SNP, and we call the SNP shared between two populations if the derived allele is present in at least one copy in both populations and the SNP is polymorphic in the pooled pair of populations.

Suppose now that we know allele frequencies for a set of SNPs in two populations A and B. In its most general form, the value of the ψ index is then defined as

[eqn]

where S is the set of SNPs shared between the two populations, $[eqn]$ is the frequency of the derived allele of SNP j in population A, and $[eqn]$ is its frequency in population B (Peter and Slatkin 2015, Equation 1).

We proceed by focusing on the simplest case in which ψ can be meaningfully studied in two populations. In particular, if allele frequencies are computed using a single diploid individual sampled from each population, and if shared SNPs are identified based on this pair of individuals, then the expression for ψ reduces to

[eqn]

where $[eqn]$ is the number of SNPs that have i copies of the derived allele in the individual from population A and j copies in the individual from population B, and i and j can each equal 0, 1, or 2.

The random variable Ψ under the coalescent

We now analyze the ψ index as a random variable under the coalescent. We use the notation Ψ to distinguish the theoretical random variable for the directionality index from the empirical ψ computed from data.

We assume that all SNPs are unlinked, so that coalescent trees for different SNPs are independent. We also assume that SNPs obey the standard infinitely-many-sites mutation model (Durrett 2008, p. 29), such that each SNP results from a single mutation on a coalescent tree. Finally, we assume that we have specified a demographic history for populations A and B (Fig. 1), and that a single diploid individual is sampled from each population. Such a sample configuration—one diploid individual with sample size 2 alleles in each population—allows us to use the simplified Equation (2).

Conditional on the demography and a sample of size 2 from each of a pair of populations, the coalescent model defines a probability distribution over the genealogies of lineages from A and B. In this framework, we can determine the expectation $[eqn]$ of the directionality index under the coalescent model for a single SNP shared between populations A and B. We use $[eqn]$ with $[eqn]$ to denote higher moments of the random variable Ψ under the coalescent.

To compute $[eqn]$ , we consider probabilities under the coalescent model of entries in the joint site frequency spectrum for populations A and B, conditional on the demography:

[eqn]

where $[eqn]$ is the probability that a randomly sampled mutation—that is, the derived variant of a random SNP on the genealogy of four lineages—occurs in i copies in population A and in j copies in population B. For example, $[eqn]$ is the probability that for a random SNP, the diploid sample from population A has one ancestral and one derived allele, and the sample from population B is homozygous with two copies of the derived allele. As a shorthand, we will say that such a SNP has “type 12,” and we indicate other elements of the site frequency spectrum similarly.

Suppose now that we have sampled a single shared SNP. First, the probabilities of a SNP having a specific type are obtained from the site frequency spectrum $[eqn]$ by conditioning on being shared,

[eqn]

[eqn]

[eqn]

Further, taking the total number of sampled SNPs to be 1 in Equation (2), we know that the value of Ψ is constant for all SNPs of the same type, with

[eqn]

[eqn]

[eqn]

Combining Equation (4) with Equation (5), we can write the definition of random variable Ψ for a single shared SNP:

[eqn]

The expectation of Ψ can then be straightforwardly computed as

[eqn]

The second moment of Ψ is

[eqn]

The higher moments of Ψ can be computed similarly, with

[eqn]

[eqn]

In the remainder of this article, we discuss only the expectation $[eqn]$ and the variance $[eqn]$ , as the other moments can be obtained from these cases.

The only remaining quantities we need are the $[eqn]$ : the probabilities that a randomly chosen SNP has i copies of the derived allele in the sample of size 2 from population A and j copies in the sample of size 2 from population B, with $[eqn]$ , $[eqn]$ , or $[eqn]$ . In other words, under a coalescent-based demographic model with infinitely-many-sites mutation, we seek to compute, as a fraction of all SNPs, the number that occur on genealogical branches ancestral to i copies in population A and j copies in population B.

In a random genealogy, the expected total number of SNPs with type ij is $[eqn]$ , where $[eqn]$ is the total length of branches ancestral to i lineages from population A and j lineages from population B, and $[eqn]$ is the Poisson mutation rate along a branch. $[eqn]$ is computed by considering each topology separately:

[eqn]

where $[eqn]$ is the probability that topology τ occurs and $[eqn]$ is the length of branches ancestral to i lineages from A and j lineages from B in genealogies with topology τ. The value of $[eqn]$ is then proportional to

[eqn]

Because we regard lineages within populations as exchangeable—so that we do not distinguish between two lineages from the same population—six topologies must be considered in Equation (12) (Fig. 2). We denote the six topologies α, β, γ, δ, $[eqn]$ , ζ. The topology probabilities $[eqn]$ and the expected branch lengths $[eqn]$ can be computed for various demographic models, so that Equation (12) can be calculated and hence also Equations (7) and (8). In the next section, we compute these quantities for simple models representing a founder effect.

The six tree topologies possible for samples of two lineages each from two populations A and B. Trees are labeled by Greek letters. Branches relevant to the calculation of E[Ψ]—branches that are ancestral to lineages from both populations—are labeled by bβ, bγ, etc.

Expectation and variance of Ψ for specific demographic models

The exact expressions for topology probabilities and branch lengths in Equation (12) depend on specific parameterizations of the demographic history. In this section, we derive expressions for $[eqn]$ and moments of Ψ in Equations (7) and (8) for the four demographies shown in Fig. 1.

Population split

We first consider a simple population split demography (Fig. 1a). A single ancestral population C of size $[eqn]$ splits into two populations t generations ago. The two resulting populations A and B have sizes $[eqn]$ and $[eqn]$ individuals, respectively.

To compute $[eqn]$ and $[eqn]$ , we compute topology probabilities and relevant branch length expectations for each topology in Fig. 2. Computations with topology α are not needed because this topology cannot generate shared polymorphisms. For topology probabilities $[eqn]$ and $[eqn]$ , we must further distinguish the tree topologies based on the population in which the “cherry coalescence” happens. We denote by $[eqn]$ the probability of the topology β in which the coalescence (B,B) happens in population B; if the coalescence (B,B) happens in the ancestral population C, then we denote the probability by $[eqn]$ , with similar notation for topology γ. These additional topology labels are depicted in Fig. 3.

Distinguishing locations of the cherry coalescence for topologies β and γ. The ancestral population is C and the descendant populations are A (left) and B (right).

First, we compute $[eqn]$ . The two lineages from population B must coalesce in population B; the probability of this event is $[eqn]$ . The two lineages from population A must not coalesce until they enter population C; the probability of that event is $[eqn]$ . As a result, two lineages from A and one lineage from B enter population C. In population C, lineages from A and B must coalesce first; the probability of this event is $[eqn]$ ; in the remaining $[eqn]$ of cases, the two A lineages coalesce first. As a result, we obtain $[eqn]$ . This derivation modifies the two-population calculation of Tajima (1983) by allowing for different population sizes for populations A and B rather than assuming their exchangeability. Using the same logic for other tree topologies, we obtain the following probabilities:

[eqn]

[eqn]

[eqn]

[eqn]

For β and γ, summing the probabilities of the two cases, we get

[eqn]

[eqn]

We now compute expected lengths of relevant branches for specific sample configurations. All branches below the gene tree root that are shared by at least one pair of lineages from different populations are labeled in Fig. 2. For example, branch $[eqn]$ is ancestral to two lineages from population B and one lineage from population A for topology β; similarly, branch $[eqn]$ is ancestral to one lineage from population A and one lineage from population B if the genealogy has topology ζ.

With Equation (12), we obtain equations for entries of the expected joint site frequency spectrum $[eqn]$ :

[eqn]

[eqn]

[eqn]

In the final expressions for moments of Ψ (Equations (7) and (8)), the mutation rate cancels because all branches that can generate shared sites can only appear in population C, so that $[eqn]$ in all parts of Equation (15).

We are now left with calculating the expected branch lengths. Because polymorphisms shared between populations A and B can only result from mutations in the ancestral population C, our branch length calculations need only consider coalescent theory in a single population of size $[eqn]$ individuals. In particular, expectations of branch lengths $[eqn]$ and $[eqn]$ are equal to the expectation of the time $[eqn]$ to coalescence of two lineages in the diploid population of size $[eqn]$ , so that

[eqn]

Similar logic applies to trees $[eqn]$ and ζ, with expectations of $[eqn]$ and $[eqn]$ equaling the expectation of the time $[eqn]$ to the first coalescence with three lineages,

[eqn]

The lengths of $[eqn]$ and $[eqn]$ are again proportional to $[eqn]$ as in Equation (16),

[eqn]

The expected length of branches $[eqn]$ and $[eqn]$ together is equal to $[eqn]$ ,

[eqn]

Finally, we can substitute expressions for the branch lengths (Equations (16)–(19)) and the topology probabilities (Equation (13)) into Equation (15) to obtain expressions for SFS entries $[eqn]$ . We then plug these quantities into Equations (7) and (8) to obtain

[eqn]

[eqn]

The variance of Ψ is then

[eqn]

Examining the expressions in Equations (20a) and (20c), we see that both the mean of Ψ and the variance of Ψ do not depend on the size $[eqn]$ of the ancestral population. We also observe that if the population sizes are equal, $[eqn]$ , then $[eqn]$ . Moreover, examination of Equation (20) can directly bound the expectation and variance. As $[eqn]$ and $[eqn]$ , we have $[eqn]$ ; for variance, we have $[eqn]$ because, on one hand, $[eqn]$ , and on the other hand,

[eqn]

for $[eqn]$ and $[eqn]$ .

Informally, we can write Equation (20a) as

[eqn]

In this simple model, the “amount of drift” is that of a neutral population of size $[eqn]$ (or $[eqn]$ ) evolving for t generations. However, by treating $[eqn]$ and $[eqn]$ as effective population sizes, a variety of demographic scenarios that include population growth or bottlenecks can be considered. In subsequent subsections, we explicitly parameterize models with population size changes and present modified versions of Equation (20).

Exponential growth

We next consider populations A and B evolving under the classic exponential growth model. A and B begin exponential growth immediately after splitting from the ancestral population C, as shown in Fig. 1b.

Let population A have size $[eqn]$ at the present time, such that its population size over time is

[eqn]

where τ is time, measured in generations from the present into the past, and $[eqn]$ is the growth rate. Equation (22) is defined such that if $[eqn]$ , then population A is increasing in size forward in time. If population A has size $[eqn]$ immediately after the split, then the growth rate can be computed from Equation (22) as

[eqn]

Slatkin and Hudson (1991) showed that for a pair of lineages, the coalescent in a growing population of size $[eqn]$ is equivalent to the coalescent in the constant population of size $[eqn]$ , with time rescaled by

[eqn]

Hence, the probability that two lineages coalesce in the first t generations in the population of size $[eqn]$ is

[eqn]

A corresponding equation holds for population B.

We can repeat the calculations of tree topology probabilities in Equations (13) and (14) by replacing the constant-size coalescence probability $[eqn]$ by the quantity in Equation (25). As a result, we obtain the following expressions for expectation and variance of Ψ under the exponential growth model:

[eqn]

[eqn]

[eqn]

where we have introduced a shorthand notation

[eqn]

[eqn]

If only one population is subject to exponential growth, then expressions for $[eqn]$ and $[eqn]$ can be found by taking the limits in Equation (26) as the growth rate approaches zero. For example, if $[eqn]$ , then we have

[eqn]

[eqn]

where again $[eqn]$ is defined by Equation (27a).

Bottleneck

For our third model, we assume that immediately after the split, population A goes through a bottleneck of length $[eqn]$ with constant population size $[eqn]$ , as shown in Fig. 1c. This type of model has been used in studies of human expansion from Africa (DeGiorgio et al. 2009, 2011).

The calculations of $[eqn]$ and $[eqn]$ are similar to those for the population split demography, except that the topology probabilities differ: we consider special cases for topologies β and γ (Fig. 2). For β, we distinguish between coalescent genealogies in which the node (B,B) is located in population B ( $[eqn]$ ) and in population C ( $[eqn]$ ). For γ, we distinguish between genealogies in which the node (A,A) occurs in population A after the bottleneck ( $[eqn]$ ), during the bottleneck ( $[eqn]$ ), and in population C ( $[eqn]$ ). The probabilities are:

[eqn]

[eqn]

[eqn]

[eqn]

[eqn]

The expressions for the moments of Ψ are

[eqn]

[eqn]

[eqn]

If $[eqn]$ , then Equation (30) reduces to Equation (20). If $[eqn]$ , then expressions in Equation (30) match Equation (20) irrespective of the value of $[eqn]$ .

Founder effect

The final model that we consider is a model that has been proposed for simplifying the modeling of founder effects. Instead of a prolonged bottleneck, we introduce an instantaneous bottleneck into population A (Fig. 1d). An instantaneous bottleneck is defined as a burst of coalescences; mathematically, two lineages going through an instantaneous bottleneck of strength s behave as if going through s (imaginary) generations of drift in the population of final size $[eqn]$ . Instantaneous bottlenecks are typically used in situations where the bottleneck is short enough such that the possibility of mutations happening during the bottleneck can be disregarded (Galtier et al. 2000; Bunnefeld et al. 2015). In practice, this scenario could correspond to a low number of lineages from population C settling the whole population A that exists after the split.

Similarly to the bottleneck demography scenario, we adjust the tree topology probabilities to reflect the demography in Fig. 1d:

[eqn]

[eqn]

[eqn]

[eqn]

[eqn]

The expressions for the moments of Ψ in this case are:

[eqn]

[eqn]

[eqn]

These expressions reflect the fact that the “strength” of the bottleneck depends on its duration $[eqn]$ and population size $[eqn]$ only through the ratio $[eqn]$ , as captured by the parameter s. If $[eqn]$ , then Equation (32) reduces to Equation (20).

Illustrations of E[Ψ] and V[Ψ]

To illustrate our theoretical expressions, we plot $[eqn]$ and $[eqn]$ for a range of parameter values. For these plots, we use the instantaneous bottleneck formulation, as it has only five parameters t, s, $[eqn]$ , $[eqn]$ , and $[eqn]$ instead of six, as in the exponential growth and bottleneck scenarios.

Figure 4 shows $[eqn]$ and $[eqn]$ for varying t and s and fixed population sizes $[eqn]$ , $[eqn]$ , and $[eqn]$ . The behavior of $[eqn]$ confirms the intuition in Equation (21): as the bottleneck strength s increases, population A accumulates larger amounts of drift due to increased probability of coalescence in the bottleneck (Equation (32a)), leading to higher positive values of $[eqn]$ . Similarly, increasing t leads to more drift in population A because $[eqn]$ , and A accumulates drift at a higher rate than B. Note that if we were to set $[eqn]$ in Equation (32a), then the value of $[eqn]$ would not depend on the time t since the bottleneck.

Values of E[Ψ] and V[Ψ] under the founder effect demography of Fig. 1d. The population sizes are constant, with NA=400, NB=600, and NC=1,000. The values are computed using Equation (32a) for E[Ψ] and Equation (32c) for V[Ψ]. a) E[Ψ] for varying s, t=10. b) E[Ψ] for varying s, t=80. c) E[Ψ] for varying s, t=120. d) V[Ψ] for varying s, t=10. e) V[Ψ] for varying s, t=80. f) V[Ψ] for varying s, t=120.

The variance $[eqn]$ also increases with both the time t since the bottleneck and the bottleneck strength s, with stronger dependence on t compared to $[eqn]$ . As s or t increases, the probability of observing a SNP of type 11, $[eqn]$ , decreases, as a type 11 SNP requires all four lineages from A and B to persist into population C without coalescing, whereas type 12 and type 21 SNPs can be produced with only three lineages persisting into population C. SNPs of type 11 can appear only in genealogies δ, $[eqn]$ , and ζ where all coalescences happen in ancestral population C, and the probability of observing these genealogies is small for large s and t. The second moment $[eqn]$ then grows large, increasing the variance.

We can use our theoretical results to analyze identifiability of demographic scenarios with the ψ index. In analyses of genetic data, a positive ψ is used to claim that the population A is located further from the source of the range expansion (Peter and Slatkin 2013), with the population with more drift experiencing bottlenecks during founder events. However, this logic does not account for the possibility that other demographic scenarios could generate an identical value of ψ. Figure 5 provides an example of this phenomenon by showing that if the population size $[eqn]$ is sufficiently small relative to $[eqn]$ , then $[eqn]$ could be negative even in the presence of a bottleneck in population A. Figure 6 shows the dependence of $[eqn]$ on $[eqn]$ and t. If $[eqn]$ is small enough in relation to $[eqn]$ , then the value of $[eqn]$ can decrease with increasing t and can even reverse its sign. The negative value of the directionality index then obscures the ancient bottleneck in A.

Theoretical values of E[Ψ] (Equation (32a)) for varying values of NA and NB in the founder effect demography of Fig. 1d. Parameters s and t are fixed, with s=20 and t=50. The black line shows parameter sets (NA,NB) for which E[Ψ]=0.

Theoretical values of E[Ψ] (Equation (32a)) for varying values of t and NB in the founder effect demography of Fig. 1d. Parameters s and NA are fixed, with s=20 and NA=500. The black line shows parameter sets (t,NB) for which E[Ψ]=0.

Sampling theory of Ψ

We have demonstrated that the expectation $[eqn]$ and variance $[eqn]$ do not depend on the ancestral population size $[eqn]$ . In this section, we show that our confidence in the value of ψ computed from SNP data does depend on $[eqn]$ through sample variance.

The random variable Ψ and its associated quantities $[eqn]$ and $[eqn]$ refer to the directionality index for a single SNP under the coalescent. In a data analysis, ψ is computed using many shared SNPs across the genome, say n. Denote by $[eqn]$ the (random) many-SNP ψ index, signifying that this quantity can be seen as the mean of many single-SNP observations of ψ. For n sampled independent shared SNPs, the central limit theorem states that the resulting $[eqn]$ approaches a Gaussian distribution with variance proportional to $[eqn]$ ,

[eqn]

As the number of sampled shared SNPs increases, the probability that $[eqn]$ is close to its mathematical expectation $[eqn]$ increases.

Under the infinitely-many-sites model, the number of shared SNPs n is itself a random variable that depends on the mutation rate μ and the ancestral population size $[eqn]$ , as shared SNPs reflect mutations in the ancestral population. More precisely, $[eqn]$ , where $[eqn]$ is the scaled mutation rate in a diploid population of $[eqn]$ individuals, and $[eqn]$ is the expected length of branches that can yield shared SNPs, in units of $[eqn]$ generations. All branches that can generate shared SNPs were identified in Fig. 2, so we can use Equations (16)–(19) to write an equation for $[eqn]$ :

[eqn]

For example, with the founder effect model, we get

[eqn]

[eqn]

The expected number of shared SNPs depends linearly on the ancestral population size, and $[eqn]$ then affects the number of shared SNPs available for empirical analyses using the directionality index.

Equation (36) specifies the dependence of the random variable n on the demographic parameters. For example, we can see that stronger bottlenecks—higher values of s—lead to an increase not just in the variance $[eqn]$ (Fig. 4), but also in the sample variance by decreasing the number of available shared SNPs, $[eqn]$ .

Application to Out-of-Africa expansion of Drosophila melanogaster

To test our coalescent-based predictions for ψ, we compute ψ for a specific demographic event in two ways. First, we use Equation (1) to compute ψ directly from genotypes in natural populations. Second, we make use of existing estimates of demographic parameters to evaluate our equations for $[eqn]$ and $[eqn]$ .

The demographic scenario we consider here is the Out-of-Africa expansion of D. melanogaster. It is generally agreed that the modern European populations trace to a small founding population as the species range expanded from Africa (e.g. Stephan and Li 2007 ; Arguello et al. 2019). As the founder event was directed from Africa to Europe, we expect to see $[eqn]$ .

ψ computed from sequence data

For our empirical computations, we evaluated ψ from the sequences of intronic and intergenic X-chromosomal loci used for demographic inference by Li and Stephan (2006), Laurent et al. (2011), and Duchen et al. (2013), originally obtained by Glinka et al. (2003) and Ometto et al. (2005). We downloaded sequences of X-chromosomal loci from the European Nucleotide Archive (ebi.ac.uk/ena), sequence IDs AJ568984 to AJ571588 and AM000058 to AM003900 (originally deposited by Glinka et al. 2003; Ometto et al. 2005).

Each locus had nucleotide data in the form of a single (haploid) genotype for a set of inbred lines from European (the Netherlands, NTH) and African (Zimbabwe, ZW) populations of D. melanogaster, as well as for a single line from a North American population of Drosophila simulans. The genetic sequence for each line was haploid due to the sequencing being performed with homozygous inbred lines. The total number of X-chromosomal loci was 229, with locus sequence lengths ranging from 210 to 784 nucleotides (median 563).

The D. simulans sequence was used in place of the ancestral genotype in the analysis. Across the 229 loci, the maximum number of lines sequenced for the Netherlands population was 12 and the minimum number of lines sequenced was 10. For the Zimbabwe population, the maximum was 12 and the minimum was 9.

Separately for each locus, we used MUSCLE v5.1 (Edgar 2004) with default settings (-perturb 0 -perm none -consiters 2 -refineiters 100) to perform a joint multiple sequence alignment for the lines from the NTH and ZW populations of D. melanogaster as well as the D. simulans line.

To compute ψ for a set of loci, we generated a sample of 1,000 sets of four lines, two from the NTH population and two from the ZW population. The sets of four were sampled with replacement, but each set had two distinct NTH lines and two distinct ZW lines (“distinct” here refers to distinct sample labels, not to distinctness of the genotypes).

For each set of four lines together with the D. simulans line, we discarded sites that had insertions or deletions in the alignment of five sequences. We next discarded invariable sites as well as sites with three or more distinct alleles. Next, we discarded sites that failed to meet a sharing criterion. In particular, we kept only those shared (biallelic) sites in the sense of the definition in Equation (1), requiring the derived allele to be present in at least one copy in both NTH and ZW populations and to be polymorphic in the pooled pair of populations. We then computed ψ using Equation (2) by sampling a single site from the final set of shared sites.

To understand the uncertainty in the ψ computation that arises from differences in evolutionary history across loci, we analyzed 1,000 bootstrap replicate datasets, where each bootstrap replicate involves a resample of 229 loci. In particular, in each bootstrap replicate, we first sampled 229 loci with replacement. Next, we generated 1,000 sets of four lines and computed ψ, as described in the previous two paragraphs.

For each bootstrap replicate, we averaged ψ over the 1,000 values, each obtained from a random set of four lineages, obtaining a mean ψ for that replicate. We also obtained a variance across the 1,000 values.

Considering the 1,000 bootstrap replicates, the median of the mean ψ values was

[eqn]

with 95% of mean ψ values lying in the interval (0.4860, 0.5710).

The median across 1,000 bootstrap replicates of the variance of ψ was equal to

[eqn]

and the interval containing the variance values from 95% of the replicates was (0.4811, 0.5666).

E[Ψ]

computed from demographic estimates

We now compare empirical ψ values with the ψ values predicted by demographic models; we use demographic models that have been inferred in studies of the European founder event in D. melanogaster.

Multiple studies have estimated population sizes and divergence times for D. melanogaster. In particular, Li and Stephan (2006) used a maximum likelihood method based on the joint SFS, and Laurent et al. (2011) and Duchen et al. (2013) used approximate Bayesian computation. All three studies used the same set of X-chromosomal sequences from Glinka et al. (2003), with the Netherlands representing Europe and Zimbabwe representing Africa.

The study of Laurent et al. (2011) incorporates Asian samples, and the study of Duchen et al. (2013) adds North American samples, but here we focus on subsets of the inferred demographic parameters in these studies, specifically on the divergence of African and European populations shared by all three studies.

The modes of parameter estimates from the three articles are summarized in Table 1. The model of Li and Stephan (2006) assumes a prolonged bottleneck of constant size in the European population. Laurent et al. (2011) and Duchen et al. (2013) instead assumed exponential growth in Europe. All three models assume constant population size in the African population after the split with the European population.

For Li and Stephan (2006), the values for the model in their Fig. 1a appear on p. 1,582–1,583; African population size is labeled $[eqn]$ by Li and Stephan (2006, p. 1,582), current European population size is $[eqn]$ (p. 1,583), bottleneck European population size is $[eqn]$ (p. 1,583), and time variables are $[eqn]$ for bottleneck length and $[eqn]$ for post-bottleneck interval length (p. 1,583). In their inference algorithm, although the data are from the X chromosome, the authors use a coalescent process with N diploid individuals and a parameterization $[eqn]$ . In our calculations in the section on “Expectation and variance of Ψ for specific demographic models,” we have also assumed that the coalescent process has N diploid individuals and $[eqn]$ lineages. Because the parameterization of Li and Stephan (2006) matches our parameterization in Equation (30), we use the values from Li and Stephan (2006) directly.

For Duchen et al. (2013), we extracted parameter values for their model C (defined in their Table S2) for the African population from their Table 4 ( $[eqn]$ ) and for the European population from their Table 5 ( $[eqn]$ and $[eqn]$ immediately after the split and at the present time, respectively, and time $[eqn]$ since the split), exponentiating values reported logarithmically. The tables of Duchen et al. (2013) report numbers of diploid individuals in a population; hence, we use their population size values directly to calculate $[eqn]$ and $[eqn]$ .

For Laurent et al. (2011), we extracted estimates of parameter values from the X-chromosome column in their Table 3 for the model in their Fig. 1. Laurent et al. (2011) assumed equal proportions of males and females in the population, explicitly considering the X chromosome, so that the total number of lineages in a population of size N is $[eqn]$ , and $[eqn]$ . To match the diploid autosomal parameterizations under which we derived our theoretical expressions in the section “Expectation and variance of Ψ for specific demographic models,” we re-scaled population sizes reported in Table 3 of Laurent et al. (2011) by multiplying them by $[eqn]$ , and we then used them to calculate $[eqn]$ and $[eqn]$ ; this procedure is equivalent to rederiving Equation (28) with $[eqn]$ in place of $[eqn]$ and then inserting the N values from Laurent et al. (2011) directly.

For the demographic parameters of Li and Stephan (2006) that used a simple bottleneck model, we used Equation (30) with $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ from Table 1. For the demographic parameters of Laurent et al. (2011) and Duchen et al. (2013), which used an exponential growth model, we used Equation (28) with $[eqn]$ , $[eqn]$ , $[eqn]$ , and the exponential growth rate $[eqn]$ computed from European population sizes using Equation (23).

The values of $[eqn]$ and $[eqn]$ computed for each set of parameters are shown in Table 1. For the Laurent et al. (2011) and Duchen et al. (2013) demographies, the value we expect from the coalescent theory— $[eqn]$ in Table 1—lies inside the 95% bootstrap interval for the value obtained directly from data in Equation (37).

The variance in our empirical calculation (Equation (38)) closely matches the values of $[eqn]$ implied in Table 1 by the demographic models, with all three values lying in the 95% bootstrap interval.

Discussion

Summary

We have examined the directionality index Ψ as a random variable under coalescent models of two populations with a shared demographic history. Using this formulation, we have derived exact values for the expectation $[eqn]$ and variance $[eqn]$ of the directionality index for four parameterizations of a population split demography. We have explored the behavior of the expectation and variance, showing the dependence of Ψ on demographic parameters and identifying parameter regions for which a positive value of ψ does not necessarily mean that the “A” population is more distant from the source of a range expansion. Our expression for $[eqn]$ also allowed us to connect the sample variance of ψ across many sites to the size of the ancestral population. Finally, we showed how our theoretical results can be used to compare the predictions of demographic models with empirical observations.

Our explorations of the theoretical behavior of ψ show that in a sample of size 4 lineages, $[eqn]$ tends to be more sensitive to changes in the bottleneck strength s and derived population sizes $[eqn]$ and $[eqn]$ than to the time t since the population split (Fig. 4a–c). The variance $[eqn]$ , however, increases quickly with increasing t (Fig. 4d–f). These results are informative for considering the effects of bottlenecks on empirical values of ψ. For example, in a model in which a bottleneck is ancient, the variance of Ψ would be larger compared to a model with a recent bottleneck.

Expressions for $[eqn]$ and $[eqn]$ (Equations (20), (26), (30), and (32)) do not depend on $[eqn]$ , the ancestral population size. This insight suggests that predictions about range expansions under the model are largely unaffected by events in the shared history of the two populations. However, $[eqn]$ does affect the number of SNPs available for empirical calculation. When data from many sites are used to compute ψ, the expected number of shared SNPs in the calculation is proportional to $[eqn]$ ; for small $[eqn]$ , we might not observe enough shared SNPs for the empirical computation of ψ to accurately reflect a model-based prediction.

Comparison of theoretical and empirical ψ for D. melanogaster

In an application to data from D. melanogaster, empirical evaluation of ψ revealed a positive value in a scenario with a European population in the role of the “A” population and an African population in the role of the “B” population. This observation is consistent with the higher level of drift in European populations of D. melanogaster than in African populations: the analysis accords with the general understanding of D. melanogaster demographic history. Further, for two of three D. melanogaster modeling studies, the empirical value of ψ matched the predictions for $[eqn]$ from the demographic models (Table 1).

Connections

Our results recapitulate some of the insights of the branching process analysis of the discrete-time expansion model of Peter and Slatkin (2015). In that model, the expectation of Ψ was found to be $[eqn]$ where N is the population size of each deme, k is the founder population size during settlement of a new deme, and t is the settlement time of the tth deme, the integer time variable that counts sequential founder events from the origin to the most recently settled deme. This expression shows that $[eqn]$ increases with smaller values for sizes of the founder populations (k) and with the number of founder events (t).

In our formulations, the founder population size corresponds to the initial population size after the split $[eqn]$ in the exponential growth model, the bottleneck population size $[eqn]$ in the bottleneck model, and the reciprocal of the bottleneck strength s in the instantaneous bottleneck model. In these cases, we have observed a similar pattern in the magnitude of $[eqn]$ , which increases with smaller $[eqn]$ (Equation (26a)), smaller $[eqn]$ (Equation (30a)), or larger s (Equation (32a), Fig. 4).

The role of the time variable t, however, differs between the model of Peter and Slatkin (2015) and our analysis. In particular, the linear chain of many populations by Peter and Slatkin (2015) gives rise to a linear dependence of $[eqn]$ on t, whereas our two-population model produces a nonlinear dependence of $[eqn]$ on t due to interactions among various demographic parameters (Fig. 6).

Our study follows a similar spirit to the work of DeGiorgio et al. (2011), who derived expressions for the distribution of pairwise coalescence times in a serial founder model with a sequence of multiple bottlenecks. Many population-genetic statistics are functions of expected pairwise coalescence times, among them $[eqn]$ (Slatkin 1991) and $[eqn]$ (Peter 2016). Because our study uses ratios of certain expected branch lengths rather than the pairwise coalescence times themselves, our expressions for $[eqn]$ and $[eqn]$ are perhaps more closely connected to analyses that focus on internal and external branch length computations and other coalescent-based branch length ratios (Fu and Li 1993; Ferretti et al. 2017; Alimpiev and Rosenberg 2022).

An additional connection to other coalescent studies (Tajima 1983; Takahata and Nei 1985; Takahata and Slatkin 1990; Szpiech and Rosenberg 2011; Rosenberg 2013; Guerra and Nielsen 2022; Peter 2022) is that we have focused on the case of 4 sampled lineages. In many problems, the four-lineage analysis is the simplest non-trivial case, it can be studied analytically, and it provides insights useful for larger samples.

Further work

Our models are focused on pairs of populations, bottlenecks, and infinitely-many-sites mutation. Extended models could potentially consider additional phenomena; for example, recurrent and reverse mutation, the influence of natural selection on coalescence times for some sites, linkage among sites, and demographies that allow migration after population divergence. In the case of migration, a more recent study of than those that underlie Table 1 suggests a high rate of back-migration of D. melanogaster from Europe to Africa (Arguello et al. 2019), so that predictions for ψ in models that include migration would be meaningful. One approach to considering migration with ψ is an extension of the discrete-time expansion model of Peter and Slatkin (2015). Including migration between demes after founding events and exploring its impact on ψ is possible in a simulation-based extension of their model (Kemppainen et al. 2024).

In the framework of our theoretical analysis with four lineages and two populations, migration would allow for private mutations that appeared in population A to be introduced into population B, decreasing the value of ψ. Topologies that are more likely to generate shared mutations (such as δ, $[eqn]$ , and ζ in Fig. 2) would be observed more often, due to lineages being transported between populations in the time since the split between populations A and B, altering the values of $[eqn]$ and $[eqn]$ . The theory could potentially be pursued by adding ψ to coalescent models that allow for post-divergence migration (Wakeley 1996 ; Rosenberg and Feldman 2002; Teshima and Tajima 2002; Wilkinson-Herbots 2008; Hobolth et al. 2011; Wilkinson-Herbots 2012).

Bibliography61

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Achaz G . 2009. Frequency spectrum neutrality tests: one for all and all for one. Genetics. 183:249–258. 10.1534/genetics.109.104042.19546320 PMC 2746149 · doi ↗ · pubmed ↗
2Alimpiev E, Rosenberg NA. 2022. A compendium of covariances and correlation coefficients of coalescent tree properties. Theor Popul Biol. 143:1–13. 10.1016/j.tpb.2021.09.008.34757022 PMC 9731325 · doi ↗ · pubmed ↗
3Arguello JR, Laurent S, Clark AG. 2019. Demographic history of the human commensal Drosophila melanogaster. Genome Biol Evol. 11:844–854. 10.1093/gbe/evz 022.30715331 PMC 6430986 · doi ↗ · pubmed ↗
4Bunnefeld L, Frantz LAF, Lohse K. 2015. Inferring bottlenecks from genome-wide samples of short sequence blocks. Genetics. 201:1157–1169. 10.1534/genetics.115.179861.26341659 PMC 4649642 · doi ↗ · pubmed ↗
5Caicedo AL et al 2007. Genome-wide patterns of nucleotide polymorphism in domesticated rice. P Lo S Genet. 3:e 163. 10.1371/journal.pgen.0030163.17907810 PMC 1994709 · doi ↗ · pubmed ↗
6De Giorgio M, Degnan JH, Rosenberg NA. 2011. Coalescence-time distributions in a serial founder model of human evolutionary history. Genetics. 189:579–593. 10.1534/genetics.111.129296.21775469 PMC 3189793 · doi ↗ · pubmed ↗
7De Giorgio M, Jakobsson M, Rosenberg NA. 2009. Explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa. Proc Natl Acad Sci U S A. 106:16057–16062. 10.1073/pnas.0903341106.19706453 PMC 2752555 · doi ↗ · pubmed ↗
8Deshpande O, Batzoglou S, Feldman MW, Cavalli-Sforza LL. 2009. A serial founder effect model for human settlement out of Africa. Proc R Soc Lond B Biol Sci. 276:291–300. 10.1098/rspb.2008.0750.PMC 267495718796400 · doi ↗ · pubmed ↗