Not Just Ne Ne-More: New Applications for SMC from Ecology to Phylogenies
David Peede, Trevor Cousins, Arun Durvasula, Anastasia Ignatieva, Toby G L Kovacs, Alba Nieto, Emily E Puckett, Elizabeth T Chevy

TL;DR
This paper explores new uses of the SMC model in genetics, expanding its application to include ecological and evolutionary factors.
Contribution
The paper introduces recent extensions to the SMC model, enabling it to account for gene flow, structural variation, and ecological traits.
Findings
SMC-based methods now incorporate isolation with migration and multi-species coalescent models.
Recent computational advances allow SMC to better handle ecological life history traits like selfing and dormancy.
The paper highlights biological discoveries made possible by these extended SMC methods.
Abstract
Genomes contain the mutational footprint of an organism’s evolutionary history, shaped by diverse forces including ecological factors, selective pressures, and life history traits. The sequentially Markovian coalescent (SMC) is a versatile and tractable model for the genetic genealogy of a sample of genomes, which captures this shared history. Methods that utilize the SMC, such as PSMC and MSMC, have been widely used in evolution and ecology to infer demographic histories. However, these methods ignore common biological features, such as gene flow events and structural variation. Recently, there have been several advancements that widen the applicability of SMC-based methods: inclusion of an isolation with migration model, integration with the multi-species coalescent, incorporation of ecological life history traits (such as selfing and dormancy), and many computational advances in…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig. 1
Fig. 2
Fig. 3
Fig. 4|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| Set |
- —University of Southern California10.13039/100006034
- —NIH10.13039/100000002
- —Horizon 202010.13039/100010661
- —Marie Skłodowska-Curie
- —Brown University Predoctoral Training Program in Biological Data Science
- —Blavatnik Family Foundation Graduate Fellowship
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsEvolution and Genetic Dynamics · Genetics, Aging, and Longevity in Model Organisms · Genetic diversity and population structure
Introduction
Detecting signals of historic demographic events, such as population size changes, given DNA samples from a population is a fundamental goal of population geneticists, natural historians, and conservation biologists alike. The coalescent (Kingman 1980, 1982a, 1982b) and its subsequent extensions that account for recombination (Hudson 1983; Griffiths 1991) provided the probabilistic foundations for the development of statistical inference methods that estimate demographic parameters from inferred sample genealogies—for a comprehensive overview, see Wakeley (2008). These coalescent models trace ancestral relationships backwards in time for a set of sampled lineages, where a coalescence event occurs when two lineages merge into a single ancestral lineage at some point in the past. A key parameter of interest is the effective population size ( ), which can be conceptualized as the size of an idealized Wright-Fisher population that matches the coalescence rate of the population under study (Fisher 1923; Wright 1931; Charlesworth 2009). An alternative yet convenient perspective is to define as the inverse of the coalescence rate, as this definition can flexibly encapsulate various properties of an evolutionary model.
For nonrecombining segments of DNA—such as mitochondrial DNA (mtDNA) and chloroplast DNA (cpDNA)—the genealogical history of a sample can be represented by a single tree structure. However, a recombination event will split a segment of DNA into two, resulting in distinct genealogical histories for the two recombinant segments. The coalescent with recombination (CwR) was originally formulated as a backwards in time process that models the ancestral relationships for a set of samples while accounting for recombination events (Hudson 1983). The CwR can be conceptualized in two distinct ways: as a collection of local trees or as a graph structure known as the ancestral recombination graph (ARG; Griffiths 1991; Griffiths and Marjoram 1996, 1997). The ARG can encode both coalescence events (where lineages merge) and recombination events (where lineages split), thereby describing the complete evolutionary history for a set of samples at every genomic position. In contrast to its initial formulation, the CwR can alternatively be viewed as a sequential process along the genome rather than backwards in time (Wiuf and Hein 1999a, 1999b). In this perspective, a local genealogy persists along the genome until a recombination event induces a change in tree structure. Inferring the latent ARG of a set of sequences is notoriously difficult, due to the observed data (i.e., the distribution of mutations) not being not very informative about the underlying genealogical history, combined with the fact that the space of possible genealogical histories is extremely large.
To overcome the computational challenges of inference under the CwR, McVean and Cardin (2005) proposed the sequentially Markovian coalescent (SMC) as a tractable approximation to this model, which was further refined by Marjoram and Wall (2006). The SMC framework ushered in a new suite of methods based on coalescent Hidden Markov Models (HMMs; Hobolth et al. 2007; Li and Durbin 2011; Schiffels and Durbin 2014) designed to infer demographic parameters from genomic data. Notably, Li and Durbin’s PSMC method leveraged the SMC in the transition matrix of its HMM to infer the trajectory of humans over time, using a diploid sequence of just one individual. Although the SMC' framework follows the assumptions of the standard coalescent, recent theoretical work has extended the SMC to account for factors such as migration (Wang et al. 2020), structural variation (Ignatieva et al. 2025), and selfing (Strütt et al. 2023). These extensions modify the SMC’s probabilistic model or associated HMM approach to infer quantities beyond .
This review presents these recent extensions to the SMC that broaden its relevance to fields such as natural history and conservation biology. After a brief primer describing the fundamentals of the SMC framework, we divide its extensions into three broad classes: those doing joint parameter inference, those doing lineage partitioning, and those building upon the SMC to explicitly infer sample genealogies (in the form of ARGs). For each class, we describe how recent methods extend the SMC, then review examples of their application to empirical (or simulated) data. Finally, as plausible uses of the SMC expand across fields we discuss how empiricists can be best served by these new approaches, and where we expect the SMC framework will continue to prove particularly useful.
SMC Primer
We begin by briefly defining the SMC model, describing how the SMC is used for inference in the PSMC framework, and reviewing subsequent methodological advances for inferring historical effective population sizes and coalescence rates using SMC-based approaches.
Coalescent Theory Primer
Kingman’s coalescent (Kingman 1980, 1982a, 1982b) is a probabilistic model that describes the ancestral relationships at a single nonrecombining locus for a set of lineages in an idealized Wright-Fisher population (i.e., panmictic with constant size, nonoverlapping generations, and no selection) of N diploid individuals. It arises as the continuous-time approximation to the discrete-generation Wright-Fisher model when the limit , providing a tractable framework for modeling ancestral relationships when N is large. In this limit, rescaling time (t) leads to a continuous-time process, where time is measured in coalescent units going backwards in time (i.e., is the present). This can be converted back to time measured in generations through multiplying by a factor of , the expected time for a pair of lineages to coalesce. Note that, for real populations, the effective population size is in general not equal to the census size of a population, reflecting the fact that the accumulation of genetic diversity is affected by violations of the neutral Wright-Fisher model. Note also that, while the Wright-Fisher model is one of the simplest examples for which sample genealogies converge to Kingman’s coalescent, this is also generally the case for a broad class of finite population models (Möhle 2000).
The relationship among sequences sampled from this population can be described with a genealogical tree. It traces the ancestry for a set of lineages backwards in time, and when a pair of lineages share a common ancestor they are said to coalesce, which is represented through the two lineages merging in the tree. The coalescent characterizes the distribution over the shapes of such trees, and the waiting times between coalescence events. When considering a sample of size two, there is only one possible tree topology, so the coalescent fully characterizes the sample genealogy through the distribution of the time of their most recent common ancestor.
The Coalescent with Recombination (CwR)
While Kingman’s coalescent can be effectively applied independently to unlinked loci, nearby loci have highly correlated genealogies, so the effects of recombination must be modeled explicitly. The CwR was first formulated by Hudson (1983) as a stochastic process backwards in time, which allows for both coalescence and recombination events to occur, resulting in a collection of local trees. Further developments introduced an alternative graph structure to represent a realization of this stochastic process, where lineages can both merge due to coalescence and split due to recombination (Griffiths 1991; Griffiths and Marjoram 1996, 1997). This graph structure, known as the ARG, encodes the complete evolutionary history of a sample, including the marginal genealogy at each genomic position (Griffiths 1991; Griffiths and Marjoram 1996, 1997). Rather than representing the CwR as a stochastic process backwards in time, Wiuf and Hein (1999a) reframed it as a spatial process that sequentially generates correlated local genealogies along the chromosome. Under this spatial formulation, the local genealogy at any given position depends on all preceding genealogies, making the process non-Markovian. This introduces long-range dependencies between genomic regions where nonancestral segments are “trapped” by flanking ancestral segments, rendering the spatial formulation challenging for tractable inference.
The ARG has been described as “the holy grail of statistical population genetics” because it can fully encode the entire genealogical history of the sample, including the complete record of which lineages experienced recombination and the time when these events occurred (Hubisz and Siepel 2020). Given a set of sampled genomes, it would be desirable to infer an ARG that could recover the topologies of local genealogies, the timing of coalescence events, the ages of mutations, and the locations and times of recombination events. However, the complexity and size of the space of possible ARGs make population genetic inference under the CwR extraordinarily challenging (Griffiths and Marjoram 1996; McVean and Cardin 2005). Even when considering a single locus, the number of possible coalescence topologies for n sequences is (Hein et al. 2004, Section 3.2), which grows super exponentially, making likelihood calculations for an ARG infeasible for realistic sample sizes and sequence lengths. Indeed, algorithms that compute likelihoods without relying on approximations to the CwR exist for very small datasets (e.g., ARGinfer; Mahmoudi et al. 2022); nevertheless, for most practical applications, simplifying the structure of the ARG and approximating the full CwR remains necessary for tractable and computationally scalable inference.
SMC as an Approximation to the CwR
McVean and Cardin (2005) introduced the SMC as a model to sequentially generate genealogies along a chromosome. In this sequential view, given a recombination event, the SMC permits the “floating” lineage—i.e., the branch below the recombination event—to re-coalesce with any older lineage on the tree except for itself. Like the CwR, the SMC can also be viewed from a backwards in time perspective, where lineages can only coalesce if they share overlapping ancestral material. The SMC is thus a Markov process both backwards in time and along the genome, since the genealogy at any given genomic position depends only on the genealogy at the previous position. Despite this simplification, the SMC generates genealogies with very similar structure, correlation, and patterns of genetic diversity as the CwR, while being much more computationally tractable for inference (McVean and Cardin 2005).
Marjoram and Wall (2006) introduced a modification to the SMC, known as the SMC' (Fig. 1), which incorporates an additional class of coalescence events, making it a closer approximation to the CwR while still being Markovian. Specifically, in the sequential view, the SMC' also allows the “floating” lineage to re-coalesce with itself, which is not permitted in the original SMC. From the backwards in time perspective, the SMC' allows coalescence between lineages with overlapping or adjacent ancestral material, while the SMC only permits coalescence between lineages with overlapping ancestral material. This difference means that after a recombination event, the SMC always generates a new marginal genealogy distinct from the previous one, whereas the SMC' allows for self-coalescence that may preserve the previous genealogical structure in the new marginal genealogy (Marjoram and Wall 2006).
Illustration of the SMC. a) SMC process as a sequence of genealogical trees over a genomic region of length L. Each local tree persists until a recombination event (grey square node g) occurs at a breakpoint bi. The branch between the recombination node and the “floating lineage’s” parent node is then pruned. The genealogy transitions to the next tree along the sequence when the floating lineage is regrafted (colored arrow) via a coalescence event (circle c node) colored by the types of recombination events described in b. Dashed horizontal lines at the height of ancestral nodes I, H, and G, highlight that the new genealogies are formed due to the coalescence at the preceding locus. b) Possible types of genealogical transitions conditioned on the recombination event (grey square node g). The floating lineage coalesces with one of its ancestral lineages to form the next genealogy along the sequence (see a and c). If the floating lineage coalesces with the lineage highlighted in sky blue, the next tree remains unchanged. Coalescence with the lineage highlighted in orange alters the height of the next tree, whereas coalescence with one of the lineages highlighted in green changes the next tree’s topology. Where the floating lineage coalesces depends on the length of each epoch (Ti, delineated by alternating backgrounds) and the number of ancestral lineages i. The histogram shows the number of lineages i available for coalescence in each epoch Ti, starting from the time of the recombination event g; the coalescence rate in epoch Ti is i/2Ne. c) Illustration of an ARG for the corresponding genealogical trees depicted in a. Black circle nodes A-D represent sample nodes, and the remaining ancestral black nodes are alphabetically labeled in ascending order based on relative age. Grey square nodes denote recombination events and are labeled by their corresponding breakpoint bi in a. The left and right edges emerging from recombination nodes correspond to the local genealogies to the left and right of breakpoint bi. The orange edge represents the recombination event in the first tree, which shortens the height of the second tree. The green edge represents the recombination event in the second tree, which alters the topology of the final tree.
The key insight of both the SMC and the SMC' models is that by making each local genealogy dependent only on the genealogy at the previous locus, the process becomes Markovian. This Markovian approximation substantially reduces the space of possible ARGs, providing a much more tractable framework for statistical inference. Moreover, by studying the joint distribution of coalescence times in a two-locus Markov chain model, Wilton et al. (2015) and Hobolth and Jensen (2014) demonstrated that both the SMC and SMC' are good approximations to the full ARG, with the SMC' being notably more accurate. Furthermore, they showed that the joint distribution of pairwise coalescence times at sites flanking recombination breakpoints under the SMC' is identical to that under the full ARG (Wilton et al. 2015). For these reasons, the SMC' has become the more popular choice for inference methods, and throughout this review, when we refer to “the SMC” and “SMC-based methods” we are, more formally, discussing the SMC'. We describe the SMC' algorithm in Algorithm 1; we note that reversing the order of steps 4 and 5 yields the original SMC formulation described in McVean and Cardin (2005).
Algorithm 1:: Generating genealogies under the SMC' (Marjoram and Wall 2006).
Statistical Inference Using the SMC
The most common application of the SMC is to estimate changes in over time from sequencing data. To infer model parameters under the SMC conditional on the data, methods such as PSMC or MSMC (and many others reviewed in Sections 2 and 3; Li and Durbin 2011; Schiffels and Durbin 2014) use a Hidden Markov Model (HMM). HMMs are a powerful and flexible inferential framework; indeed, one of the most useful features of the SMC approximation to the CwR is that it is Markovian and therefore allows the use of HMMs—for more about HMMs in this context, see Durbin et al. (1998).
PSMC models the density of heterozygous sites along the genome to infer the coalescence time at every genomic position using sequence data from a single diploid individual (Fig. 2). Since only two sequences are analyzed at a time, each genealogical tree along the sequence contains only two lineages and one coalescence time—i.e., the time to the most recent common ancestor (tMRCA) for a diploid individual’s two homologous chromosomes. Time—the axis along which lineages in each tree coalesce backwards in time from the present—is partitioned into discrete time intervals to form the hidden state space of a discrete HMM. The observations in the data are whether or not alleles match (i.e., the genotype is homozygous or heterozygous) at each site in the sequence, where a heterozygous site indicates that a mutation arose more recently than the tMRCA of the individual’s two haploid lineages. The emission probabilities, based on the infinite sites mutation model, describe the probability that the observation is homozygous or heterozygous, given that the lineages coalesced in a particular time interval, and are determined by the population-scaled mutation rate . The transition probabilities, derived from the SMC framework, describe the probability that a locus whose lineages coalesced in one particular time interval is adjacent to a locus whose lineages coalesced in the same or different time interval, and are governed by the population-scaled recombination rate and the given evolutionary model.
Overview of PSMC framework. a) PSMC is a Hidden Markov Model (HMM) used to infer piecewise constant historical population sizes from the spatial distribution of heterozygous sites between two haploid sequences (grey lines on the x-axis plane). Hidden states are discrete intervals of times to the most recent common ancestor (tMRCA), shown as dashed lines on the y-axis plane. Transitions between hidden states represent changes in the tMRCA of the two sequences, reflective of ancestral recombination events (dotted vertical lines). Observations consist of the sequence of diploid genotypes at each site, where a yellow sphere indicates a heterozygous site (i.e., the mutation occurred after the tMRCA) and its absence indicates a homozygous site. Emissions correspond to the probability of observing these mutations given a tMRCA for the two sequences. b) Using the Baum-Welch algorithm to estimate model parameters, PSMC reconstructs a piecewise constant ancestral population size history. The crux of PSMC’s inference relies on the inverse relationship between the frequency of coalescence events (grey dashed line) in each time interval (alternating background) and the effective population size (black line) during that time period.
PSMC uses the Baum-Welch algorithm—a special case of the expectation-maximization algorithm—to estimate its parameters, namely the piecewise constant coalescence rates and the population-scaled recombination rate ρ. In the expectation step, the forward-backward algorithm computes the posterior probability of coalescence in each hidden state for each observation, this posterior distribution is then used to construct the expected transition matrix. In the maximization step, the expected transition matrix is used to update the model parameters. The updated parameters are subsequently used to recompute the expected transition matrix, and the process iterates until convergence. The associated time complexity scales linearly with the number of observations (i.e., sequence length) and quadratically with the number of hidden states. After convergence is achieved, is estimated from the PSMC’s piecewise constant coalescence rates as the inverse of the coalescence rate, after appropriately scaling the time units using the per-site per-generation mutation rate μ and generation time.
Inferring Demographic Histories
The ability to infer changes in historical population sizes from a single diploid genome sequence has made PSMC a very commonly used tool in the field of evolutionary genetics (Hilgers et al. 2025). Building on this foundation, several methods have extended PSMC’s capabilities to improve inference of historical population sizes over time. The multiple sequentially Markovian coalescent (MSMC) was introduced as an extension that takes multiple (typically as many as eight) diploid genome sequences as input (Schiffels and Durbin 2014). MSMC estimates the time of the first coalescence event among all input sequences, offering two key advantages over its predecessor: for sequences sampled from the same population, the larger sample size enables greater resolution in population size inference for more recent time periods; for sequences sampled from two different populations, it infers the relative cross coalescence rate—a measure of genetic separation between populations based on the ratio between cross-population and within-population coalescence rates. MSMC also employs a more accurate approximation to the CwR (Marjoram and Wall 2006, i.e., SMC') compared to the original PSMC implementation (McVean and Cardin 2005; Li and Durbin 2011). Building on MSMC, MSMC2 was introduced as a simpler yet more powerful approach (Wang et al. 2020). Rather than inferring the first coalescence event among input sequences, MSMC2 runs PSMC on all possible pairs of sequences and uses composite-maximum likelihood to infer model parameters.
Another class of methods leverages the sequentially Markov conditional sampling distribution (SMCSD), which describes the probability of observing a new haplotype given a set of previously observed haplotypes and a demographic model (Paul and Song 2010; Paul et al. 2011; Steinrücken et al. 2013). diCal employs the SMCSD to infer recent population size history using up to 10 genomes (Sheehan et al. 2013), while diCal2 extends this framework to infer more complex demographic histories involving population splits, admixture, and migration (Steinrücken et al. 2019). SMC_++_ builds on these approaches by combining the SMCSD with explicit modeling of the site frequency spectrum (SFS), enabling analysis of hundreds of unphased genomes (Terhorst et al. 2017). Recently, Terhorst (2025) introduced PHLASH, a Bayesian extension of PSMC and SMC_++_ that infers historical population size changes by placing a prior on the discrete time intervals and sampling from the posterior distribution, thereby removing the need for users to discretize time.
Several other methods have introduced novel approaches to infer population size histories under the SMC framework. SMCSMC uses sequential Monte Carlo to estimate the posterior distribution of the hidden continuous time Markov process from four diploid human genomes, and avoids biases in the expectation-maximization algorithm by employing variational Bayes to model uncertainty in rare events (Henderson et al. 2021). CHIMP scales to hundreds of genomes by modeling the latent space as either the height or length of the tree, obtaining the Markov transition probabilities through numerically solving systems of differential equations, and using composite likelihood to scale to large sample sizes (Upadhya and Steinrücken 2022). Additionally, a Gaussian process-based Bayesian nonparametric method has been developed that avoids discretization of the parameter space while providing uncertainty estimates for inferred parameters (Palacios et al. 2015).
Inferring Coalescence Times to Study Natural Selection
While SMC-based methods have primarily focused on inferring demographic histories, the sequence of coalescence times inferred along the genome also provides valuable information for studying natural selection. For example, regions under recent positive selection are enriched for young tMRCAs, whereas regions under long-term balancing selection are enriched for ancient tMRCAs.
PSMC’s posterior decoding scales quadratically in the number of hidden states, making it impractical to apply pairwise to a large number of sequences. By exploiting symmetries in the discretized SMC transition process, Harris et al. (2014) showed that the PSMC algorithm can be decoded in linear time. The Ascertained Sequentially Markovian Coalescent (ASMC; Palamara et al. 2018) utilizes this linear decoding, combined with efficient dynamic programming, to run pairwise tMRCA inference on biobank scale data, finding novel signatures of recent positive selection and illuminating the landscape of selection on complex traits. Further improving the efficiency of PSMC’s posterior decoding (Schweiger and Durbin 2023) introduced Gamma-SMC. Whereas ASMC’s decodes in linear time, Gamma-SMC decodes in constant time. This speedup is accomplished by representing the hidden distribution over the coalescence times with a Gamma distribution, and computing a forward/backward pass through the observed data by using a flow-field to approximate the transition density. Furthermore, the gamma representation allows the tMRCAs to be modeled in continuous time and is thus more robust to misspecification.
SMC with Joint Parameter Inference
Several extensions of the SMC framework can infer not just a vector of inverse coalescence rates (i.e., ) through time, but also vectors of other biological parameters of interest, including mutation and recombination rates, mating system properties, and sequence constraint (Fig. 3).
Illustrative example of joint parameter inference. The Hidden Markov Model (HMM) infers both a discretized coalescence time history (vector of τi) and a discretized recombination rate map (vector of ρh and ρl, indicating high and low recombination respectively). Hidden states are tuples of all combinations of coalescence time bins τi and recombination rate bins ρj. An HMM path is depicted underneath the observed diploid genotypes (“0” for homozygous and “1” for heterozygous), which corresponds to a sequence of hidden states (pairs of {τi,ρj}) emitted by the sequence of observations.
SMC-based Inference with Mutation and Recombination Rate Heterogeneity
PSMC brought large inferential power to genome projects by allowing the estimation of historical population sizes from a single unphased diploid genome. The integrative Sequentially Markovian Coalescent (iSMC) extends what can be learned from such data by extending the SMC framework to also estimate a recombination map along the genome (Barroso et al. 2019). In the original PSMC framework, the transition rates between coalescent time bins assumes a constant, global recombination rate parameter. iSMC models spatial variation in recombination rate using a Gamma distribution with k discretized categories, allowing for the inference of local recombination rates along the genome.
iSMC is not only able to distinguish between the signals of variable population size and recombination rate in SNP data, but provides more accurate estimates when inferring both concurrently. Although iSMC produces lower resolution recombination maps than linkage disequilibrium (LD) based methods, iSMC is more accurate when only a small number of samples are available, or in the presence of historical population size variation (Barroso et al. 2019; Dutheil 2024). However, iSMC introduces biases if the number of recombination categories k is poorly selected (Dutheil 2024). iSMC is also limited to estimating the background recombination rate and cannot detect finer-scale variation such as recombination hotspots. Further development is needed for small genomes and those with high gene conversion rates (Schweizer et al. 2021; Dutheil 2024).
iSMC’s ability to analyze single genomes has enabled the estimation of recombination maps for individuals from diverse human populations, as well as for archaic hominins like Neanderthals and Denisovans (Barroso et al. 2019). These findings show that the recombination landscape divergence aligns closely with the divergence of both extant and extinct lineages. This framework has also enabled the first recombination map estimates for a wide range of nonmodel and threatened species (Robinson et al. 2021; Nouhaud et al. 2022; Cui et al. 2024; da Fonseca et al. 2024; Unneberg et al. 2024). These recombination maps have helped highlight the significance of genome-wide recombination rate variation in local genome evolution. For example, iSMC has been used to show that low-recombining genomic regions in small pelagic European sardines contained increased genetic differentiation between populations, suggesting that local recombination rates can influence population structure (da Fonseca et al. 2024). Low-recombining genomic regions have also been shown to be of particular concern in threatened species, as they are more vulnerable to the effects of linked selection. Directional selection reduces at linked sites due to genetic hitchhiking (i.e., Hill-Robertson effect; Charlesworth et al. 1993; Nordborg et al. 1996), reducing local levels of genetic diversity proportionally to the strength of selection and inversely to the local recombination rate (Charlesworth 2009). For regions of the genome with lower recombination rates, this effect is particularly pronounced, where linked selection can create long-lasting reductions in both and genetic diversity. Low-recombining regions were more likely to contain runs of homozygosity, a common conservation measure used to indicate inbreeding, in the critically endangered Chinese Bahaba fish (Cui et al. 2024). These regions were also more likely to contain high-impact substitutions and highly disruptive variants. Higher recombination rates in smaller chromosomes and distal regions of larger chromosomes helped explain the increased heterozygosity in these regions in the critically endangered California condor (Robinson et al. 2021). However, purging of deleterious alleles, a common advantage to prolonged small , appears to be more efficient in low-recombining regions in hybrid ant genomes (Nouhaud et al. 2022).
iSMC has been extended to estimate mutation rates across the genome alongside demographic histories and recombination rates (Barroso and Dutheil 2023). This approach helps partition the relative contributions of genetic drift, linked selection, and local mutation rates to the evolution of genetic diversity. For instance, this extension revealed that local mutation rates primarily drive patterns of genetic diversity in Drosophila melanogaster, with linked selection playing a secondary role (Barroso and Dutheil 2023). While iSMC models heterogeneity in mutation rate across the genome as additional hidden states using a Gamma distribution with m discretized categories in an HMM, other strategies have been developed to overcome PSMC’s assumption of a genome-wide constant mutation rate.
Building on this idea, PSMC+ was introduced by Cousins et al. (2024a) to model the effects of background selection, by extending PSMC’s inference framework to explicitly account for local variation in mutation rates and its impact on estimates of . A classic model of the effect of background selection on pairwise diversity re-scales the effective population size by a factor B to account for loss of diversity due to linkage with alleles experiencing purifying selection (Charlesworth et al. 1993). As mentioned in Section 1.4, PSMC’s emission probabilities are governed by the population-scaled mutation rate, which is assumed to be constant along the genome, and describes the probability of observing a mutation given a coalescence time. PSMC+ modifies PSMC’s emission model to account for local variation in mutation rate by scaling θ by a factor of for each observation i, which reduces bias for inferring if the true scaling factor is known. In doing so, PSMC+’s emission probabilities now describe the probability of observing a mutation given a coalescence time and local mutation rate scaling factor. The authors show via simulations that this procedure improves the accuracy of estimates of through time in the presence of background selection. Using a simple B-map based on distance to exons, or simply rescaling post hoc, performed similarly to the high-resolution Murphy B-map in humans, suggesting that minimal prior knowledge of background selection is required. When not accounting for background selection, the ratio of effective population sizes between X-chromosomal and autosomal regions varies substantially over time, although this could be due to life history or mutation rate variation (Cousins et al. 2024a). It should be noted that while Cousins et al. (2024a) primarily focused on the effect of background selection, PSMC+’s emission model can be used to account for any local variation in mutation rate.
Detecting Life History Traits with the SMC
SMC-based methods inherit neutral assumptions from Kingman’s coalescent (Section 1.1). However, many systems—especially plants and invertebrates—exhibit ecological life history traits that violate two key assumptions the Wright-Fisher model: sexual reproduction through random mating, and nonoverlapping generations. Recent extensions to the SMC framework have addressed these challenges, yielding effective models for phenomena such as self-fertilization and seed- (or egg-) banking.
Self-fertilization is a nonrandom mating system prevalent in plant and fungal species in which an individual’s gametes are more likely to fuse with its own than those of another individual (Nordborg 2000). Self-fertilization reduces genetic diversity and thus decreases the effective population size, resulting in genealogical trees with shorter branches and fewer mutations and recombination events than those generated under the standard SMC (Nordborg 2000). Seed (or egg) banks are an adaptation to unpredictable environments whereby seeds remain dormant for multiple generations before germinating (Cohen 1966). Seed banks break the assumption of nonoverlapping generations by allowing past generations to contribute genetically to the present. This increases the effective population size, resulting in genealogical trees with longer branches and more mutations and recombination events than those generated under the standard SMC (Lennon et al. 2021).
To explicitly infer these life history traits alongside , Sellinger et al. (2020) developed the ecological sequentially Markovian coalescent (eSMC). The eSMC can be viewed as a rescaled extension of PSMC: it leverages the fact that self-fertilization and seed banking alter the recombination and mutation processes in opposite directions and distort the ratio of recombination to mutation rates. By exploiting this deviation, the eSMC is able to jointly infer a piecewise constant demographic history along with either a fixed self-fertilization rate or a germination rate.
Applying eSMC to Swedish and German Arabidopsis thaliana populations, Sellinger et al. (2020) showed that self-fertilization rates introduce minimal bias into historical estimates compared to traditional SMC-based inference methods (Durvasula et al. 2017). Moreover, the eSMC rejected the previously-proposed existence of seed banks in these populations. In contrast, when applied to Daphnia pulex, eSMC inferred an egg bank lasting 3–18 generations, consistent with empirical observations (Brendonck and De Meester 2003). A standard SMC approach neglecting egg-banking would have reported a biased trajectory, and obscured the role of egg banks in maintaining genetic variation.
The eSMC has also been extended to allow for the joint inference of historical population sizes, recombination rates, and selfing (or germination) rates by allowing these parameters to vary through time (teSMC; Strütt et al. 2023). This enables the dating of transitions from outcrossing to selfing based on temporal changes in the ratio of recombination to mutation rate. Returning to A. thaliana, Strütt et al. estimated that a transition to self-fertilization occurred approximately 600 kya. Since teSMC can distinguish historical changes in from changes in self-fertilization rates, it also detected a previously-unrecognized population decline coinciding with the transition from outcrossing to selfing. This demonstrates how incorporating time-varying vectors of ecologically-relevant parameters can enhance the resolution of SMC-based inference and our understanding of mating system evolution.
eSMC has also been extended to accommodate systems where a few individuals produce a large proportion of offspring (SMβC; Korfmann et al. 2024). This can be caused by skewed offspring distributions—as are seen in plants, invertebrates, prokaryotes, and fish—or strong selection events. SMβC employs the β-coalescent to allow long range dependencies between coalescent trees along the genome, which are indicative of multiple merger events. SMβC has been shown to distinguish between the influences of skewed offspring distribution and positive selection while simultaneously modeling historical population sizes to a high degree of accuracy (Korfmann et al. 2024).
SMC with Lineage Partitioning
Under the original SMC model, any lineage may undergo recombination, and that lineage is permitted to re-coalesce with itself or any (older) lineages with overlapping or adjacent ancestral material in the current genomic interval. However, it is possible to define an SMC-like process where lineages are partitioned into different labeled classes—e.g., by carrier status for a chromosomal inversion, or membership of separated populations. In these models, lineages are imagined to have different colors throughout time, where the color is reflective of the lineage’s class membership at any given time. Recombination and/or coalescence events can then be modulated to occur at different rates between lineages of the same or different color. We refer to this type of modification as the SMC with lineage partitioning (Fig. 4).
Illustrative example of lineage partitioning. a) Lineages are partitioned into colored classes (vermilion or blue) where cross-color coalescence is restricted until after the barrier (black dashed line) arose. b) Opportunities for coalescence in a lineage partitioning model, conditioned on a recombination event represented by the grey square node g, with outcomes stratified by the colored class of g (vermilion or blue). Starting from the time of the recombination event, the shading behind each branch indicates the colored class identity of the ancestral lineage available for coalescence. With respect to each colored class, where the floating lineage coalesces depends on the duration of each epoch (Ti, delineated by alternating backgrounds), and the number of ancestral lineages from the same colored class available during that epoch. The histogram shows the number of ancestral lineages i available for coalescence in each epoch stratified by colored class, Ti, which is proportional to the coalescence rate i/2Ne.
Modeling and Detecting Structural Variation with the SMC
Peischl et al. (2013) introduced an SMC variant that uses a lineage partitioning approach to model polymorphic inversions. Each lineage is assigned to a specific color—standard or inverted—and coalescence events are restricted to occur only within a color. While coalescence is not allowed between lineages of different colors, recombination events are not strictly limited to lineages of the same color. Accordingly, the model incorporates both homokaryotypic recombination, where lineages share the same chromosomal arrangement, and heterokaryotypic recombination, in which gene flux (i.e., the movement of genetic material between different chromosomal arrangements; Navarro et al. 1997) is modulated by the location of the event within the inversion (Novitski and Braver 1954). The latter occurs at a much lower rate than the former, but can still occur through mechanisms such as multiple crossovers (Ashburner 1989) or gene conversion (Chovnick 1973). Separating lineages by color reflects the underlying biology of inversions, where recombination is highly suppressed between inversion carriers and noncarriers.
Although this model has only been derived for a single population, and not yet applied to any empirical data, it has generated novel results about LD. Peischl et al. (2013) studied the correlation in coalescence times at pairs of sites across an inverted region generated by their method. They found that, unlike standard models of noninverted regions, where LD decays with the physical distance between sites, an inverted region can maintain long-distance associations between sites. This pattern of LD arises because gene flux events are effectively migration events between the standard and inverted classes of lineages, breaking down the associations between sites. However, unlike migration events, the rate of gene flux events varies spatially along the inversion. Gene flux is more likely to occur in the interior of the inversion, and less likely near the breakpoints. By extending the SMC to model inversions, Peischl et al. (2013) demonstrated that the strength of LD between sites on an inversion depends on both the physical distance between sites and their spatial location on the inversion.
In contrast to modeling the suppression of recombination between inversion carriers and noncarriers, Ignatieva et al. (2025) used the SMC to detect distortions in genealogies resulting from the localized suppression of recombination induced by structural variants (SVs), thereby enabling their identification and analysis in empirical data. The crux of the method (implemented in the software tool DoLoReS) is that, in the case of an inversion, recombination is suppressed in individuals with only one copy of the inverted orientation. If a recombination event occurs on a lineage within the clade of chromosomes carrying the inversion, it is highly likely to coalesce with another lineage within the clade of inverted samples. Similarly, if a recombination event occurs on a branch that does not harbor the inversion, it will likely not coalesce with any lineages in the clade of inverted samples. Consequently, samples that harbor an inversion should persist as a clade in the ARG for longer stretches of the genome than would otherwise be expected.
To identify such signals of locally suppressed recombination, Ignatieva et al. (2025) used the SMC to derive the distribution of the genomic span for a given clade, conditional on the local tree where the clade first appears. For a given input genealogy, DoLoReS then computes the genomic span of each observed clade, and uses the derived distribution to compute a p-value which indicates whether the clade has a longer genomic span than expected.
Using a genealogy reconstructed from human sequencing data (Speidel et al. 2019), DoLoReS identified 50 regions of the genome with strong signals of suppressed recombination. This includes the well-known inversion on 17q21.31, and a known region of complex structural variation on 16p12.2, as well as a novel 760 kb inversion on 10p22.3 which is common in South Asian populations—and which reads-based methods have failed to detect due to lack of strong signal from reads spanning the inversion breakpoints. The method also identified a number of regions with locally suppressed recombination that did not show strong evidence of structural variation; further analysis revealed that the observed signals may arise through expression-dependent suppression of crossovers within genes expressed in meiosis.
SMC-based Inference with Population Structure
A widely-adopted method, MSMC-IM (Wang et al. 2020) uses a lineage partitioning strategy to model lineages from two populations and infer a time-dependent migration rate estimate between them. Input sequences are colored by which of the two populations they were sampled from. Rather than restricting coalescence between lineages of different colors, as in the case of inversions (Peischl et al. 2013), MSMC-IM expects fewer coalescence events to happen between lineages from different populations, as these events reflect migration. MSMC-IM uses the SMC-based method MSMC2 (Malaspinas et al. 2016; Wang et al. 2020) to infer three separate rates of coalescence: within each population and between the two populations. It then uses these three histories to fit an isolation-migration model (IM; based on Hobolth et al. 2011). The fitting process effectively converts the three coalescence rates inferred from MSMC2 into effective population size histories for each population and the migration rate between them. MSMC-IM has been applied to both intra- and inter-specific questions (Wang et al. 2022; Lescroart et al. 2023; Crossman et al. 2024), leading to novel population genomic and phylogeographic inference. Specifically, researchers have used MSMC-IM to estimate the timing of divergence or secondary contact and to quantify the rate or magnitude of these events, thereby linking demographic history to climatic, ecological, or anthropogenic change. Comparing these parameters among population pairs across a species’ range enables inference of geographic patterns and the spatial scale of evolutionary processes, which can be especially informative when integrated with knowledge of habitat type. Below, we highlight specific use cases.
Mutation rate and generation time are key parameters for SMC interpretation and can substantially affect the magnitude and timing of inferred demographic events (Kovacs et al. 2025). Yet accurate empirical estimates are not always available for nonmodel species. One such species is the mosquito Aedes aegypti for which a human specialist Ae. aegypti aegypti is a major vector for several viral pathogens. MSMC-IM was used to understand the evolution of the human specialist population from the generalist Ae. aegypti formosus; however, the authors first needed to estimate these key parameters. To do this they created MSMC-IM curves from African and South American populations of Ae. aegypti aegypti, and calibrated the curve based on the timing of the human slave trade (Rose et al. 2023). Their calibration enabled them to identify that deep population structure in Africa accelerated as the landscape dried and water became associated with human settlements, thereby giving rise to the human specialist. Additional sampling from this system identified the phylogeography of global range expansion of this anthrophilic vector (Crawford et al. 2024).
MSMC-IM is also being applied to study introgression, i.e., the transfer of genetic material between previously isolated populations through hybridization and subsequent back-crossing with one of the parental populations (Dagilis et al. 2022). While unidirectional introgression from polar into brown bears has been previously inferred, the timing was unknown. Analysis of the migration rate curves from MSMC-IM detected the timing of speciation (500 kya–400 kya), and a previously unidentified ancient hybridization affecting global brown bear populations (100 kya; Wang et al. 2022). Similarly, MSMC-IM signatures between wolf and coyote populations identified both speciation and recent hybridization with on-going gene flow that aids in defining evolutionary units for conservation (Vilaça et al. 2023).
The use of MSMC-IM has been limited for questions of selection, but has supported downstream analyses. For example, Puckett et al. (2023) inferred pairwise migration rates among four populations of American black bears to parameterize a set of SLiM (Haller and Messer 2019) models with varying selection coefficients. They found contemporary allele frequencies of the causative locus for brown coat color were best described under a model with a low but positive selection coefficient. Wei et al. (2023) compared the population divergence time due to colonization of the Andes mountains of an endemic tomato species, Solanum chilense, to the age of selective sweeps within the population. They found that sweeps in the highlands population were generally closer to the time of colonization of the novel environment than in the age of sweeps in lowland populations.
Shchur et al. (2022) introduced MiSTI as an alternative SMC-based approach to account for population structure, that can infer changes in historical effective population sizes for two populations while correcting for the confounding effects of migration. The method combines coalescence rates inferred from PSMC with the joint site frequency spectrum (jSFS) from two diploid individuals to fit a time-dependent migration model. By incorporating both the PSMC results and the jSFS, MiSTI can disentangle the complex interplay between population size changes and migration, providing estimates of migration-corrected historical population sizes, split time between the populations, and asymmetric migration rates. The crux of MiSTI’s approach lies in its modeling distinctions between different conceptualizations of effective population sizes. Specifically, Shchur et al. (2022) modeled the distinction between the ordinary effective population size—i.e., as inferred by methods like PSMC—and the local effective population size. The ordinary effective population size of an admixed population is comprised of the local effective population sizes of the parental populations, which explains why PSMC results can be biased in the presence of population structure (Mazet et al. 2016). In contrast, the local effective population size represents the effective population size considering only unadmixed individuals within a parental population—i.e., the inverse coalescence rate conditional on both lineages belonging to the same population at a given time. By modeling the relationship between these two types of effective population sizes as a function of the split time and postseparation migration between the two populations, MiSTI can disentangle the effect of migration on changes in historical population sizes over time.
Rather than inferring demographic histories with migration, Steinrücken et al. (2018) developed diCal-ADMIX, an SMC-based method that conditions on a demographic model to identify genomic regions resulting from introgression. This model-based approach analyzes three populations: a target population (i.e., the proposed recipient of introgression), a putative source population, and a control population presumed to have no genetic contribution from the source population. diCal-ADMIX implements an HMM based on the SMCSD (Paul and Song 2010; Paul et al. 2011; Steinrücken et al. 2013) and requires a prespecified demographic model describing the joint evolutionary history of all three populations. The crux of this method relies on the SMCSD framework: at each locus, haplotypes from the control and source populations form two unchanging “trunk” genealogies, while target population haplotypes are iteratively sampled and absorbed by one of these trunk haplotypes. When iteratively sampling the target haplotypes, lineages belonging to different populations cannot coalesce unless a gene flow event occurs with a given rate at some point in the past. During a gene flow event, the target haplotype is absorbed by one of the two trunk genealogies, where the underlying dynamics of this absorption process are governed by the demographic model. Thus, the trunk genealogy and time of absorption for a target haplotype can change from locus to locus, reflective of the fact that target haplotypes are realized as a mosaic of the source and control populations. diCal-ADMIX models this absorption process for an additional target haplotype, with hidden states representing both the potential absorbing trunk populations and the timing of absorption, and transition and emission probabilities previously defined for the structured SMCSD with migration (Steinrücken et al. 2013). After decoding the HMM, the method marginalizes over absorption times and groups target haplotypes by trunk population, resulting in the probability that each target haplotype derives from either the control population or was introgressed from the source population for each locus. When applied to recover Neanderthal introgressed regions in European and East Asian populations (using African populations as the control), diCal-ADMIX confirmed the well-established pattern of reduced introgressed ancestry on the X chromosome compared to autosomes. Through subsequent gene enrichment analyses and simulations, Steinrücken et al. (2018) demonstrated that this depletion of Neanderthal ancestry is more consistent with the mutational load hypothesis (i.e., higher genetic load in Neanderthals due to smaller effective population size) rather than Dobzhansky–Müller incompatibilities (i.e., invoking reproductive isolation) between archaic and modern humans.
In contrast to the above methods that infer gene flow between given populations, Cousins et al. (2025) introduced cobraa, a method that detects the presence or absence of gene flow from an unsampled population using a single diploid sequence. cobraa is an extension of the PSMC algorithm that explicitly incorporates a model of population structure in the transition matrix of its HMM. This model of population structure assumes the given population was panmictic until time , at which point a fraction γ of lineages instantaneously migrate to an unsampled population. This unsampled population remains isolated until time , after which all lineages from both populations merge into a single panmictic ancestral population. Using the Baum-Welch algorithm (see Section 1.4), cobraa infers both the historical population size changes for the sampled population, the admixture fraction γ, the divergence and admixture times, and the size of the unsampled population. Importantly, cobraa does not assume a priori that the input diploid sequence was sampled from a structured population. Instead, it infers a structured model and an unstructured model (as in PSMC) and compares the log-likelihoods between them to determine which better explains the data. When the structured model provides a better fit, Cousins et al. (2025) introduced a complementary HMM, cobraa-path, which is conceptually similar to cobraa but with hidden states corresponding to both the discrete coalescence time intervals and ancestral lineage path, with modified transition and emission probabilities. The key innovation of cobraa-path is its ability to infer local ancestry states along the genome, identifying regions where none, one, or both lineages traced their ancestry through the unsampled population. In their analysis, Cousins et al. (2025) found that a model where an ancestral population diverges 1.5 Mya and subsequently admixes 300 kya with an unsampled population in a ratio of 80:20% fits human genomic data substantially better than a model without deep admixture. Furthermore, after inferring the genomic regions derived from this admixture event, the authors discovered a negative relationship between this admixed ancestry, the distance to the closest coding sequence, and the high-resolution B-map (which quantifies the strength of background selection). They suggest this pattern results from the purging of the unsampled population’s ancestry following the admixture event.
Using the SMC to Study Speciation
While most SMC-based methods focus on studying the evolutionary history of a single population, several approaches have adapted the SMC framework to study species-level relationships. These methods employ a lineage partitioning strategy defined by the underlying species tree, where lineages are colored by species identity and speciation events create barriers to coalescence.
This strategy was first implemented in CoalHMM (Hobolth et al. 2007; Dutheil et al. 2009), which is parameterized by a three-taxon species tree. Under this model, there are four possible coalescent histories: one lineage sorting, where lineages from the two sister species coalesce (or sort) into their most recent common ancestral species, and three incomplete lineage sorting, which are genealogically discordant with the species tree. CoalHMM models the sequence of local coalescent histories from a three-species alignment, with ancestral states polarized by an additional outgroup species. Unlike standard phylogenetic methods that infer divergence times (i.e., the mean cross-species coalescence time inferred from sequence divergence estimates; Patterson et al. 2006) using genomic data and fossil calibrations (Yang 2007; Ronquist et al. 2012; Bouckaert et al. 2019), CoalHMM directly infers speciation times (i.e., the time of the last gene flow event between the two species’ ancestors; Patterson et al. 2006) and ancestral effective population sizes by leveraging information from local coalescent histories constrained by the species tree. Divergence times inferred from genomic data are typically older than speciation times, as short intervals between speciation events and/or large ancestral effective population sizes can result in coalescent histories of incomplete lineage sorting, where loci from two species coalesce deeper in the past than their actual speciation time (Rivas-González et al. 2023). However, CoalHMM yields biased parameter estimates for the three-species case because coalescent times are modeled as single time points on a given branch (Dutheil et al. 2009).
To address these limitations, Mailund et al. (2011) reparameterized CoalHMM for a two-taxon species tree, where lineages from different species remain restricted from coalescing until the speciation event, but now time is split into discrete bins in an analogous way to PSMC. This two-species model enables more accurate inference of the speciation time and effective ancestral population size (Mailund et al. 2011), and has been extended to incorporate postspeciation migration between species (Mailund et al. 2012).
Building on both the three-species and two-species CoalHMM models, TRAILS jointly infers the topology and coalescent times of local genealogies from a three-species alignment (Rivas-González et al. 2024). Unlike many SMC-based methods that operate on pairs of lineages (or over all possible pairs of lineages), TRAILS infers local genealogies by modeling both the topology (based on the four possible coalescent histories) and the timing of two coalescence events (using discrete time intervals along the species tree). This approach enables unbiased inference of speciation times and ancestral effective population sizes for the three-species tree without requiring fossil calibrations. Moreover, TRAILS can reconstruct the ARG for the three species from its posterior decoding, allowing researchers to not only recover demographic parameters associated with the speciation history, but also study the underlying process of speciation itself.
Diverging from the above methods, which rely on sequence data from two or three species, Cousins et al. (2024b) demonstrated the remarkable applicability of PSMC for studying the speciation history of our own species. By analyzing the posterior distribution of coalescence times from PSMC, the authors showed that a diploid genome sampled from a present day human, chimpanzee, or gorilla does not fully coalesce until beyond 10Mya, and that PSMC is relatively robust to model violations in these ancient time periods. This indicates that coalescent-based inference can be extended much further into the past than previously thought, and in particular that a diploid primate genome contains enough information to elucidate the speciation process with its closest relatives.
SMC in ARG Inference
The SMC (and its extensions) is a full probabilistic model describing the distribution of sample genealogies. While it is reasonably simple to simulate a genealogy under such a model (as described in Section 1, Algorithm 1), the problem of inferring or reconstructing plausible genealogies conditional on a given sample of sequences is notoriously difficult, due to the huge search space involving both discrete (ARG topology) and continuous (branch length) components. Tremendous progress in ARG inference has been made in the past decade using various simplifying approximations to the SMC (for recent review articles on ARGs, see Brandt et al. 2024; Lewanski et al. 2024; Nielsen et al. 2025; Wong et al. 2024).
Rasmussen et al. (2014) developed ARGweaver, a Bayesian method which outputs a sample from the posterior distribution of possible ARGs for a given genomic sample. ARGweaver relies on the idea of constructing an ARG by “threading” in sequences conditioned on the partial ARG of the previously added sequences, which is conceptually similar to the SMCSD framework (Paul and Song 2010; Paul et al. 2011; Steinrücken et al. 2013). The core threading procedure reconstructs the ARG for n lineages by determining the set of branches and coalescence times for the nth lineage to join the partial ARG of lineages, in a manner that is consistent with both the SMC model and the observed genetic variation. Given that the threading procedure is based on the underlying dynamics of the SMC, it is implemented as an HMM, where the hidden states represent the possible branches and coalescence times for the lineage being threaded to join the partial ARG, the transition probabilities describe the probability of branch-specific subtree prune and regraft operations under the SMC, and the emission probabilities correspond to the likelihood of the observed sequence data at the current genomic interval conditioned on the marginal genealogy, which is computed using Felsenstein’s pruning algorithm (Felsenstein 1973, 1981). ARGweaver relies on a Markov chain Monte Carlo (MCMC) sampler to explore the vast space of possible ARGs conditioned on the data and prespecified model parameters (e.g., mutation rate, recombination rate, and the demography). Each MCMC iteration involves removing a branch (or subtree) across all marginal genealogies and then re-threading it using the HMM. This allows for the quantification of uncertainty in both the ARG topology and estimated event times, but the HMM requires time to be discretized, and the procedure is too computationally intensive to scale beyond around 100 haploid genomes (Hubisz and Siepel 2020). To overcome the assumption that all input lineages coalesce in a panmictic ancestral population, ARGweaver-D (Hubisz et al. 2020) was introduced as a demography-aware extension that allows explicit modeling of gene flow events between distinct input lineages. The method was used to infer that ∼3% of the Neanderthal genome derives from an introgression from the early ancestors of modern humans 200 kya–300 kya. More recently, SINGER (Deng et al. 2025a) further enhanced the scalability of the threading approach by improving the efficiency of the HMM and MCMC sampler, enabling ARG inference for up to a couple of hundreds of haploid genomes (with human-like parameters). The inference from ARGweaver has been shown to be powerful for downstream applications such as fitting Bayesian population size history (Palacios et al. 2015) or identifying allele frequency changes to learn about selection (Stern et al. 2019; Vaughn and Nielsen 2024).
Arbores (Heine et al. 2018) is another ARG inference method that relies on the SMC, but adopts an alternative approach to threading: while this has the advantage of not requiring time discretization, it suffers from similar issues with scalability to ARGweaver. ARGinfer (Mahmoudi et al. 2022) reconstructs ARGs under the full CwR model. This is more computationally intensive than ARGweaver, but shows some improvement in the inference of evolutionary parameters (particularly recombination rate).
In contrast to these Bayesian methods that provide posterior samples, other methods which scale to much larger datasets forego the goal of principled uncertainty estimation in the ARG topology and/or event times, and instead rely on approximations to infer a single sensible ARG. For example, Relate (Speidel et al. 2019) and tsinfer/tsdate (Kelleher et al. 2019; Wohns et al. 2022) both use a modified Li and Stephens haplotype copying model (Li and Stephens 2003) to reconstruct local genealogies along the genome. On the other hand, ARG-Needle (Zhang et al. 2023) and Threads (Gunnarsson et al. 2024) reconstruct the ARG using a threading procedure, but instead of using MCMC to sample from the posterior distribution, they use a set of “threading instructions” to sequentially graft a haploid genome into the partial ARG, which results in a single inferred ARG after iteratively threading each haploid genome. While these methods do not provide estimates of uncertainty in both topology and event times like their Bayesian counterparts, they are able to scale to (hundreds of) thousands of haploid genomes. Another challenge is that ARGs inferred by these tools do not contain sufficient information for calculating likelihoods under the SMC, however some recent methodological progress has been made in this direction (Bisschop et al. 2025).
Apart from being key to their methodology, the SMC has also been used as a null model to assess the accuracy of these tools. Quantities calculated from reconstructed ARGs can be compared to their expectations under the SMC. Deng et al. (2021) and McKenzie and Eaton (2025) used the SMC to derive the distribution of the genomic distance between successive local trees, and Ignatieva et al. (2025) derived various distributions characterizing the genomic span of clades and edges in the ARG. This strategy presents an alternative to breaking up reconstructed ARGs into local trees and applying tree-based metrics to measure how close reconstructed ARGs are to the simulated ground truth (Kelleher et al. 2019; Speidel et al. 2019; Brandt et al. 2022).
Empirical Considerations
Like all inference tools, estimates generated by SMC-based methods are limited by the properties and quality of the available data. This issue is particularly pronounced when studying nonmodel systems with limited existing genomic resources.
Data Requirements
SMC-based HMMs estimate the coalescence rate (inverse ) at each recombination block from the set of variants observed within each block (see Section 1.4). The number of linked SNPs (HMM observations) in a single recombination block improves the estimate of the genealogy at that locus (HMM hidden state). The total number of recombination blocks across the sequence becomes the number of independent tMRCA estimates, and in turn, modulates the precision of the inference.
Genotyping Resolution
The properties of an organism’s genome and the proportion of its bases that can be confidently genotyped can limit the number of SNPs and blocks available to analyze (Nadachowska-Brzyska et al. 2016). The precision and stability of SMC-derived parameter estimates improve with the ratio of the mutation rate (θ) to the recombination rate (ρ), scaled by the proportion of the genome that is confidently genotyped (p). Although SMC-based methods generally remain robust even with fragmented datasets (Liu and Hansen 2017; Patton et al. 2019) demonstrated that PSMC estimates are typically reliable when , even when applied to datasets produced by reduced representation sequencing.
Low sequencing coverage can lead to inaccuracies in genotype calls, as factors such as sequencing technology, sample quality, and filtering criteria impact the ability to confidently call genotypes. Population genetic tools such as ANGSD (Korneliussen et al. 2014) have circumvented the need for strict genotype calls by utilizing genotype likelihoods. However, SMC-based methods have yet to adopt a framework that accounts for genotype uncertainty, with ARGweaver and ARGweaver-D being two notable exceptions that can use phred-likelihoods or genotype-likelihoods as input for ARG inference.
A promising way to mitigate issues with sequencing quality is to incorporate heritable markers beyond SNPs. Recently, Sellinger et al. (2024) introduced an SMC-based inference method, SMCtheo, which can in principle accommodate any genomic marker, provided that its occurrence can be modeled as a homogeneous Poisson mutation process along the genome and through time, thereby making it informative of the underlying genealogy. Analyzing hyper-mutable markers alongside SNPs has the potential to improve accuracy of inferred demographic histories over more recent timescales. For instance, SMCtheo has been successfully applied with Single Methylated Polymorphisms (SMPs). Sellinger et al. (2024) used both SNP and SMP markers simultaneously to estimate recent population bottlenecks in A. thaliana. However, not all hyper-mutable markers are suitable; for example, the lengths of Differentially Methylated Regions surpass the typical genomic distances between recombination events, making them incompatible with the SMC framework.
Genome Phasing
Even with high-quality genotype calls, whether the genomic data is phased or unphased is a key constraint. Because the coalescent process models the ancestral relationships between sampled haplotypes, rather than individuals, many SMC-based methods require phased data. This requirement limits researchers, as statistical phasing is only possible for large sample sizes, and its alternatives require large, phased reference panels. Meeting either condition can be challenging for systems outside of humans. A notable exception is PSMC and its variants, which only require homozygous/heterozygous calls from single diploid individual. MSMC2 can infer historical population sizes using unphased genomes, but its inferences are restricted to within-individual comparisons (i.e., not across pairs of genomes). This mode results in lower resolution compared to phased datasets, and cannot be used for population separation analyses or inference of historical migration rates using MSMC-IM.
Nevertheless, there are a handful of SMC-based methods that circumvent the necessity of phased data. One such approach is MiSTI, which provides an attractive alternative to MSMC-IM. MiSTI only considers two diploid individuals, and bypasses the need for phasing by utilizing coalescence rates inferred by PSMC along with the jSFS between the two individuals. Other SMC-based methods capable of handling unphased data include ARGweaver and ARGweaver-D. When the input data is unphased, these methods perform ARG inference by integrating over all possible phasings during their threading procedure.
Recency Drawback
A drawback of many SMC-based methods is their limited resolution in recent time frames. SMC-based inference works by inferring a local coalescence rate between two sequences from the mutation density (Section 1.4). When considering only two sequences, the majority of coalescence events are expected to occur in the more distant past, resulting in a dearth of information for SMC-based methods to exploit for making accurate inferences in more recent time periods (Nadachowska-Brzyska et al. 2016; Terhorst et al. 2017; Patton et al. 2019; Santiago et al. 2020). This issue is particularly concerning for conservation research, which often relies on a limited number of sequences to study the drivers of recent changes in population sizes.
While this limited resolution in recent time frames is most pronounced in SMC-based methods that make inferences from a single diploid individual (e.g., PSMC and its subsequent variants), methods such as MSMC2 have better resolution for this time frame because they make inferences from multiple individuals. However, MSMC2’s ability to infer recent changes in population sizes is confounded by phasing errors (Terhorst et al. 2017). In response to this confounding effect of phasing errors on demographic inference, Terhorst et al. (2017) developed SMC_++, as a phase-invariant alternative. This strategy leverages information from the rest of the samples to inform the genealogies of the haplotype pair being analyzed. Specifically, it improves the reliability of estimates, especially for notoriously difficult recent time-scales (Terhorst et al. 2017; Patton et al. 2019). SMC++_ also builds on PSMC by accommodating larger sample sizes and applying regularization to improve estimation error (e.g., fitting smooth splines instead of assuming piecewise constant population sizes).
Outside of the inherent theoretical challenges, SMC-based methods can also produce biased inferences of on more recent time scales due to a lack of a priori demographic knowledge for the study system, or less than ideal choices of parameter values for the inference method (Parag and Pybus 2019). For example, Hilgers et al. (2025) reported a distinctive pattern of peaks in followed by a dramatic decline in recent time scales inferred by PSMC across many studies. Through an extensive simulation-based and empirical sensitivity analysis, Hilgers et al. (2025) found that these biased results could be explained by unobserved population structure or solely relying on PSMC’s default settings for specifying the discrete time windows, which were designed for studying humans. Mitigation strategies include changing the discretization of time to identify statistical artifacts in recent time bins and validating results with simulated data. However, accurately inferring recent demographic changes remains a challenge for all SMC-based approaches that solely rely on a single diploid sequence (Sellinger et al. 2021; Hilgers et al. 2025).
SFS as a Summary Statistic for Recent Times
An alternative to SMC-based methods are those that rely on the distribution of allele frequencies in a population, i.e., the site frequency spectrum (SFS). SFS-based approaches are especially useful for recent temporal resolution, as the SFS is a more informative summary statistic for recent demographic events (i.e., within the last 100 generations) (Beichman et al. 2018; Patton et al. 2019; Reid and Pinsky 2022). SFS methods have shown enough resolution to model recent complex migration events in highly heterogeneous human populations (Serradell et al. 2024), and to trace a more complete picture of the demographic context of ancient samples (Kamm et al. 2020; Sümer et al. 2025). However, their accuracy is influenced by the sample size used to construct the SFS (Terhorst et al. 2015; Beichman et al. 2018; Patton et al. 2019; Reid and Pinsky 2022). SFS-based models assume independence among SNPs and do not rely on linkage disequilibrium, making them a flexible alternative when no reference genome is available, for lower-depth whole genome sequencing, and even for reduced representation sequencing data (Beichman et al. 2018; Liu and Fu 2020; Excoffier et al. 2021; Lesturgie et al. 2022). However, the SFS suffers from identifiability and sensitivity issues (Myers et al. 2008; Bhaskar and Song 2014; Terhorst and Song 2015; Baharian and Gravel 2018; Rosen et al. 2018) which can lead to poorly supported inferences of population history (Cousins and Durvasula 2025; Deng et al. 2025b). Alternatively, LD-based methods can outperform both SMC- and SFS-based approaches; however, these methods gain power and accuracy primarily with larger whole-genome sample sizes and require complete reference assemblies (Beichman et al. 2018; Santiago et al. 2020; Reid and Pinsky 2022; Fournier et al. 2023), creating a practical trade-off.
Conclusions and Further Directions
The SMC has proven to be a powerful and versatile model, enabling researchers to study questions about demographic history, natural selection, mutation and recombination rates, life history traits, structural variation, population structure, gene flow, and speciation. As SMC-based methods tackle more complex inference tasks and accommodate less stringent data requirements, there is a pressing need to validate these methods considering demographic parameters that reflect more realistic and complex biological and ecological scenarios. While tools are often validated using simulations, these may not fully capture the complexities of natural sequences (Wang et al. 2021). Furthermore, as demonstrated by Hilgers et al. (2025), users must also validate their specific implementation choices with simulations to ensure that inferences are consistent with the observed sequence data, allowing for robust interpretation of results. Given that even modest amounts of phasing errors can bias SMC-based inferences (Terhorst et al. 2017), it would be desirable for future methods to remain agnostic to phase-level resolution. This would avoid potential sources of bias and enable the application of SMC-based methods to a wider range of empirical systems where phasing is unavailable or unreliable. However, recent advances in long-read sequencing technologies (van Dijk et al. 2023; Warburton and Sebra 2023) and the generation of complete telomere-to-telomere reference genomes (Naish et al. 2021; Nurk et al. 2022; Chen et al. 2023; Huang et al. 2023; O’donnell et al. 2023; Liu et al. 2024; Yoo et al. 2025) may soon make phasing far less of a concern for SMC-based inference. These exciting developments, together with the rapid increase of resequencing data, highlight the importance of continuing to develop SMC-based methods that are fast, efficient, and scalable. Approaches such as Gamma-SMC demonstrate how methodological innovations can keep pace with ever-growing genomic datasets, and provide a motivating example for future methods development. At the same time, extending SMC-based methods to handle low-coverage data would also be beneficial, broadening their applicability to more nonmodel systems, which often have fewer genomic resources, as well as to the ever-growing datasets of low-coverage ancient genomes (Mallick et al. 2024).
In addition to addressing data-level biases, model-level biases must also continue to be addressed. Current approaches like MSMC-IM, cobraa, and MiSTI highlight that the assumption of panmixia is a common model violation in SMC-based inference. However, these methods model population structure differently, and further exploration of alternative structured models is warranted. In a similar vein, with TRAILS demonstrating the ability to infer the ARG for three species, extending such approaches to incorporate varying ancestral effective population sizes (as in PSMC-based inferences) and models of speciation with gene flow would be valuable, especially as introgression is now increasingly recognized as a frequent process across the eukaryotic tree of life (Dagilis et al. 2022). While SMC-based methods for simulating and studying structural variation have been developed (Peischl et al. 2013; Ignatieva et al. 2025), further work is needed to make these applicable to long-read sequencing datasets, to more fully understand the evolution of SVs of different types.
Another avenue for methodological work is to directly examine the transition matrix of SMC-based HMMs (Sellinger et al. 2021). ARG inference methods offer a way to estimate this transition matrix directly from data, which has been explored as input for SMC-based inference (Strütt et al. 2023). In certain cases, the posterior distribution can be decoded exactly—without the use of a discrete HMM—which avoids biases introduced by the discretization of time (Ki and Terhorst 2024; Schweiger and Durbin 2023). As shown by Cousins et al. (2025), the transition matrix itself can be used to discern between demographic histories of panmixia versus population structure. This approach highlights the potential for using the transition matrix as a summary statistic within an Approximate Bayesian Computation framework to test more complex competing demographic models.
Methods based on the SMC continue to emerge and prove useful for questions beyond historical effective population size. Accommodating alternative targets and nonstandard coalescence processes has made SMC-based inference methods relevant beyond human population genetics. Concurrently, statistical methods for reconstructing ARGs from sequencing data have turned to the SMC to efficiently generate and evaluate genealogies. We hope to bring awareness of new SMC-derived approaches into ecology, conservation, and natural history to support their creative use in empirical studies, as well as highlight their relevance as we move into the ARG era.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Ashburner M . Drosophila: a laboratory manual. Cold Spring Harbor Press; 1989.
- 2Baharian S, Gravel S. On the decidability of population size histories from finite allele frequency spectra. Theor Popul Biol. 2018:120:42–51. 10.1016/j.tpb.2017.12.008.29305873 · doi ↗ · pubmed ↗
- 3Barroso GV, Dutheil JY. The landscape of nucleotide diversity in Drosophila melanogaster is shaped by mutation rate variation. Peer Community J. 2023:3:e 40. 10.24072/pcjournal.267. · doi ↗
- 4Barroso GV, Puzović N, Dutheil JY. Inference of recombination maps from a single pair of genomes and its application to ancient samples. P Lo S Genet. 2019:15:e 1008449. 10.1371/journal.pgen.1008449.31725722 PMC 6879166 · doi ↗ · pubmed ↗
- 5Beichman AC, Huerta-Sanchez E, Lohmueller KE. Using genomic data to infer historic population dynamics of nonmodel organisms. Annu Rev Ecol Evol Syst. 2018:49:433–456. 10.1146/ecolsys.2018.49.issue-1. · doi ↗
- 6Bhaskar A, Song YS. Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data. Ann Stat. 2014:42:2469–2493. 10.1214/14-AOS 1264.28018011 PMC 5175586 · doi ↗ · pubmed ↗
- 7Bisschop G, Kelleher J, Ralph P. Likelihoods for a general class of AR Gs under the SMC. Genetics. 2025:iyaf 103. 10.1093/genetics/iyaf 103.40439129 PMC 12774825 · doi ↗ · pubmed ↗
- 8Bouckaert R et al BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. P Lo S Comput Biol. 2019:15:e 1006650. 10.1371/journal.pcbi.1006650.30958812 PMC 6472827 · doi ↗ · pubmed ↗
