Optimizing genomic sampling for demographic and epidemiological inference with Markov decision processes

David A Rasmussen; Madeline G Bursell; Frank Burkhart

PMC · DOI:10.1093/genetics/iyaf244·November 11, 2025

Optimizing genomic sampling for demographic and epidemiological inference with Markov decision processes

David A Rasmussen, Madeline G Bursell, Frank Burkhart

PDF

Open Access

TL;DR

This paper introduces a new framework using Markov decision processes to optimize genomic sampling strategies for better demographic and epidemiological insights.

Contribution

The novel use of Markov decision processes to model and optimize genomic sampling strategies for maximizing information gain.

Findings

01

Markov decision processes can efficiently identify optimal sampling strategies for estimating population growth rates.

02

The framework helps minimize transmission distance between sampled individuals in genomic epidemiology.

03

Optimal sampling strategies can be identified for estimating migration rates between subpopulations.

Abstract

Inferences from population genomic data provide valuable insights into the demographic history of a population. Likewise, in genomic epidemiology, pathogen genomic data provide key insights into epidemic dynamics and potential sources of transmission. Yet, predicting what information will be gained from genomic data about variables of interest and how different sampling strategies will impact the quality of downstream inferences remains challenging. As a result, population genomics and related fields such as phylodynamics and phylogeography largely lack theory to guide decisions on how best to sample individuals for genomic sequencing. By adopting a sequential decision making framework based on Markov decision processes, we model how sampling interacts with a population’s demographic history to shape the ancestral or genealogical relationships of sampled individuals. By…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Chemicals1

gold

Diseases4

MDP death infectious pathogen infected

Figures8

Click any figure to enlarge with its caption.

Fig. 1 — Challenges associated with optimizing genomic sampling. The three examples illustrate how the informational value of sampling can depend on the ancestral relationships among sampled individuals. a) In coalescent-based inference tasks such as inferring past changes in population size, the ancestral relationships among sampled individuals will determine when coalescent events occur, which will in turn depend on the population size N(τ) at time τ in the past. For example, sampling at present may provide little information about recent population sizes if all sampled lineages coalesce deep in the past. b) In genomic epidemiology, interest often lies in reconstructing transmission trees to learn about potential transmission pathways. Sampling individuals linked by one or a small number of intervening transmission events (i.e. short transmission distances) can provide more information about who may have infected whom. However, the distance between sampled individuals may depend on past sampling. Here, the transmission distance between individual C and its nearest sampled neighbor depends entirely on who was previously sampled. c) In phylogeography, the movement of individuals is reconstructed based on the probable location of ancestral nodes. However, different patterns of movement can be inferred depending on exactly which individuals are sampled from each population. In this case, sampling individual A reveals an additional migration event that could dramatically alter the most probable root state, and therefore the presumptive source population, whereas sampling individual B provides little additional information about movement.

Fig. 2 — The exponential growth coalescent MDP. At each decision time t, the agent decides what action at to take in terms of how many individuals to sample. The state st is determined by the sample configuration, the number of samples taken at each time point up to and including t. The reward function quantifies how much Fisher information about the growth rate γ is gained from a sampling action. Here, we illustrate computing the reward for the light gray node sampled at t=3. Although a total of four individuals are sampled at t=3, the sample configuration st=[1,2,3] and a=1 because we are considering the incremental value of sampling the fourth individual conditional upon having already sampled the three other individuals. The amount of Fisher information gained from sampling this individual will depend on the population size N(τcoal) at the time τcoal when the sampled lineage coalesces with the rest of the sample. Because we do not know when τcoal will occur before sampling, we compute the expected reward by integrating over all possible coalescent times.

Fig. 3 — The transmission tree distance MDP. At each decision time t, the agent decides what action at to take in terms of which individuals to sample. The state st is determined by the set of all individuals sampled up to t, where ∅ represents a set with no samples. The reward function quantifies the reward gained from sampling individuals at a given transmission distance dmin(i). Here, we illustrate computing the rewards based on inverse distances where the reward received from sampling an individual is equal to the inverse of the transmission distance between the newly sampled node and its nearest sampled neighbor.

Fig. 4 — The structured coalescent MDP. At each decision time t, the agent decides on an action at that determines how many individuals are sampled from each population. The state st is determined by the number of samples taken from each population at all times up to and including t. The reward function quantifies how many coalescent events occur between populations as a proxy for the information gained about the estimated migration rates. Thus sampling individual A provides a reward r=1 as it adds a coalescent event occurring between populations, whereas sampling individual B does not and therefore provides a reward r=0. Because we do not know when and where sampled lineages will coalesce, the expected rewards are computed by integrating over all possible coalescent times τ and possible states of the parent and child lineages.

Fig. 5 — Expected rewards under the exponential growth coalescent MDP. Time is measured in units of 1/γ going backwards from the present at τ=0. a) The expected reward or Fisher information gained from sampling different numbers of individuals at different time points computed analytically (solid lines) versus the rewards returned from Monte Carlo simulations (dashed lines). b) The optimal sampling strategy obtained through value iteration. The optimal sampling action at each time is marked by a gold diamond. The expected net reward of each possible action is computed conditional on making the optimal action at all other times points. In all cases, N(0)=50 and the growth rate γ=0.1.

Fig. 6 — Performance of the exponential growth coalescent MDP at computing informational rewards. a) The rewards (Fisher information) versus costs of sampling across all 15,625 policies considered. Colors indicate the net reward returned for each policy and the optimal policy found by both the value iteration and a brute force search is marked by the gold diamond. b) Mean reward of each policy from Monte Carlo simulations versus expected reward computed analytically under the MDP. c) Expected Fisher information rewards versus root mean squared error (RMSE) of maximum likelihood estimates (MLEs) of the growth rate. d) Error in MLEs predicted based on expected Fisher information versus the observed RMSE.

Fig. 7 — Expected rewards versus costs of sampling uniform fractions of individuals in the transmission tree MDP. In A and C, rewards are computed assuming the value of a sampled individual is inversely proportional to its distance to its nearest sampled neighbor. In B and D, rewards are computed based on the number of direct transmission pairs sampled. Rewards are computed both analytically under the MDP (solid lines) and by taking the mean across 100 Monte Carlo simulations (dashed lines). a) Rewards versus costs of different sampling fractions with rewards based on inverse weighting. b) Rewards versus costs of different sampling fractions with rewards based on direct pairs. c) Expected net rewards with rewards based on inverse weighting assuming different values of information (VOI), which weights the relative value of rewards versus the costs. d) Expected net rewards with rewards based on direct pairs.

Fig. 8 — Expected rewards under the structured coalescent MDP. a) The expected reward of sampling varying fractions of individuals in each population at a single time point computed analytically under the MDP. b) The mean rewards observed from Monte Carlo simulations of the structured coalescent. In both A and B, the number of migration events is square root transformed. c) The average root mean squared error (RMSE) of maximum likelihood estimates (MLEs) of the migration rates from 100 simulations at each pair of sampling fractions. The reported RMSE is averaged over migration rates estimated in both directions (i.e. RMSE=(RMSE(m12)+RMSE(m21))/2). d) The net reward of sampling when considering the costs of sampling. In all simulations, N=50 in each population and migration is symmetric between populations (κ=1.0) with the migration rates m12=m21=0.02 per generation.

Tables4

Table 1.. Summary of MDP model notation and parameter definitions.

Parameter	Definition	Parameter	Definition
t	decision time point index	$r (s, a)$	expected reward of action a in state s
$τ_{t}$	actual time at which t occurs	$c (s, a)$	expected cost of action a in state s
$s_{t}$	state at time t	π	policy function
$S$	set of all possible states	$π^{*}$	optimal policy function
$a_{t}$	action at time t	$v_{π} (s_{t})$	expected long-term value of $s_{t}$
$A$	set of all possible actions	$q_{π} (s_{t}, a_{t})$	expected long-term value of $a_{t}$ in $s_{t}$
r	reward value	ω	value of information
$R$	set of all possible rewards

Table 2.. Optimal sampling strategies for estimating population growth rates.

Growth rate γ	Pop size N(0)	Cost decrease^a	$a_{t_{0}}^{*}$	$a_{t_{1}}^{*}$	$a_{t_{2}}^{*}$	$a_{t_{3}}^{*}$	$a_{t_{5}}^{*}$	Sim rank^b
0.05	50	1X	1.0	1.0	0.25	0	0	1st
0.1	50	1X	1.0	1.0	0.25	0	0	3rd
0.2	50	1X	1.0	1.0	0.25	0	0	1st
0.1	100	1X	1.0	1.0	0	0	0	1st
0.1	200	1X	1.0	1.0	0	0	0	1st
0.1	50	2X	1.0	1.0	1.0	0	0	1st
0.1	50	5X	1.0	1.0	0.25	0	0.25	2nd
0.1	50	10X	1.0	1.0	0.25	0.25	0.25	2nd

Table 3.. Optimal sampling strategies for minimizing transmission distances.

Pop dynamics	Reward	VOI^a	$a_{t_{0}}^{*}$	$a_{t_{1}}^{*}$	$a_{t_{2}}^{*}$	$a_{t_{3}}^{*}$	$a_{t_{4}}^{*}$	Sim rank^b
Constant	Inverse	0.5	0.25	0	0	0	0	1st
Constant	Inverse	0.75	1	1	1	1	1	4th
Constant	Inverse	1.5	1	1	1	1	1	1st
Constant	Direct	0.5	0.25	0	0	0	0	1st
Constant	Direct	1.0	0.5	1	1	1	1	5th
Constant	Direct	1.5	0.75	1	1	1	1	3rd
Doubling	Inverse	0.5	1	0.5	0	0	0	1st
Doubling	Inverse	1.0	1	1	1	1	0.25	1st
Doubling	Inverse	1.5	1	1	1	1	1	1st
Doubling	Direct	0.5	1	0.5	0	0	0	1st
Doubling	Direct	1.0	1	1	1	1	0.25	1st
Doubling	Direct	1.5	1	1	1	1	1	1st

Table 4.. Optimal sampling strategies for estimating migration rates.

Migration rate m	Ratio κ^a	$a_{p_{1}, t_{0}}^{*}$	$a_{p_{2}, t_{0}}^{*}$	$a_{p_{1}, t_{1}}^{*}$	$a_{p_{2}, t_{1}}^{*}$	$a_{p_{1}, t_{2}}^{*}$	$a_{p_{2}, t_{2}}^{*}$	Sim rank^b
0.02	1.0	0.18	0.18	0.16	0.16	0.16	0.16	1st
0.1	1.0	0.2	0.2	0.18	0.18	0.16	0.16	1st
0.2	1.0	0.2	0.2	0.18	0.18	0.16	0.16	3rd
0.02	2.0	0.16	0.2	0.14	0.18	0.14	0.18	1st
0.1	2.0	0.16	0.2	0.14	0.2	0.14	0.2	1st
0.2	2.0	0.16	0.2	0.16	0.16	0.0	0.2	4th
0.02	10.0	0.1	0.2	0.12	0.2	0.1	0.18	1st
0.1	10.0	0.1	0.2	0.12	0.2	0.1	0.2	1st
0.2	10.0	0.1	0.2	0.12	0.2	0.08	0.2	8th

Equations23

Funding3

—U.S. Center for Disease Control and Prevention10.13039/100000030
—National Science Foundation10.13039/100000001
—U.S. Department of Agriculture’s10.13039/100000199

Keywords

population genomicsdemographic inferencegenomic epidemiologysampling theoryMarkov decision process

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenetic Associations and Epidemiology · Genetic diversity and population structure · COVID-19 epidemiological studies

Full text

Introduction

The ability to sequence many individual genomes from a population has greatly expanded our ability learn about a population’s demographic history, including changes in population size (Kuhner et al. 1998; Drummond et al. 2005), population structure (Beerli and Felsenstein 1999; Müller et al. 2017), and ancestry (Kelleher et al. 2019; Hubisz et al. 2020; Spyrou et al. 2022). Genomic epidemiology provides a remarkable example of this trend, with pathogen genomes now being routinely analyzed to infer changes in epidemic dynamics (Stadler et al. 2013; Attwood et al. 2022) and reveal potential pathways of transmission (De Maio et al. 2016; Ratmann et al. 2016; Eyre et al. 2021). Despite these advances, it remains generally unclear how best to sample individuals for the purposes of genomic sequencing and downstream demographic or epidemiological inference (Wohl et al. 2021; Inward et al. 2022). Because it is usually only feasible to sample a small fraction of the total population, questions about sampling design inevitably arise, including: who to sample, what sample sizes are sufficient, and how representative must sampling be to provide reliable inferences?

Unfortunately, unlike other areas of modern statistical inference (Robbins 1952; Chernoff 1992; Atkinson 1996), population genomics and genomic epidemiology largely lack theory to guide decisions on how best to sample and sequence individual genomes (herein we refer to both sampling and sequencing simply as sampling). Previous work considered optimizing sampling for genetic variant discovery, but largely neglected downstream statistical inferences (Ionita-Laza and Laird 2010; Hicks et al. 2020; Masoero et al. 2022; Wohl et al. 2023). In absence of such guidance, sampling is often performed based on simple heuristics or opportunistically based on available resources (Frost et al. 2015; Brito et al. 2022). As a result, sampling can be highly biased and non-representative. These sampling biases can in turn bias downstream demographic inferences, which may be quite sensitive to violations of assumptions about how individuals are sampled (Hall et al. 2016; Karcher et al. 2016; Guindon and De Maio 2021; Layan et al. 2023). For example, in phylogeography, over or under sampling a population can significantly bias downstream inferences of migration rates and the reconstructed location of ancestral lineages (De Maio et al. 2015; Müller et al. 2017).

At the same time, optimizing sampling strategies in population genomics can be challenging for several reasons (Fig. 1). First, the optimal sampling strategy may be highly dependent on a population’s demographic history, which may not be fully observable (Fig. 1a). Second, the information gained from sampling may be highly dependent on past or future sampling efforts. Optimal strategies therefore need to reflect what data is already available and what data may become available in the future (Fig. 1b). Third, what information will be gained from new data is often highly unpredictable. Sampling more may in some cases improve estimates, but in other cases may exacerbate existing biases (Fig. 1c). As we elaborate below, these challenges largely stem from a more fundamental problem: the information gained from genomic data almost always depends on how sampled individuals are related to one another through their shared ancestry or phylogenetic history, yet these relationships cannot be known prior to sampling. Ultimately, this inability to predict the informational value of genomic data makes it difficult to answer even the most pragmatic questions about sampling, such as what is an appropriate sample size?

Challenges associated with optimizing genomic sampling. The three examples illustrate how the informational value of sampling can depend on the ancestral relationships among sampled individuals. a) In coalescent-based inference tasks such as inferring past changes in population size, the ancestral relationships among sampled individuals will determine when coalescent events occur, which will in turn depend on the population size N(τ) at time τ in the past. For example, sampling at present may provide little information about recent population sizes if all sampled lineages coalesce deep in the past. b) In genomic epidemiology, interest often lies in reconstructing transmission trees to learn about potential transmission pathways. Sampling individuals linked by one or a small number of intervening transmission events (i.e. short transmission distances) can provide more information about who may have infected whom. However, the distance between sampled individuals may depend on past sampling. Here, the transmission distance between individual C and its nearest sampled neighbor depends entirely on who was previously sampled. c) In phylogeography, the movement of individuals is reconstructed based on the probable location of ancestral nodes. However, different patterns of movement can be inferred depending on exactly which individuals are sampled from each population. In this case, sampling individual A reveals an additional migration event that could dramatically alter the most probable root state, and therefore the presumptive source population, whereas sampling individual B provides little additional information about movement.

As a step towards optimizing genomic sampling, we pose the problem of sampling design within a sequential decision making framework (Bellman and Kalaba 1959; Chernoff 1992). This framework naturally emulates the way decisions about sampling need to be updated as new information becomes available, regardless of whether sampling occurs at a single time point (i.e. iso- or homochronous sampling) or sequentially through time (i.e. heterochronous sampling). Moreover, this framework allows us to jointly model the sampling process together with the underlying ecological or epidemic dynamics of a population as a Markov decision process (MDP). MDPs are widely used to optimize the control of dynamical systems (Bellman 1957; Alagoz et al. 2010; Azouri et al. 2024), and are increasingly being applied to problems such as optimizing epidemic control and antibiotic drug treatment strategies (Lefévre 1981; Bussell et al. 2019; Weaver et al. 2024). While our interest lies not in directly controlling the dynamics of a population, we show how the well-developed theoretical machinery of MDPs can be adapted to efficiently predict the long-term reward or value of a given sampling strategy in terms of the information gained about an estimated parameter or variable. MDPs can therefore guide and structure the search towards strategies that optimize the expected value of sampling.

While our framework has much in common with standard MDPs, extending them to genomic sampling presents several difficulties. We therefore begin with a general review of MDPs for sequential decision making problems before explaining how MDPs can be adapted to optimize genomic sampling. We then apply our framework to three common scenarios in population genomics where the ancestry of the sample can be captured by a single genealogy or phylogenetic tree: estimating growth rates of a growing population, minimizing transmission distances between sampled individuals and estimating migration rates between subpopulations. In the Results, we explore optimal sampling strategies for each scenario and compare the learned strategies against brute force searches to demonstrate that the strategies identified by MDPs are in fact globally optimal.

Materials and methods

Markov decision processes

We first describe the core components of a generic MDP. In a MDP, an agent representing a decision maker interacts with its environment through a series of actions. At each decision time point, the agent receives a reward based on the chosen action and the state of the environment, which may depend on the agent’s previous actions. In most applications, the goal is for the agent to learn how to make actions that maximize the total or long-term reward.

Adopting standard notation (Alagoz et al. 2010; Sutton and Barto 2018), let $[eqn]$ and $[eqn]$ represent the state of the environment and the action the agents chooses at time t, respectively. The dynamics of the environment are described by a set of transition probabilities $[eqn]$ , which provide the probability of the state changing from state $[eqn]$ at time t to $[eqn]$ at time $[eqn]$ given that action $[eqn]$ was taken by the agent. The actions the agent takes are determined by a policy function $[eqn]$ , which may be either deterministic or stochastic. For any action a taken from any state s, the agent receives a reward $[eqn]$ with probability $[eqn]$ . The expected reward the agent receives for any state-action pair is given by the reward function:

[eqn]

We assume throughout that decisions are made at a fixed sequence of time points $[eqn]$ over a finite horizon ending at $[eqn]$ . However the underlying dynamics of the Markov process may evolve on a different timescale in discrete or continuous time depending on the application. To distinguish between these timescales, we let $[eqn]$ represent the actual time at which decision t is made. The time elapsed between two successive decision points $[eqn]$ is arbitrary and not necessarily constant.

We are generally interested in finding an optimal policy function $[eqn]$ that maximizes the long-term reward the agent receives. The optimal policy $[eqn]$ is therefore formally defined as the strategy that maximizes the expected cumulative sum of future rewards:

[eqn]

In order to learn the optimal policy, we need to be able to compute the expected value or long-term reward of a given policy. The expected value of a given policy π starting in state $[eqn]$ can be computed recursively using the Bellman equations (Bellman 1957; Bellman and Kalaba 1959):

[eqn]

We can also write the Bellman equations in terms the expected value of taking action $[eqn]$ in state $[eqn]$ under a given policy π:

[eqn]

The value function $[eqn]$ enables us to compute the total expected reward that will be gained over time if a given action is taken at the current time and we continue to act according to the policy π. Being able to compute these expected long-term values is vitally important for identifying actions that are globally optimal and not just locally optimal in terms of maximizing the immediate rewards received upon taking an action. As we shall see, this becomes especially important for learning optimal genomic sampling strategies where the long-term rewards of sampling at present will often depend on past and future sampling decisions. Table 1 provides a summary of the MDP model notation used throughout the remainder of the manuscript.

Learning optimal strategies by dynamic programming

The value functions presented in (3) and (4) can be efficiently evaluated using dynamic programming (Bellman 1954). To evaluate the expected value of a given policy π, we use an iterative policy evaluation algorithm (Sutton and Barto 2018). Starting from an initially arbitrary estimate of the value $[eqn]$ for each state $[eqn]$ , each iteration of the algorithm updates the estimate of $[eqn]$ using the expected value of successor states $[eqn]$ obtained from previous iterations. This procedure is iterated until the algorithm converges on a stable estimate of V.

To identify the optimal policy, we use a backward induction or value iteration algorithm (Alagoz et al. 2010; Sutton and Barto 2018). This algorithm combines iteratively updating expected values (as in policy evaluation) with improvements to the policy. This procedure is iterated until the algorithm converges on a stable estimate of V where the policy can no longer be improved. Given the optimized value function V, we can then find the optimal action $[eqn]$ to take from any state $[eqn]$ : $[eqn]$ . Identifying $[eqn]$ for all possible states therefore provides the optimal policy. Pseudocode for the policy evaluation and value iteration algorithms is provided in Supplementary Text 1.

Markov decision processes for genomic sampling

Before considering specific MDPs for genomic sampling, we consider the general motivation behind our approach. Our general goal is to maximize the amount of information we gain from genomic data about estimated variables of interest, while also minimizing the total cost of sampling. How much information we gain from sampling will generally depend on how sampled individuals are related to one another through their shared ancestry or phylogenetic history. For example, how much information is gained from sampling an individual under standard coalescent models will depend on when coalescent events occur in the ancestry of the sample (Fig. 1a). While it may be possible to reconstruct these phylogenetic relationships from genomic data, we do not know how sampled individuals will be related to one another prior to sampling, and thus cannot directly predict what information will be gained from sampling. This is indeed one of the major challenges of designing optimal sampling strategies: we generally do not know the value of sampling until we actually sample.

Our MDP framework attempts to circumvent this problem by finding the expected value of sampling without a priori knowledge of how sampled individuals will be related. In essence, we compute expected values by marginalizing or integrating over all possible ancestral or phylogenetic histories of the sample. When computing the expected value of a newly sampled individual, we consider where and when this individual might “attach” or coalesce with the rest of the sample. We can then compute the expected value of the new sample by integrating over all possible attachment points, weighting each by what information would be gained from observing an event at that time and/or location. For simplicity, we assume no recombination or other horizontal exchanges of genetic material, such that the ancestry of the sample can be captured by a single phylogenetic tree.

This MDP framework can be thought of as a special case of the more general MDPs considered above. The Markov process describes how the underlying population dynamics gives rise to the phylogeny relating the sampled individuals. On top of this Markov process is a sequential decision making process that determines who or how many individuals are sampled at each decision time point. In this case:

The state represents the sample configuration, typically the number or set of individuals previously sampled.Decision time points represent sampling events.Actions update the state through sampling events.The action policy represents the sampling strategy.The rewards reflect information gained from the sample about estimated quantities of interest.

We stress that the actual amount of time elapsed between decision time points is arbitrary. Multiple sampling events can occur at a single point in real time (e.g. the present) or sequentially through time at any specified frequency. Our framework can therefore be used to determine when to sample by allowing for the possibility of taking no samples at a given time point. Moreover, when the “present” occurs is a matter of perspective. While we adopt the convention that the final sampling time represents the present, the MDP framework readily accommodates prospective sampling where sampling events occur in the future.

In order to find the optimal sampling strategy, we need to be able to evaluate the long-term expected value of a given sampling or action policy. This in turn requires us to compute the expected reward $[eqn]$ of taking a particular action from a given state. In the relatively simple MDPs we consider next, these expected rewards can be computed analytically. While it may not always be possible to compute expected rewards analytically, other strategies can be used to approximate these rewards, such as by averaging over the rewards returned from Monte Carlo simulations of the MDP. Either way, once we can compute the the expected rewards, we can combine the value functions in (3) and (4) with dynamic programming to learn the optimal sampling policy that maximizes long-term expected rewards.

Exponential growth coalescent MDP

We first consider a MDP for sampling individuals under the coalescent with exponential growth (Slatkin and Hudson 1991; Kuhner et al. 1998). Fig. 2 provides a schematic representation of the MDP. Our overall goal is to optimize sampling in order to maximize information about the estimated population growth rate γ while minimizing the total cost of sampling.

The exponential growth coalescent MDP. At each decision time t, the agent decides what action at to take in terms of how many individuals to sample. The state st is determined by the sample configuration, the number of samples taken at each time point up to and including t. The reward function quantifies how much Fisher information about the growth rate γ is gained from a sampling action. Here, we illustrate computing the reward for the light gray node sampled at t=3. Although a total of four individuals are sampled at t=3, the sample configuration st=[1,2,3] and a=1 because we are considering the incremental value of sampling the fourth individual conditional upon having already sampled the three other individuals. The amount of Fisher information gained from sampling this individual will depend on the population size N(τcoal) at the time τcoal when the sampled lineage coalesces with the rest of the sample. Because we do not know when τcoal will occur before sampling, we compute the expected reward by integrating over all possible coalescent times.

Markov process

We assume that the population grows deterministically in continuous time such that the population size $[eqn]$ at time τ in the past is given by:

[eqn]

where $[eqn]$ is the population size at present.

While the population dynamics are deterministic, the coalescent model tracks the ancestry of sampled individuals backwards through time as a stochastic process. If we make the standard assumption that the number of lineages ancestral to the sample is small relative the total population size (Kingman 1982; Hudson 1990), ancestral lineages coalesce at a rate $[eqn]$ determined by the time-varying population size $[eqn]$ and the number of ancestral lineages k present in the genealogy:

[eqn]

The probability of a coalescent event occurring at time τ in the past follows an exponential-like probability density:

[eqn]

Action space

On top of the underlying coalescent process, there is a sequential decision making process where at each time point t the agent decides how many individuals $[eqn]$ to sample according to the policy $[eqn]$ .

Note that while time in the coalescent model goes backwards from $[eqn]$ at present, we order the decision time points in forward time from $[eqn]$ at the earliest decision time (furthest in the past) to the last decision time point at $[eqn]$ . We use $[eqn]$ to refer to the actual time at which decision time point t in units of backwards coalescent time.

State space

The state $[eqn]$ of the system is determined by the current sample configuration $[eqn]$ , the number of individuals sampled at each decision time up to and including time t. The population size $[eqn]$ is assumed known from the deterministic population dynamics.

Reward function

We use Fisher information to quantify the expected amount of information we will gain about the population growth rate from sampling individual genomes. In general, Fisher information measures the information an observed random variable X provides about an estimated parameter θ (Efron and Hinkley 1978; Ly et al. 2017). For a continuous parameter θ, Fisher information can be defined as:

[eqn]

[eqn]

where $[eqn]$ is the likelihood or probability density of X given θ and the expectation is taken over all possible values of the random variable X.

Intuitively, by considering the second-derivative of the log likelihood, Fisher information quantifies the curvature of the likelihood support curve around a given value of θ. More Fisher information therefore means the likelihood is more peaked around θ, and thus that we have more certainty about the value of θ. Moreover, Fisher information can be used to predict the expected error in a maximum likelihood estimate (MLE) of θ. Because the sampling distribution of the MLEs should asymptotically converge to a normal distribution with a standard error given by the inverse of the square root of the Fisher information $[eqn]$ , the error in a MLE of θ is predicted to be $[eqn]$ (Ly et al. 2017).

Because sampling each individual adds one additional coalescent event to the phylogenetic tree, we consider the Fisher information $[eqn]$ gained from each coalescent event. With constant population size N, the waiting times between coalescent events are independently and exponentially distributed with rate parameter $[eqn]$ . The Fisher information $[eqn]$ about N provided by each coalescent event is therefore the same as the observation of an exponentially distributed random variable $[eqn]$ (Parag and Pybus 2019). Because the waiting time between events are independent, the information provided by each sample is additive such that the total Fisher information provided by m samples is $[eqn]$ .

For an exponentially growing population, the waiting times between coalescent events remain independently distributed random variables, but they are no longer exponentially distributed because the coalescent rate increases as N decreases into the past. However, we assume that $[eqn]$ is still proportional to the population size at the time τ at which coalescent events occurs such that $[eqn]$ . Simulation results suggest this is an excellent approximation (see Fig. 6c). Thus, coalescent events occurring further in the past when the population is small will carry more information about the growth rate than events closer to present.

Expected rewards

The value of sampling each individual depends on when the sampled lineage coalesces with the other lineages in the sample, which we cannot observe before sampling. Therefore, in order to compute the expected value of sampling a given individual, we track the probability density $[eqn]$ for the time τ at which the sampled individual coalesces with the other sampled lineages, conditional on the current sample configuration $[eqn]$ . Full details on how $[eqn]$ is computed are provided in Supplementary Fig. 1 and Supplementary Text 2.

Given $[eqn]$ , the expected value of sampling one individual in terms of Fisher information $[eqn]$ can then be computed by integrating over the unknown time at which the newly sampled lineage coalesces with the other sampled lineages:

[eqn]

Because the reward of sampling an individual at time t will depend on how many other individuals are sampled at time t, we rewrite $[eqn]$ as the tuple $[eqn]$ to distinguish between samples taken at previous times and the present time $[eqn]$ . The full expected value of sampling $[eqn]$ individuals conditional on $[eqn]$ can then be built up iteratively by considering the value of sampling one individual at a time conditional upon having already sampled k other individuals at time t:

[eqn]

Transmission tree distance MDP

Minimizing the transmission distance, or the number of intervening transmission events between sampled individuals, is often a goal in genomic epidemiology when interest lies in linking infected individuals based on genetic data as a means to infer who might have infected whom (Jombart et al. 2014; Worby et al. 2016; Wohl et al. 2021). The goal of our second MDP therefore is to minimize the transmission distance between sampled individuals in a transmission tree (Fig. 3). Nodes in the transmission tree represent infected individuals and edges linking nodes describe who infected whom through transmission events. We define the transmission distance $[eqn]$ to be the number of edges traversed along the shortest path connecting nodes i and j in a transmission tree $[eqn]$ . A direct transmission pair will therefore have $[eqn]$ .

The transmission tree distance MDP. At each decision time t, the agent decides what action at to take in terms of which individuals to sample. The state st is determined by the set of all individuals sampled up to t, where ∅ represents a set with no samples. The reward function quantifies the reward gained from sampling individuals at a given transmission distance dmin(i). Here, we illustrate computing the rewards based on inverse distances where the reward received from sampling an individual is equal to the inverse of the transmission distance between the newly sampled node and its nearest sampled neighbor.

Markov process

The topology of the transmission tree is determined by who infects whom in the host population. For simplicity, we assume infections occur in non-overlapping generations with $[eqn]$ infections present in generation t. In each generation, the parent node of each newly infected node is chosen by randomly sampling nodes in the previous generation with uniform probability and with replacement. The epidemic dynamics of how $[eqn]$ changes through time is assumed to be deterministic. The process generating the tree is therefore equivalent to a Wright-Fisher process with time-varying population size.

Action policy

At each generation t, the agent decides how many individuals $[eqn]$ to sample according to the policy $[eqn]$ and then randomly samples a set $[eqn]$ of individuals. Because the transmission distance between samples will depend on exactly which individuals are sampled, we assume that the agent remembers the full set of individuals $[eqn]$ sampled at generation t.

State space

The state $[eqn]$ of the system is given by the sample configuration $[eqn]$ composed all individuals sampled up to generation t. Note that in this case we assume decision time points occur on the same timescale as generations in the transmission process.

Reward function

The reward the agent receives for sampling is determined by the transmission distance between each sampled individual and its nearest sampled neighbor in the transmission tree. We denote the transmission distance between i and its nearest sampled neighbor as $[eqn]$ . For a given set of sampled individuals $[eqn]$ :

[eqn]

We explore two reward functions that differ in how the rewards depend on $[eqn]$ . We refer to the first as the inverse distance reward function because it assumes the value of a sample is inversely proportional to its distance to its nearest sampled neighbor:

[eqn]

We refer to the second as the direct pairs reward function as it assumes that only direct transmission pairs are of interest. In this case:

[eqn]

Expected rewards

In order to compute the expected value of sampling a given individual, we first compute the probability $[eqn]$ that the transmission distance between the individual and its nearest sampled neighbor in $[eqn]$ is k. In brief, computing $[eqn]$ requires integrating over all possible transmission trees, all possible locations (i.e. nodes) at which the newly added sample could attach in these trees, and the corresponding distances between these possible attachment nodes and the sample. Full details on how $[eqn]$ is computed are provided in Supplementary Fig. 2 and Supplementary Text 3.

To compute the expected value of a sample, we sum the rewards received at each possible transmission distance k weighted by the probability $[eqn]$ that the sample has a nearest sample at that distance. Assuming we sample a single individual i, the expected reward is:

[eqn]

However, an additional complication arises here because sampling at present may change the distance of previous samples to their nearest sample. For example, we may sample a direct child of an individual sampled one generation in the past, converting their transmission distance to one. We therefore need to compute the total expected value of sampling all previously and newly sampled individuals $[eqn]$ :

[eqn]

To compute the total expected value gained from sampling a set of individuals $[eqn]$ at present, we then compare the rewards expected with and without the new samples in $[eqn]$ :

[eqn]

Structured coalescent MDP

Finally, we consider a MDP for sampling individuals under a structured coalescent model (Fig. 4) where individuals can migrate between two populations (Notohara 1990; Beerli and Felsenstein 1999). For an infectious pathogen, migration events may correspond to transmission events between different host populations. Our goal is to optimize sampling in order to maximize information about the estimated migration rates.

The structured coalescent MDP. At each decision time t, the agent decides on an action at that determines how many individuals are sampled from each population. The state st is determined by the number of samples taken from each population at all times up to and including t. The reward function quantifies how many coalescent events occur between populations as a proxy for the information gained about the estimated migration rates. Thus sampling individual A provides a reward r=1 as it adds a coalescent event occurring between populations, whereas sampling individual B does not and therefore provides a reward r=0. Because we do not know when and where sampled lineages will coalesce, the expected rewards are computed by integrating over all possible coalescent times τ and possible states of the parent and child lineages.

Markov process

The coalescent process underlying the MDP is similar to the exponential growth coalescent model except now individuals can migrate between populations. We assume that migration events occur through birth or transmission events between populations and that migrations cannot occur independently of births. The migration rate $[eqn]$ is defined as the probability that a child in population j has a parent in population i in the previous generation (Nordborg 1997). The proportion of individuals descended from parents in the same population is therefore $[eqn]$ . The size of each population $[eqn]$ may deterministically vary through time independent of the other population.

Going backwards through time, pairs of lineages coalesce at a rate $[eqn]$ that depends on what populations i and j they reside in:

[eqn]

where here i denotes the state of the parent lineage and j the state of the child lineage.

Action space

At each decision time point t, the agent decides how many individuals $[eqn]$ to sample in populations i and j according to the policy $[eqn]$ . To simplify the action space, we again consider a small set of sampling fractions in each population. The set of all possible actions $[eqn]$ therefore includes all pairwise combinations of sampling fractions between populations.

State space

The state $[eqn]$ of the system is determined by the sample configuration $[eqn]$ , where now $[eqn]$ .

Reward function

In theory, Fisher information could again be used to quantify information about the estimated migration rates. However, computing Fisher information under a structured coalescent model does not appear to be analytically tractable as computing the likelihood requires integrating over the unknown state (e.g. population or location) of all lineages in the phylogeny (Volz 2012; Rasmussen et al. 2014; Müller et al. 2017). We therefore sought a simpler reward function to quantify information about migration rates.

We first reasoned that inferences about migration rates are largely drawn based on information about the probable state of lineages at the time of coalescent events. For example, if two coalescing lineages have a high probability of being in different populations, we can infer that the coalescent event likely occurred at a birth or transmission event between populations as opposed to an event within either population. Combining information about the probable ancestral state of lineages at coalescent events across a phylogeny allows for migration rates to be estimated, while allowing for uncertainty in the ancestral states. Based on this logic, we further reasoned that the information gained from a phylogeny about migration rates will depend on the number of coalescent events that occur between as opposed to within populations. For example, if we are interested in estimating the migration rate $[eqn]$ , we expect that the number of coalescent events $[eqn]$ that occur where the parent lineage is in population i and one of the child lineages is in population j will be a reasonable proxy for the information gained about $[eqn]$ .

For our two-population model, we estimate the migration rates $[eqn]$ and $[eqn]$ corresponding to migration in both directions. Our reward function therefore quantifies the number of migration events that occur in either direction:

[eqn]

We square root transform the number of events as this was found to better predict the error in estimated migration rates in practice (see Fig. 8).

Expected rewards

As before, sampling each individual adds one additional coalescent event to the phylogeny. However, we do not know when this coalescent event will occur nor the states of the lineages involved at the time of the event. In order to compute the expected value of sampling a given individual, we therefore track the probability density $[eqn]$ for the time τ of the coalescent event along with the state of the parent lineage u and the state of the sampled child lineage v at the coalescent event, conditional upon the current sample configuration $[eqn]$ . Full details on how $[eqn]$ is computed are provided in Supplementary Fig. 3 and Supplementary Text 4.

Given $[eqn]$ , the expected value of sampling one individual in population i ( $[eqn]$ ) can by computed by considering the total probability that the coalescent event added by sampling occurs between lineages in different populations ( $[eqn]$ and $[eqn]$ or $[eqn]$ and $[eqn]$ ) while integrating out the unknown coalescent time:

[eqn]

When sampling more than a single individual at one time, we rewrite $[eqn]$ as the tuple $[eqn]$ to distinguish between samples taken in previous generations and the samples taken at present in different populations. The full expected value of sampling $[eqn]$ individuals can then be built up iteratively by considering the value of sampling one individual from each population at a time, conditional on $[eqn]$ :

[eqn]

Modeling the cost of sampling

A trade-off between the information gained from sampling (i.e. the reward) and the cost of sampling naturally arises in most settings (Masoero et al. 2022; Wohl et al. 2023). When optimizing sampling strategies, we therefore incorporate the costs of sampling into our reward functions by considering a net reward function:

[eqn]

where $[eqn]$ represents an application-specific reward function and $[eqn]$ represents the corresponding cost of taking action a from state $[eqn]$ . For simplicity throughout, we assume a fixed cost per sample such that costs increase linearly with sampling effort at each time point: $[eqn]$ .

To help ensure costs and rewards are measured on a comparable scale, we rescale both the costs and rewards relative to their maximum possible values among all policies being considered. As a result, a reward $[eqn]$ implies the maximal possible amount of information has been gained. Likewise, a cost $[eqn]$ implies that no more expensive strategy exists in policy space.

Value of information

Rewards and costs will not always be directly comparable on any natural scale (i.e. monetary value). Moreover, we may sometimes value the information gained from sampling more than its actual monetary cost (Canessa et al. 2015). For example, we may value sampling individuals with short transmission distances more if the information we gain about transmission sources can help prevent future infections. For the transmission tree MDP, we therefore explore different values of information ω that re-weight the relative value of the rewards relative to the costs:

[eqn]

Implementation and validation

Python code implementing all three MDPs alongside the policy evaluation and value iteration algorithms is available at: https://github.com/davidrasm/opt-sampling-mdp. The implementation of the MDPs was validated by comparing analytical results obtained by dynamic programming against Monte Carlo simulations of the MDP. For the exponential growth coalescent MDP, trees were simulated in msprime v1.3.3 (Baumdicker et al. 2022). For the transmission tree distance MDP, trees were simulated in Python with the ttd_mdp.py script. For the structured coalescent MDP, trees were simulated in SLiM v4.3 (Haller and Messer 2019) using the script WF_assymetric_migration.slim. In all three cases, simulated trees were stored and manipulated as tree objects in tskit v0.6 (Kelleher et al. 2018). Maximum likelihood estimates of model parameters were obtained from simulated trees using previously described likelihood functions for the exponential growth (Kuhner et al. 1998) and structured coalescent models (Volz 2012; Müller et al. 2017).

Results

Optimal sampling to estimate population growth rates

We first consider how the exponential growth coalescent MDP can be used to optimize sampling strategies for estimating population growth rates. Here, we use a reward function based on Fisher information $[eqn]$ to quantify the information gained about the growth rate γ from the sampled data. To explore how the rewards depend on sampling times and sizes, we first consider sampling at single time points (Fig. 5a). Overall, the amount of information gained about the growth rate decreases exponentially as we sample closer to the present, such that sampling more in the distant past generally provides substantially more information than sampling equivalent numbers of individuals near the present. The expected rewards computed analytically under the MDP closely track the rewards observed in Monte Carlo simulations.

Expected rewards under the exponential growth coalescent MDP. Time is measured in units of 1/γ going backwards from the present at τ=0. a) The expected reward or Fisher information gained from sampling different numbers of individuals at different time points computed analytically (solid lines) versus the rewards returned from Monte Carlo simulations (dashed lines). b) The optimal sampling strategy obtained through value iteration. The optimal sampling action at each time is marked by a gold diamond. The expected net reward of each possible action is computed conditional on making the optimal action at all other times points. In all cases, N(0)=50 and the growth rate γ=0.1.

Next, we use the value iteration algorithm to find the optimal sampling strategy when sampling sequentially through time. More specifically, we consider the problem of choosing the optimal sampling fraction from $[eqn]$ at six sampling time points uniformly spaced between the present and the time $[eqn]$ at which $[eqn]$ . The optimal action in terms of sampling fractions goes from $[eqn]$ at the earliest sampling time furthest in the past to $[eqn]$ at present. Intuitively, this strategy is optimal because sampling earlier results in coalescent events occurring deeper in the past, which provide more Fisher information about the growth rate, while the cost of sampling higher fractions also increases as the population size grows towards present.

To ensure the optimal strategy we identified is indeed globally optimal, we performed a brute force search over all $[eqn]$ possible strategies in the policy space. For each policy, we evaluate the expected value analytically using policy evaluation and then compared this expected value against the mean reward returned from 100 Monte Carlo simulations of the MDP. Computing the reward returned by each policy against its corresponding cost reveals an apparent optimality front beyond which the informational rewards cannot be increased without paying higher costs (Fig. 6a). Nevertheless, there are a large number of policies that yield high Fisher information for relatively low cost. Policies with high net reward all concentrate sampling deep in the past. The optimal policy found by this brute force search is indeed the same as the optimal policy identified by value iteration. Moreover, rewards computed analytically under the MDP are highly correlated with the the mean reward returned by Monte Carlo simulations of each policy (Pearson correlation $[eqn]$ ; Fig. 6b).

Performance of the exponential growth coalescent MDP at computing informational rewards. a) The rewards (Fisher information) versus costs of sampling across all 15,625 policies considered. Colors indicate the net reward returned for each policy and the optimal policy found by both the value iteration and a brute force search is marked by the gold diamond. b) Mean reward of each policy from Monte Carlo simulations versus expected reward computed analytically under the MDP. c) Expected Fisher information rewards versus root mean squared error (RMSE) of maximum likelihood estimates (MLEs) of the growth rate. d) Error in MLEs predicted based on expected Fisher information versus the observed RMSE.

To explore how well Fisher information predicts our ability to estimate growth rates, we obtained a maximum likelihood estimate (MLE) of the growth rate γ under the coalescent model for each Monte Carlo simulation. As expected, the root mean squared error (RMSE) in the MLE declines asymptotically with the expected Fisher information (Spearman rank correlation $[eqn]$ ; Fig. 6c). Moreover, because the error in our MLEs should be inversely related to the square root of the Fisher information $[eqn]$ , we can use this relationship to predict the error in the MLEs ( $[eqn]$ ). As expected, Fisher information accurately predicts the RMSE in the MLEs (Pearson correlation $[eqn]$ ; Fig. 6d).

We further explored optimal sampling strategies under a broader range of scenarios (Table 2). The optimal strategy is invariant to the population growth rate: low ( $[eqn]$ ), medium ( $[eqn]$ ) and high ( $[eqn]$ ) growth rates all result in the same optimal strategy. The optimal strategy changes slightly with the final population size $[eqn]$ . Larger population sizes shift the optimal strategy to sampling even less at time points midway between the earliest and latest sampling time points. Lastly we consider scenarios where the cost of sampling decreases linearly through time, such that sampling at the earliest time is 2 to 10-fold higher than sampling at present. Such a scenario may occur, for example, if resources are limited at the beginning of an epidemic. Scenarios where the cost of sampling is 5-fold or 10-fold higher at the earliest times shift the optimal strategy slightly away from sampling at intermediate times towards sampling small fractions at present. However, the optimal strategy always remains to sample as much as possible at the earliest time points when the population is still small. Sampling towards present adds little additional value, even if the cost of sampling substantially decreases over time.

Optimal sampling to minimize transmission distances

We next consider how to optimize sampling in order to minimize the distance between sampled individuals using the transmission tree distance MDP. To this end, we consider two different reward functions to quantify the distance between sampled individuals. Rewards based on inverse distances assume that the value of a sample is inversely proportional to its distance to its nearest sampled neighbor. Rewards based on direct pairs assume that a reward is only received for sampling direct transmission pairs.

We first consider the rewards versus costs of sampling a fixed fraction of individuals every generation in a constant sized population. Assuming the costs grow linearly with the sample fraction, the rewards grow more slowly than the costs under both reward functions (Fig. 7a,b). This is especially clear when rewards are computed based on direct transmission pairs, as the expected number of direct pairs grows with the square of the sample fraction ( $[eqn]$ ) and thus rewards grow quadratically with sampling effort.

Expected rewards versus costs of sampling uniform fractions of individuals in the transmission tree MDP. In A and C, rewards are computed assuming the value of a sampled individual is inversely proportional to its distance to its nearest sampled neighbor. In B and D, rewards are computed based on the number of direct transmission pairs sampled. Rewards are computed both analytically under the MDP (solid lines) and by taking the mean across 100 Monte Carlo simulations (dashed lines). a) Rewards versus costs of different sampling fractions with rewards based on inverse weighting. b) Rewards versus costs of different sampling fractions with rewards based on direct pairs. c) Expected net rewards with rewards based on inverse weighting assuming different values of information (VOI), which weights the relative value of rewards versus the costs. d) Expected net rewards with rewards based on direct pairs.

Assuming the rewards are directly comparable to costs such that the value of information (VOI) $[eqn]$ , the net reward function is concave, with intermediate sample fractions providing the lowest net reward for both forms of the reward function (Fig. 7c,d). In this case, the expected net reward of sampling either no one or everyone is very close to zero, such that both extreme sampling strategies can be considered optimal. However, when the value of information is low ( $[eqn]$ ), sampling no one is optimal. Conversely, if the value of information is higher ( $[eqn]$ ), sampling everyone becomes the optimal strategy. The optimal sampling fraction therefore strongly depends on the assumed value of information, but sampling low to intermediate fractions of individuals is generally the least optimal strategy. This is true regardless of the assumed population size because costs are computed relative to the total cost of sampling everyone.

We further explore optimal sampling strategies when sampling fractions are allowed to vary through time using value iteration. To reduce the size of the policy space, we consider deciding among 5 sampling fractions $[eqn]$ at 5 consecutive generations. Even when allowing for sampling fractions to vary through time, the optimal strategy generally remains to sample either no one or everyone depending on the assumed value of information (Table 3). An exception to this general trend occurs at the earliest sampling time (deepest in the past) when the optimal action can deviate from the optimal action at other times. This is most likely due to a boundary effect created by assuming all individuals share a single non-sampled ancestor one generation before the earliest sample time. For instance, at low VOI $[eqn]$ , it may be worth sampling some individuals at the earliest time due to the fact that all individuals will necessarily share a common ancestor at the previous generation and thus have low transmission distances.

We further considered optimal sampling strategies when population sizes doubled between each sampling time, starting from a population size $[eqn]$ at the first sampling time deepest in the past. Again, the optimal strategy is generally to sample no one at low VOI and sample nearly everyone at higher VOIs (Table 3). The exception to this trend occurs at the earliest sampling time, when the population size is very small, where it may still be optimal to sample even at low VOI. Conversely, at the latest sampling time when the population size is large, it may be optimal to sample fewer individuals even when the VOI is high.

To ensure the optimal strategies identified by value iteration are indeed optimal, we performed a brute force search over all $[eqn]$ possible strategies under each scenario considered above using the mean reward returned by 100 Monte Carlo simulations to approximate the expected value of each policy. The optimal policy identified by a brute force search is in most cases identical to the one identified by value iteration. In cases where the optimal strategy differs there are generally multiple strategies with nearly identical expected values. Across all policies, the rank order of polices computed analytically under the MDP is strongly correlated with the rank order according to the Monte Carlo simulations (Spearman rank correlation $[eqn]$ ).

Optimal sampling to estimate migration rates

Finally, we consider how the structured coalescent MDP can be used to optimize sampling strategies for estimating migration or transmission rates between populations. To quantify how much information sampling provides about the migration rates, our reward function counts the expected number of coalescent events in the phylogeny of sampled individuals that occur between as opposed to within populations. As before, we first consider the simple case of sampling at a single time point, varying the fraction of individuals sampled in each population. The reward increases linearly as the sampling fraction increases in both populations and thus more coalescent events occur between instead of within populations (Fig. 8a). The rewards computed analytically under the MDP closely match the mean rewards returned from Monte Carlo simulations of the structured coalescent (Fig. 8b).

Expected rewards under the structured coalescent MDP. a) The expected reward of sampling varying fractions of individuals in each population at a single time point computed analytically under the MDP. b) The mean rewards observed from Monte Carlo simulations of the structured coalescent. In both A and B, the number of migration events is square root transformed. c) The average root mean squared error (RMSE) of maximum likelihood estimates (MLEs) of the migration rates from 100 simulations at each pair of sampling fractions. The reported RMSE is averaged over migration rates estimated in both directions (i.e. RMSE=(RMSE(m12)+RMSE(m21))/2). d) The net reward of sampling when considering the costs of sampling. In all simulations, N=50 in each population and migration is symmetric between populations (κ=1.0) with the migration rates m12=m21=0.02 per generation.

We also explored how the root mean squared error (RMSE) in the maximum likelihood estimates (MLEs) of the migration rates vary with the sampling fraction in both populations. Averaged across 100 Monte Carlo simulations at each pair of sampling fractions, the RMSE increases as the sampling fraction decreases in both populations (Fig. 8c). We further noticed that the RMSE tends to scale with the square root of the the number of coalescent events between populations. We therefore square root transform the number of events in the reward function (see (19)). With this reward function, the optimal strategy is to sample relatively low fractions of individuals symmetrically from both populations (Fig. 8d).

We further consider optimal sampling strategies when sampling fractions are allowed to vary between populations as well as across time using value iteration (Table 4). We consider deciding among $[eqn]$ sampling fractions in each population, such that there are now $[eqn]$ possible actions at each time. We therefore only consider sampling at three time points to limit the policy space to a feasible size.

With low and symmetric migration rates ( $[eqn]$ ), the optimal policy remains to sample relatively low fractions (f=0.16 - 0.18) in both populations through time. Slightly higher sampling fractions are optimal at earlier time points deeper in the past, presumably because this increases the chances of lineages sampled at later time points coalescing at events between populations. With higher but still symmetric migration rates ( $[eqn]$ and $[eqn]$ ), the optimal strategy shifts slightly towards sampling higher fractions of individuals in both populations, reflecting the fact that there will on average be more migration occurring and thus more reward to be gained at higher migration rates.

We also consider optimal sampling strategies with asymmetric migration between populations, which we quantify as the ratio of the migration rates $[eqn]$ . Higher κ values imply migration is predominantly occurring from population 1 into 2, such that population 1 acts as source and population 2 as a sink. With moderately asymmetric migration ( $[eqn]$ ), the optimal strategy shifts to preferentially sampling more in the sink than in the source population, especially at later time points. This strategy may be optimal because individuals sampled in the sink population are likely to have a recent ancestor and thus coalesce with lineages in the source population, whereas individuals sampled in the source population will likely coalesce with other lineages sampled in the source population, providing little additional information about migration. With highly asymmetric migration ( $[eqn]$ ), preferentially sampling in the sink population becomes even more favored.

To ensure the sampling strategies identified by value iteration are globally optimal, we performed a brute force search over a subset of 729 out of a total of $[eqn]$ million possible strategies under each scenario considered above. In almost all cases, the optimal policy identified by the brute force search is closest to the policy identified by value iteration. Ranking policies based on their total expected value or net reward, the rank order determined by policy evaluation is strongly correlated with the rank order based on Monte Carlo simulations (Spearman rank correlation $[eqn]$ ). As expected, the the total predicted rewards are strongly negatively correlated with the RMSE in MLEs of the migration rates (Pearson correlation $[eqn]$ ), suggesting that our reward function does indeed provide a reliable proxy for the information gained about migration rates from simulated phylogenies.

Discussion

By framing sampling in population genomics as a sequential decision making process, we can apply the well-developed machinery of MDPs to explore optimal genomic sampling strategies. The resulting MDP framework overcomes one of the main challenges in designing genomic sampling strategies—MDPs allow us to predict the expected reward or value of genomic sampling in terms of what information will be gained about estimated variables. Because the information gained from sampling will generally depend on how sampled individuals are related, we combined MDPs with coalescent models that probabilistically consider how sampled individuals are related in a phylogeny or transmission tree to compute expected rewards. Once we can compute the expected value of sampling, the problem of designing optimal strategies reduces to finding actions or policies that maximize long-term expected rewards, which can be done very efficiently using dynamic programming.

Our MDP models provide several insights into how genomic sampling can be optimized for standard inference problems in population genomics and genomic epidemiology. Moreover, while the models we considered are relatively simple, many of these insights appear to generalize across the different scenarios and model parameterizations we explored. First, the exponential growth MDP shows that optimal sampling strategies for estimating population growth rates tend to concentrate sampling as early as possible, regardless of the growth rate or final size of the population. This remains true even if the costs of sampling decrease significantly over time. While our MDP model assumed deterministic population growth, it is likely that the need to sample early will extend to stochastic population dynamics and therefore inference under birth-death phylodynamic models. When population sizes are small, randomness in the timing of individual birth and death events can lead to “stuttering” of growth (Boskova et al. 2014; Stadler et al. 2015). Increasing early sampling would therefore likely improve estimates of growth in the face of this stochasticity, whereas sampling later once the population has reached a sufficient size to grow in a more deterministic manner will likely provide little additional information. Samples collected during the early stages of an epidemic are therefore likely to be most valuable for estimating other epidemiological parameters related to population growth such as $[eqn]$ or the relative fitness (i.e. selection coefficient) of emerging variants (de Silva et al. 2012; Inward et al. 2022).

Second, the transmission tree MDP reveals that sampling intermediate fractions of infected individuals can be least optimal when the goal is to minimize the transmission distance or number of intervening infections between sampled individuals. This main result appears to hold regardless of the exact assumptions about population dynamics (constant versus doubling population sizes) or how transmission distances are weighted in the reward function. While this result may seem intuitive in hindsight, it highlights a potential “sunk cost fallacy” in genomic epidemiology. It may be tempting to sample more if we have already invested heavily in sampling, but increased sampling may ultimately yield diminishing returns, especially if we are starting from low sampling fractions, which is often the case in genomic epidemiology. However, both of the reward functions we consider increase quadratically with sample size, whereas costs increase linearly due a constant cost per sample. The constant cost assumption may be unrealistic in some settings, especially if some individuals are more difficult to sample or some samples are technically more difficult to sequence. In this case, sampling the most difficult individuals will be disproportionally costly, shifting the optimal strategy to lower sampling fractions. It is therefore worth bearing in mind that the optimal strategy can be quite sensitive to the assumed reward and cost functions, which will generally need to be tailored to each specific application.

Third, the structured coalescent MDP provides insight into how genomic sampling can be optimized to infer migration rates. The ability to infer movement is often one of the most useful applications of genomic data, as such phylogeographic approaches can shed light on the ancestry of a population or, for an infectious pathogen, possible sources of transmission such as the fraction of infections attributable to given a subpopulation (Volz et al. 2013). Reassuringly, relatively low sampling fractions across populations are often optimal even if migration is highly asymmetric. These results support other recent work demonstrating that accurate estimates of migration rates can be obtained under structured coalescent models, even when sample sizes are small and biased between populations (Layan et al. 2023).

While our simple MDPs provide insights into common sampling problems, applying our MDP framework to more complex scenarios will undoubtedly come with challenges. One major challenge lies in designing reward functions that can meaningfully capture and quantify the information gained from sampling while remaining feasible to compute. When the objective is to minimize error or uncertainty about estimated parameters, rewards based on metrics such as Fisher information are a sensible and theoretically well-motivated choice, as Fisher information can be used to predict the accuracy and precision of estimates prior to sampling (Efron and Hinkley 1978; Ly et al. 2017). In other cases, designing reward functions may be much more challenging. For instance, we only realized after much trial and error that a relatively simple reward based on the number of coalescent events occurring between populations could be used as a good proxy for information gained about migration rates. We subsequently noticed that the error in the estimated migration rates does not decline linearly with the number of migration rates, but with the square root of the number of migration events. Interestingly, this appears to be a common feature of inference under coalescent models where the log likelihood of the tree generally takes the form of a sum of the log likelihood of individual events of some countable type (Parag and Pybus 2019). For example, Fisher information about population sizes increases linearly with the number of coalescent events and thus sample size. This suggests to us that the number of events in a phylogeny can serve as a proxy to quantify information even when computing Fisher information is intractable. However, since the predicted error in the maximum likelihood estimates scales with the square root number of events, one may generally want to square-root transform the number of events (or Fisher information) if interest lies primarily in reducing estimation error. A square root transformation has the additional benefit of making the net reward function more convex in the situation where both rewards and costs increase linearly with the sample size.

Another challenge concerns designing reward functions to capture the informativeness of genetic data about the ancestral or phylogenetic relationships among individuals upon which downstream demographic inferences are based. Our current framework circumvents this challenge by integrating over all possible ancestral relationships when computing the expected amount of information gained about demographic variables. However, the genetic variation present in actual sequence data may be too little or too great to reconstruct these phylogenetic relationships. Fortunately, extending our MDP framework to quantify the information gained about the underlying genealogy should be relatively straightforward. For example, the expected genetic distance between samples could be computed by multiplying the mutation rate by the expected distribution of divergence times between samples. The reward function could then be reweighted according to the expected number of mutations or informative sites (i.e. SNPs) separating sampled individuals. Such an approach could first quantify how much information is expected to be gained about the underlying phylogenetic tree and then quantify how much information is expected to be gained about demographic variables being inferred from the tree.

In the long-term, MDPs will likely need to be combined with reinforcement learning (RL) methods to realize their full potential in optimizing genomic sampling. In RL, a computational agent representing a decision maker learns how to act to maximize long-term rewards by interacting with its environment (Sutton and Barto 2018). While MDPs provide the theoretical basis for much of RL, modern RL algorithms overcome many of the challenges inherent to learning optimal policies from MDPs in complex environments. Perhaps most importantly, RL algorithms do not require perfect knowledge of the underlying dynamics of the system (i.e. the Markov process) since they learn to act optimally based purely on experience. In settings where there may be no simple mapping between actions and rewards, RL can be further combined with Q-learning methods to learn how to predict q-values or the expected reward received from taking a given action from a given state (Li 2017; Clifton and Laber 2020). These Q-learning approaches could then be used to optimize genomic sampling by training the agent to predict what new information will be gained from sampling and then acting accordingly.

Going forward, the simple MDPs presented here can be extended to address many of the challenges confronting genomic sampling design in real-world applications. Potential applications include population dynamics beyond exponential growth, which may require adaptive sampling through time (Cappello and Palacios 2022). In genomic epidemiology, the transmission tree MDP could be extended to include unobserved or asymptomatic infections that would require optimizing testing strategies alongside optimizing genomic sequencing strategies. In phylogeography, our two-population structured coalescent model could be extended to consider the movement of individuals through many subpopulations or across complex landscapes. Finally, one could extend the MDP framework to consider recombination, in which case the ancestry of the sample cannot be captured by a single tree, but rather a mosaic of different local trees as caputred in an ancestral recombination graph (ARG) (Griffiths and Marjoram 1996; Lewanski et al. 2024). Such an extension could then be applied to quantify how much information can be gained from ARGs about difficult-to-estimate parameters, such as recombination rates (Stumpf and McVean 2003), and potentially disentangle the information gained about demographic processes that shape ancestry across the entire genome from information gained about evolutionary processes like selection that impact only certain regions of the genome (Stern et al. 2021; Brandt et al. 2024). In each of these scenarios, MDPs could provide analytical insights in simple cases, which could be built upon by applying RL methods to confront more complex scenarios.

Supplementary Material

iyaf244_Supplementary_Data

Bibliography65

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Alagoz O, Hsu H, Schaefer AJ, Roberts MS. 2010. Markov decision processes: a tool for sequential decision making under uncertainty. Med Decis Making. 30:474–483. 10.1177/0272989 X 09353194.20044582 PMC 3060044 · doi ↗ · pubmed ↗
2Atkinson AC . 1996. The usefulness of optimum experimental designs. J R Stat Soc Series B Stat Methodol. 58:59–76. 10.1111/j.2517-6161.1996.tb 02067.x. · doi ↗
3Attwood SW, Hill SC, Aanensen DM, Connor TR, Pybus OG. 2022. Phylogenetic and phylodynamic approaches to understanding and combating the early SARS-Co V-2 pandemic. Nat Rev Genet. 23:547–562. 10.1038/s 41576-022-00483-8.35459859 PMC 9028907 · doi ↗ · pubmed ↗
4Azouri D et al 2024. The tree reconstruction game: phylogenetic reconstruction using reinforcement learning. Mol Biol Evol. 41:msae 105. 10.1093/molbev/msae 105.38829798 PMC 11180600 · doi ↗ · pubmed ↗
5Baumdicker F et al 2022. Efficient ancestry and mutation simulation with msprime 1.0. Genetics. 220:iyab 229. 10.1093/genetics/iyab 229.34897427 PMC 9176297 · doi ↗ · pubmed ↗
6Beerli P, Felsenstein J. 1999. Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics. 152:763–773. 10.1093/genetics/152.2.763.10353916 PMC 1460627 · doi ↗ · pubmed ↗
7Bellman R . 1954. The theory of dynamic programming. Bull Am Math Soc. 60:503–515. 10.1090/bull/1954-60-06. · doi ↗
8Bellman R . 1957. A Markovian decision process. J Math Mech. 6:679–684. https://www.jstor.org/stable/24900506