A stochastic simulation-based approach to inform the relapsing mouse model study design for non-clinical assessment of tuberculosis
James Clary, Jessica K. Roberts, Debra Hanna, Alessia Tagliavini, Sylvie Sordello, Anna Upton, David Hermann, Alexander Berg

TL;DR
This paper presents a simulation-based method to optimize the design of mouse studies for testing tuberculosis treatments, reducing the number of animals needed while maintaining data quality.
Contribution
A novel stochastic simulation approach to evaluate and optimize relapsing mouse model study designs for TB drug development.
Findings
Using simulations, 28% fewer mice can be used in RMM studies without significant loss of precision.
Alternative study designs maintain low bias and ±1–2 week precision for estimating T95 for most regimens.
The method supports improved animal stewardship while generating reliable data for decision-making.
Abstract
The development of new regimens to treat tuberculosis (TB), the disease caused by Mycobacterium tuberculosis, is critical to improving patient outcomes and decreasing global infectious disease mortality. Early evaluation of candidate regimens in non-clinical models of TB, such as the relapsing mouse model (RMM), remains an important step in prioritizing the most efficacious regimens for further clinical evaluation. Although RMM studies may be informative, they are also animal-, labor-, and time-intensive to complete and represent a significant investment in time and resources during non-clinical development. Given the strong pipeline of regimens in development, identification of “leaner” RMM studies may have a significant impact on resource utilization, and hence we compared alternative study designs to identify study attributes that can be modified to improve resource use, particularly…

Table 1

| Design | Description | Number of regimens | Treatment durations (months) | Mice per duration | Total mice |
|---|---|---|---|---|---|
| Baseline | Original design | 14 | 0.5, 1, 1.5, 2, 2.5, 3 | 6 | 504 |
| “Ultimate” | High-performance benchmark | 14 | 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5 | 30 | 4,200 |
| Proposed 1 | Omit 2-week duration | 14 | 1, 1.5, 2, 2.5, 3 | 6 | 420 |
| Proposed 2 | Remove three study arms (HRZE, Regimen 4, and Regimen 11) | 11 | 0.5, 1, 1.5, 2, 2.5, 3 | 6 | 396 |
| Proposed 3 | Reduce to 5 mice per duration | 14 | 0.5, 1, 1.5, 2, 2.5, 3 | 5 | 420 |
| Proposed 4 | Reduce to three mice at 2 weeks and five mice at other durations | 14 | 0.5, 1, 1.5, 2, 2.5, 3 | 5 | 392 |
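
The totals in Table 1 follow from multiplying the number of regimens by the mice allocated across treatment durations. The short Python sketch below reproduces that arithmetic; the dictionary layout and the `total_mice` helper are illustrative names, not part of the original analysis (which was carried out in R).

```python
# Designs from Table 1: (number of regimens, mice allocated at each treatment duration).
designs = {
    "Baseline":   (14, [6] * 6),
    "Ultimate":   (14, [30] * 10),
    "Proposed 1": (14, [6] * 5),        # 0.5-month (2-week) duration omitted
    "Proposed 2": (11, [6] * 6),        # three study arms removed
    "Proposed 3": (14, [5] * 6),
    "Proposed 4": (14, [3] + [5] * 5),  # 3 mice at 0.5 months, 5 at the other durations
}


def total_mice(n_regimens, mice_per_duration):
    """Total animals = regimens x mice summed over all treatment durations."""
    return n_regimens * sum(mice_per_duration)


totals = {name: total_mice(*spec) for name, spec in designs.items()}
baseline_total = totals["Baseline"]
for name, total in totals.items():
    change = 100 * (total - baseline_total) / baseline_total
    print(f"{name:<10} {total:>5} mice ({change:+.1f}% vs. Baseline)")
```

The relative changes for Proposed 1 to 4 fall between roughly 17% and 22%, in line with the range quoted in the Results.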

Table 2

| Regimen | Gamma (unitless) | T50 (months) | T95 (months) |
|---|---|---|---|
| BPaMZ | 2.34 | 1.63 | 2.16 |
| HRZE | 2.13 | 4.00 | 5.69 |
| Regimen 1 | 1.92 | 2.99 | 4.61 |
| Regimen 2 | 2.03 | 2.73 | 4.01 |
| Regimen 3 | 2.10 | 3.04 | 4.37 |
| Regimen 4 | 2.17 | 1.92 | 2.68 |
| Regimen 5 | 2.44 | 2.56 | 3.30 |
| Regimen 6 | 2.17 | 1.70 | 2.38 |
| Regimen 7 | 2.03 | 2.03 | 2.99 |
| Regimen 8 | 1.68 | 2.16 | 3.74 |
| Regimen 9 | 1.50 | 2.22 | 4.28 |
| Regimen 10 | 2.56 | 1.72 | 2.16 |
| Regimen 11 | 1.51 | 3.26 | 6.26 |
| Regimen 12 | 1.92 | 3.21 | 4.94 |
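
As a rough check on how the three columns of Table 2 relate, the sketch below derives T95 from T50 and the shape parameter for a sigmoidal cure curve with a maximum of 100% cure. One assumption is made loudly here: the tabulated gamma is treated as the natural logarithm of the Hill exponent. This is not stated in the text; it is an inference that happens to reproduce the tabulated T95 values of Table 2 to within rounding. The function name `t95_from_params` is hypothetical.

```python
import math


def t95_from_params(t50, gamma_log, cure_prob=0.95):
    """Treatment duration at which cure probability reaches cure_prob.

    Assumes cure = T**h / (T50**h + T**h), so T = T50 * odds**(1/h).
    ASSUMPTION: h = exp(gamma_log), i.e. the tabulated gamma is on a log scale.
    """
    hill = math.exp(gamma_log)
    odds = cure_prob / (1.0 - cure_prob)  # 19 for 95% cure
    return t50 * odds ** (1.0 / hill)


# (regimen, gamma, T50, tabulated T95) taken from Table 2
for name, gamma, t50, t95_table in [
    ("BPaMZ", 2.34, 1.63, 2.16),
    ("HRZE", 2.13, 4.00, 5.69),
    ("Regimen 11", 1.51, 3.26, 6.26),
]:
    print(name, round(t95_from_params(t50, gamma), 2), "vs. tabulated", t95_table)
```

Small discrepancies in the second decimal place are expected because the inputs are themselves rounded to two decimals.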

Table 3

| Name | Description | Treatment durations (months) | Mice per duration | Total mice |
|---|---|---|---|---|
| Baseline | Original design, 36 mice per arm | 0.5, 1, 1.5, 2, 2.5, 3 | 6 | 504 |
| Proposed 5 | 26 mice per arm, middle durations favored | 0.5, 1, 1.5, 2, 2.5, 3 | 5 mice at 1, 1.5, 2, and 2.5 months; | 363 |
| | HRZE arm only: 25 mice | 2, 3, 4 | 5 mice at 2 months; | |
| Proposed 6 | 37 mice per arm, middle and later durations favored | 0.5, 1, 1.5, 2, 2.5, 3 | 8 mice at 2, 2.5, and 3 months; | 509 |
| | HRZE arm only: 28 mice | 2, 3, 4 | 8 mice at 2 months; | |
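
The study totals in Table 3 can be reproduced from the per-arm counts under one assumption that is not spelled out in the table itself: each study has 14 arms, of which 13 use the general per-arm allocation and one (HRZE) uses its own allocation. The helper name `study_total` is illustrative.

```python
def study_total(mice_per_arm, hrze_mice, n_arms=14):
    """Total mice, assuming one HRZE arm plus (n_arms - 1) arms sharing one allocation."""
    return (n_arms - 1) * mice_per_arm + hrze_mice


baseline = study_total(36, 36)   # all arms use 36 mice in the Baseline design
proposed5 = study_total(26, 25)
proposed6 = study_total(37, 28)

reduction5 = 100 * (baseline - proposed5) / baseline
print(baseline, proposed5, proposed6)            # 504 363 509
print(f"Proposed 5 reduction: {reduction5:.0f}%")  # 28%
```

Under this assumption the totals match Table 3, and the Proposed 5 saving corresponds to the 28% reduction quoted in the text.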

Table 4

| Regimen | Gamma (unitless) | T50 (months) | T95 (months) |
|---|---|---|---|
| BPaMZ | 2.03 | 1.48 | 1.97 |
| HRZE | 1.95 | 4.02 | 6.11 |
| Regimen 1 | 2.33 | 1.18 | 1.57 |
| Regimen 2/3 | 2.00 | 2.28 | 3.40 |
| Regimen 4/12 | 2.00 | 1.89 | 2.64 |
| Regimen 5 | 2.17 | 1.61 | 2.05 |
| Regimen 6 | 2.00 | 1.67 | 2.34 |
| Regimen 7 | 1.60 | 1.87 | 2.79 |
| Regimen 8 | 1.42 | 1.20 | 2.18 |
| Regimen 9 | 2.52 | 1.26 | 2.57 |
| Regimen 10 | 2.33 | 1.69 | 2.14 |
| Regimen 11 | 2.17 | 1.34 | 1.78 |

Table 5

| Regimen | “True” T95 rank order | “True” T95 (months) | T95 estimate (months), Baseline design | T95 estimate (months), Proposed design 5 | T95 estimate (months), Proposed design 6 | T95 bias (months), Baseline design | T95 bias (months), Proposed design 5 | T95 bias (months), Proposed design 6 |
|---|---|---|---|---|---|---|---|---|
| Regimen 1 | 1 | 1.57 | 1.70 (1.36, 1.97) | 1.68 (1.4, 1.92) | 1.68 (1.39, 1.92) | 0.13 (−0.21, 0.40) | 0.11 (−0.17, 0.35) | 0.11 (−0.18, 0.35) |
| Regimen 11 | 2 | 1.78 | 1.96 (1.71, 2.31) | 1.93 (1.63, 2.27) | 1.91 (1.66, 2.2) | 0.17 (−0.08, 0.53) | 0.15 (−0.15, 0.49) | 0.13 (−0.12, 0.42) |
| BPaMZ | 3 | 1.97 | 2.02 (1.87, 2.13) | 2.00 (1.84, 2.12) | 2.00 (1.86, 2.14) | 0.05 (−0.09, 0.16) | 0.03 (−0.13, 0.15) | 0.03 (−0.11, 0.17) |
| Regimen 5 | 4 | 2.05 | 2.35 (1.97, 2.7) | 2.33 (1.95, 2.64) | 2.27 (1.94, 2.56) | 0.30 (−0.08, 0.65) | 0.28 (−0.10, 0.59) | 0.23 (−0.11, 0.52) |
| Regimen 10 | 5 | 2.14 | 2.43 (2.11, 2.77) | 2.39 (2.12, 2.67) | 2.33 (2.06, 2.65) | 0.29 (−0.03, 0.63) | 0.24 (−0.02, 0.53) | 0.19 (−0.08, 0.50) |
| Regimen 8 | 6 | 2.18 | 1.95 (1.59, 2.32) | 1.87 (1.49, 2.28) | 1.93 (1.51, 2.34) | −0.23 (−0.58, 0.14) | −0.31 (−0.68, 0.11) | −0.25 (−0.67, 0.16) |
| Regimen 6 | 7 | 2.34 | 2.44 (2.00, 2.72) | 2.4 (1.96, 2.71) | 2.37 (1.95, 2.78) | 0.10 (−0.33, 0.39) | 0.07 (−0.38, 0.38) | 0.04 (−0.39, 0.44) |
| Regimen 9 | 8 | 2.57 | 1.95 (1.53, 2.39) | 1.91 (1.46, 2.41) | 1.95 (1.57, 2.36) | −0.62 (−1.04, −0.18) | −0.66 (−1.11, −0.16) | −0.62 (−1.00, −0.21) |
| Regimen 4 | 9/10 | 2.64 | 2.74 (2.40, 3.12) | 2.74 (2.35, 3.13) | 2.7 (2.24, 3.00) | 0.10 (−0.25, 0.47) | 0.09 (−0.29, 0.48) | 0.06 (−0.40, 0.35) |
| Regimen 12 | 9/10 | 2.64 | 2.78 (2.44, 3.24) | 2.77 (2.35, 3.18) | 2.74 (2.36, 3.09) | 0.14 (−0.20, 0.59) | 0.13 (−0.29, 0.53) | 0.09 (−0.28, 0.44) |
| Regimen 7 | 11 | 2.79 | 2.71 (2.29, 3.14) | 2.68 (2.30, 3.03) | 2.68 (2.27, 3.03) | −0.08 (−0.50, 0.35) | −0.11 (−0.49, 0.24) | −0.11 (−0.52, 0.24) |
| Regimen 2 | 12/13 | 3.40 | 3.33 (2.89, 3.77) | 3.20 (2.70, 3.73) | 3.25 (2.88, 3.63) | −0.07 (−0.51, 0.36) | −0.20 (−0.70, 0.33) | −0.15 (−0.53, 0.23) |
| Regimen 3 | 12/13 | 3.40 | 3.29 (2.90, 3.77) | 3.31 (2.83, 3.69) | 3.23 (2.85, 3.64) | −0.11 (−0.51, 0.37) | −0.09 (−0.57, 0.29) | −0.17 (−0.55, 0.23) |
| HRZE | 14 | 6.11 | 6.14 (6.05, 6.19) | 6.14 (6.05, 6.21) | 6.15 (6.04, 6.21) | 0.03 (−0.06, 0.08) | 0.03 (−0.06, 0.10) | 0.04 (−0.07, 0.10) |
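
The bias columns of Table 5 appear to be the estimate minus the “true” value. The sketch below recomputes the Baseline-design bias for a few rows and converts it to weeks using the convention from the bias plots (0.25 months treated as 1 week). The variable names are illustrative, and some other rows of the table can differ by 0.01 because the published values are rounded.

```python
WEEKS_PER_MONTH = 4  # the bias plots equate 0.25 months with 1 week

# (regimen, "true" T95, median T95 estimate under the Baseline design), from Table 5
rows = [
    ("BPaMZ", 1.97, 2.02),
    ("Regimen 5", 2.05, 2.35),
    ("Regimen 9", 2.57, 1.95),
    ("HRZE", 6.11, 6.14),
]

# Bias in months, then the same bias expressed in weeks.
bias_months = {name: round(est - true, 2) for name, true, est in rows}
bias_weeks = {name: round(b * WEEKS_PER_MONTH, 1) for name, b in bias_months.items()}

for name in bias_months:
    print(f"{name:<10} {bias_months[name]:+.2f} months  ({bias_weeks[name]:+.1f} weeks)")
```

Regimen 9 comes out at about −0.62 months, i.e. roughly −2.5 weeks, consistent with the negative bias of 2 to 3 weeks described for this regimen.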

Funding: Bill and Melinda Gates Foundation (http://dx.doi.org/10.13039/100000865)
INTRODUCTION
Tuberculosis (TB), the disease caused by Mycobacterium tuberculosis, remains a global health challenge, affecting approximately 10.8 million individuals and causing approximately 1.25 million deaths in 2023. While this is an improvement over previous years, which were heavily impacted by the COVID-19 pandemic, global progress in combating TB remains slow. Although WHO guidelines for drug-susceptible TB have been updated to allow for a 4-month daily regimen of high-dose rifapentine, moxifloxacin, isoniazid, and pyrazinamide (RIPE), the previous standard-of-care regimen of isoniazid, rifampin, pyrazinamide, and ethambutol (HRZE; a 2-month combination of HRZE followed by 4 months of isoniazid and rifampin) remains first-line therapy (1). This highlights both the need for additional research into the development of new TB regimens and the potential for treatment shortening through new combination regimens, as the 4-month RIPE regimen can provide adequate treatment in a subset of patients while shortening treatment by 2 months (2). A total treatment duration of 3–4 months or less has been identified by the WHO as the minimal requirement for new regimens targeting drug-susceptible TB, with the optimal duration targeted at 2 months or less for the development of a pan-TB regimen (i.e., first-line therapy for any active TB, regardless of strain) (3).
A key consideration in the development of a shorter-duration pan-TB regimen is that any such regimen is expected to include one or more novel drugs that are combined and dose-optimized for maximal efficacy and safety (3). Given the strong pipeline of new compounds, efficiently testing combinations early in development is critical to identify those that are most promising as candidate pan-TB regimens. Non-clinical studies, such as the BALB/c relapsing mouse model (RMM) of pulmonary TB, are important tools to help inform potential regimen selection to expedite and prioritize promising agents for the clinic. The RMM has been extensively used to guide the selection of candidate regimens for further development by quantifying the proportion of treated mice exhibiting relapse following administration of selected drug combinations (4). Our recent work has demonstrated that the utility of the RMM in assessing relative regimen performance is greatly enhanced through the use of a model-based meta-analysis approach, whereby data from multiple studies are pooled and analyzed simultaneously. Specifically, we employed a mixed-effects logistic regression approach to analyze data from 28 RMM studies to determine the treatment duration-dependent relapse probability for a given regimen (5). This method enabled the calculation of key metrics, including the treatment duration required to reach relapse probabilities of interest (e.g., time to 5% relapse probability, or inversely, the time to 95% probability of cure), while also quantifying and adjusting for the impact of study-level covariates on treatment response. Importantly, as the magnitude of inter-study variability was quantified, our approach allows for “apples-to-apples” comparison of all regimens across studies.
Although model-based meta-analysis is readily extendible to include emerging data from new and ongoing studies, an important caveat is that reliable estimation of treatment effects depends upon the size and informativeness of the underlying data set. The ability of the model to estimate relative regimen performance relied heavily on data pooling to provide sufficient data for analysis. Even with data pooling, the available data for certain regimens (e.g., only from a single-treatment duration) were too sparse to precisely estimate the key metrics of interest. These data limitations in part reflect that RMM studies have historically been designed to favor relatively large numbers of animals to enable statistical comparisons across groups at a limited set of treatment durations (6). Such studies, although statistically robust, were not designed to generate data that would be maximally informative for estimation of model-based parameters. Rather, when analyzed using a longitudinal model-based analysis, RMM studies should generate data suitable for accurate estimation of each regimen’s treatment duration vs. relapse probability curve. Key study attributes that may impact the data generated for estimation purposes include overall sample size, total mice per regimen, number and distribution of mice by treatment duration (timepoint), number of regimen arms in the study, inclusion of historical controls, range of regimen response (efficacy), and other study-level covariates that may influence treatment response (e.g., inoculum size) (5).
To understand the relationships between study attributes and data informativeness for model-based analysis, we have performed an in silico model-based evaluation of RMM study design. This builds upon our previous work by using the mixed-effects modeling framework to simulate the outcomes of virtual RMM studies and then “re-estimating” the simulated data using an updated model-based approach to generate measures of bias and precision for model-based parameters. By comparing the results of the simulation/re-estimation outputs across simulations, the impact of selected study attributes (i.e., number of mice per arm, number of mice per timepoint, and regimen selection/efficacy) was investigated to inform potential modifications to RMM study design. The overall goal was to identify study designs that would not only generate more informative data for model-based analysis but would also remain logistically feasible, cost-effective, and promote minimal animal use.
RESULTS
Comparative performance of RMM study designs
Simulation Round 1
An initial round of simulations (Simulation Round 1) compared a baseline design with a set of proposed alternative designs and a high-performance benchmark (viz. “ultimate”) design (Table 1). All of the proposed designs, including the baseline design, already represent a departure from the typical historical design in that they allocate the total number of animals over a greater number of timepoints; for example, in the Baseline design, there are six mice assigned to each of six treatment durations (36 mice per arm), whereas in a historical design (see e.g., Tasneen et al. [7]), there would be half as many treatment durations with more than two times the number of mice evaluated at each. This was done purposefully to ensure that sufficient timepoints would be available to enable model-based estimation of the cure probability curve, and because such designs are already being implemented in practice (8). Twelve hypothetical regimens were simulated with intermediate efficacy relative to the two control arms (BPaMZ [bedaquiline, pretomanid, moxifloxacin, and pyrazinamide] and HRZE), which were also included. The simulated number of regimens was based on proposed RMM studies in development at the time of the simulation work and is representative of a relatively large RMM study. Corresponding model parameter values for all regimens are shown in Table 2, and the simulated (“true”) cure probability versus treatment duration curves are shown in Fig. 1. It should be noted that the regimens depicted in Fig. 1 were fully hypothetical as they were not based on any regimens that had been studied as of the time of the analysis, but were generated to represent plausible profiles for de novo regimens to be evaluated in future RMM studies.
Fig. 1. “True” cure probability versus treatment duration profiles for regimens used in Simulation Round 1. BPaMZ and HRZE (dotted lines) were control arms for the studies. All others were hypothetical regimens intended to represent plausible profiles. A dashed black line indicates 95% cure probability.
Overall, the baseline study design and proposed alternatives performed similarly across regimens, with no marked degradation in estimation of time to 95% probability of cure (T_95_, equivalent to the time to 5% probability of relapse) for any of the latter designs despite decreasing the number of mice by 17%–22% or, in “Proposed 2,” also decreasing the number of regimens in the study by 3. In all cases, a median negative bias of up to 2 weeks was observed, as was marked variability between regimens in the range of bias seen across simulations (Fig. 2). This may also be seen in overlays of “true” relapse probability versus treatment duration profiles with the distribution of profiles estimated from the simulations (Fig. S3 to S16). The hypothetical regimens with the poorest accuracy and precision in T_95_ estimation were those that were the least efficacious; that is, those regimens where the steep part of the curve was mostly beyond the last time point (Regimens 1, 3, 11, and 12). These regimens exhibited median bias values in the range of −1 to −4 weeks and large interquartile ranges (IQR) across all study designs. In contrast, those regimens that exhibited the sharpest increase in cure probability (Regimens 4, 5, 6, 7, and 10) were relatively well estimated, with IQRs in the range of ±2 weeks.
Fig. 2. Bias plot of T95 (months) by regimen and study design from Simulation Round 1. Boxplots represent IQR, and lines represent 1.58 IQR; points are outliers outside this range. Reference lines represent 0 and ±0.25 months (±1 week) bias.
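
The summaries shown in the bias plots (median bias, IQR, and outer percentiles across replicates) can be sketched as follows. This is a minimal illustration using Python's standard library, not the authors' R code. The replicate estimates are synthetic draws from a normal distribution with made-up parameters, and `summarize_bias` is a hypothetical helper name.

```python
import random
import statistics


def summarize_bias(estimates, true_value):
    """Median, IQR, and 5th/95th percentiles of (estimate - true) across replicates."""
    biases = [e - true_value for e in estimates]
    q1, median, q3 = statistics.quantiles(biases, n=4)
    ventiles = statistics.quantiles(biases, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    return {"median": median, "iqr": (q1, q3), "p5_p95": (ventiles[0], ventiles[-1])}


# Synthetic example only: 500 replicate T95 estimates for a regimen whose
# "true" T95 is 2.57 months and whose estimates are centered too low.
rng = random.Random(1)
true_t95 = 2.57
replicate_estimates = [rng.gauss(1.95, 0.25) for _ in range(500)]

summary = summarize_bias(replicate_estimates, true_t95)
print({key: summary[key] for key in ("median", "iqr", "p5_p95")})
```

A negative median with an IQR lying entirely below zero is the pattern the text describes for regimens whose T95 is systematically underestimated.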
All study designs were able to reliably estimate the T_95_ value for the BPaMZ control arm, which was the most efficacious regimen with a similarly steep profile. Although the BPaMZ estimation was also supported by experimental data included in the analysis data set during re-estimation, in the case of the HRZE control arm (ranked 13 out of the 14 regimens in terms of efficacy), the inclusion of historical data was not sufficient to overcome the bias associated with having insufficient data points in the upper portion of the cure probability vs. treatment duration curve (viz., beyond 4 months). Furthermore, omission of the HRZE control in “Proposed 2” did not result in worse performance vs. the other designs, which included both controls. This suggests that the inclusion of more than one control arm may not be required to anchor the model to inform inter-study variability estimates, though it is acknowledged that within-study comparisons to the HRZE historical standard-of-care regimen may be of interest to provide confidence in study results.
Of note, while minimal differences were observed between the baseline and proposed alternatives, marked differences were observed when compared to the “ultimate” design. This implausibly large design, which included four additional treatment durations beyond 3 months and 5–6 times the number of mice per duration, was included to represent a high-end benchmark. In other words, the ultimate design represents the upper limit of what could likely be achieved if there were no logistical or cost constraints. Only the “ultimate” design was able to capture profiles where most of the curve is after 3 months, such as Regimen 12 (Fig. S16).
Simulation Round 2
The second round of simulations consisted of a focused assessment of the baseline design versus two additional proposed alternative designs (“Proposed 5” and “Proposed 6”) that were selected based on the results from the first round of simulations (Table 3). As it was of specific interest to evaluate more potent anti-TB regimens in the RMM (i.e., with similar performance to BPaMZ), a new set of hypothetical regimens was selected with T_95_ values between 1.75 and 3.5 months (Table 4; Fig. 3). It is noted that the parameters used for simulation of BPaMZ and HRZE profiles, while similar, were not identical between Simulation Rounds 1 and 2 (compare values in Tables 2 and 4). This reflects the iterative nature of the meta-analysis approach used for RMM data modeling, as data from additional studies featuring these regimens that became available were incorporated into the data set and the model re-estimated between simulation rounds.
Fig. 3. “True” cure probability versus treatment duration profiles for regimens used in Simulation Round 2. BPaMZ and HRZE (dotted lines) were control arms for the studies. All others were hypothetical regimens intended to represent plausible profiles. A dashed black line indicates 95% cure probability.
As in the first round of simulations, the original and proposed designs performed similarly across regimens for T_95_ estimation (Fig. 4; Table 5), although smaller bias and greater precision were observed for all hypothetical regimens in Simulation Round 2. These improvements are attributed in part to regimen and timepoint selection, as T_95_ values fell within the range of timepoints for all but two hypothetical regimens in the second round. Additionally, the inclusion of data from additional studies and a change in estimation approach from maximum likelihood estimation to Bayesian analysis using Markov Chain Monte Carlo (MCMC) in Simulation Round 2 also improved bias and precision. This is evidenced by the improvements seen for the BPaMZ and HRZE control arms when comparing the results for the Baseline design across simulation rounds (refer to Fig. 2 and 4 for comparison). It is noted that the latter analysis approach is consistent with current methods employed by our group for RMM data analysis and illustrates how model-based approaches may be improved as analyses are repeatedly updated to incorporate and adapt to emerging data.
Fig. 4. Bias plot of T95 (months) by regimen and study design from Simulation Round 2. Boxplots represent IQR, and lines represent 1.58 IQR; points are outliers outside this range. Reference lines represent 0 and ±0.25 months (±1 week) bias.
In Simulation Round 2, most regimens showed a median bias within ±1 week with an IQR of approximately 1 week. Regimens with a bias outside this range included Regimens 5 and 10, which showed a median positive bias at or above +1 week for the Baseline and Proposed 5 designs. These regimens exhibited similar T_95_ values of approximately 2.1 months and “steep” profiles, as indicated by the small difference between T_95_ and the midpoint of the curve given by the T_50_ estimate (0.44 and 0.45 months for Regimen 5 and Regimen 10, respectively). Other “steep” hypothetical regimens that had a less than 0.5-month difference between the midpoint and T_95_ included Regimens 1 and 11, which also showed a small positive bias. In contrast, Regimen 9, which was the “shallowest” hypothetical regimen (T_95_ = 2.57 months, time between T_50_ and T_95_ = 1.31 months), had a negative bias between −2 and −3 weeks and an IQR falling completely below −1.5 weeks for all study designs. While this suggests that both the shape of the profile and the overall efficacy may influence the direction and magnitude of bias, the IQR is narrow, and in general the overall bias in model estimates remains small relative to the treatment duration required to achieve high cure rates (i.e., median absolute bias ranging from 0.03 to 0.66 months vs. T_95_ values of 1.6 to 3.4 months). When viewed as a forest plot, a typical way of comparing regimen performance (Fig. 5), there is generally good agreement between the “true” value and model estimates, with the former generally lying within the 5th and 95th percentiles of T_95_ values estimated across replicates. Although some differences in T_95_-based regimen rank order are indicated (Table S1), this was expected given that T_95_ values for the most efficacious hypothetical regimens differ only on the order of days, whereas bias and precision estimates are on the order of weeks.
For example, those that were the most incorrectly ranked (i.e., had a median rank that was more than one position different from their “true” rank), Regimens 5, 8, 9, and 10, all had “true” T_95_ values within an approximately 2-week range. In this situation, rank order assessment is challenging and can be particularly misleading as minor differences in estimation can result in ranking errors even though the magnitude of the estimation bias is small. However, as differences between “true” and observed rank order were seen with all proposed study designs, the simulations do not suggest a clear advantage of one design vs. another in terms of regimen rank order assessment.
Fig. 5. Comparison of “true” and estimated T95 values by regimen and study design from Simulation Round 2. Regimens are organized in descending rank order from least to most efficacious regimen based on the “true” value (red point). The black point (range) represents the median (5th and 95th percentiles) of T95 estimates across simulation replicates.
DISCUSSION
A simulation/re-estimation approach was successfully applied to compare RMM study designs in their relative performance for estimating key metrics of interest (viz., T_95_); although this is a widely used approach, we note that alternative approaches (e.g., Fisher Information Matrix-based approaches) may also be suitable for this type of analysis and, in the future, may be considered for further refinement. In general, the original and proposed alternative designs evaluated in this simulation study showed similar performance to each other for hypothetical and control regimens, with low bias for most regimens and a precision for T_95_ estimation within ±1 to 2 weeks. This is despite differences in the total number of mice, timepoints, mice per timepoint, and/or number of regimens. The only significant improvements in the simulation study were seen with the “ultimate” design (N = 4,200 mice) included in the first simulation round, a completely unrealistic design that was included to benchmark the “best” possible performance. While the results for this design suggest that further improvements may be possible for regimens where the T_95_ value lies beyond the range of treatment durations evaluated in the study, it is noted that such regimens would likely be less efficacious and therefore considered lower priority. This undercuts the return on investment in more treatment durations and/or mice per treatment duration, especially at the level indicated by such an implausibly large study design. Of the other designs, although some improvement was suggested in the second simulation round with “Proposed 6” (N = 509 mice, similar in size to the Baseline design), which showed smaller bias for “steep” regimens (i.e., Regimens 5 and 10), the improvement was minor compared to “Proposed 5” (N = 363 mice, the smallest design evaluated).
This is important in that animal stewardship continues to be an important criterion in the development of research programs (9, 10), especially when large numbers of animals are planned for terminal sacrifice in critical non-clinical experiments such as the RMM. In this simulation study, we demonstrated that by using a model-based analysis approach and tuning the allocation of mice across more informative timepoints, a large 28% reduction in the number of mice (i.e., Proposed Design 5 versus the Baseline Design) can be achieved with minimal impact on T_95_ estimation. Moreover, these and additional (unpublished) simulations indicate that further reduction in animal usage could be achieved through other selected modifications (e.g., decreasing the numbers allocated to control arms or dropping one of the control arms from the design altogether). Taken as a whole, the results presented herein demonstrate that “leaner” RMM study designs that are more cost-effective and potentially more logistically feasible can significantly decrease animal use while remaining highly informative for decision-making.
A key assumption in this comparative assessment is that all RMM studies conducted using the designs outlined herein would be analyzed using a longitudinal model-based approach. This methodology “fits” the relapse proportions at each treatment duration into a smooth curve for each regimen and treats the entire study as an experimental group for analysis. Such regression-based approaches are focused on obtaining reliable parameter estimates for cross-regimen comparison as compared to historical methods of analysis focused on statistical analysis to determine statistically significant differences in relapse proportions at specific timepoints. As the latter approach requires significantly larger sample sizes to obtain statistical significance (6), the proposed study designs would not be sufficiently powered for such analyses. In addition, the modeling approach used for estimation assumes a shared structure for all regimens (i.e., that all regimens follow the same sigmoidal relapse curve). While this two-parameter structural model represents the most parsimonious model to describe the relapse probability vs. treatment duration profiles and has been valid in analyses to date (1–3), this assumption should be reviewed as additional data are generated.
Another consideration is that the specific approach utilized by our team is a model-based meta-analysis, which utilizes historical RMM study data to anchor model estimates and uses data from control arms as a bridge between the new and previous studies. The benefits of this method are that large data sets help to improve parameter estimates for fixed effects (i.e., treatment regimen and covariate parameters) and random effects (i.e., inter-study variability parameters), and that emerging data are analyzed in the context of previous RMM studies, thereby allowing for cross-study comparisons. This represents a potential limitation, as it is acknowledged that access to RMM study data sets may not be universal, and it is unlikely that data sets containing all relevant data will be available on an ongoing basis without significant investment in data management and data sharing (although it is noted that several RMM studies included in this and previous work (5) are accessible via the TB-Platform for the Aggregation of Preclinical Experiments Data [TB-APEX] database) (11). To address this limitation, rather than rely solely on historical data to stabilize parameter estimates, the MCMC Bayesian analysis methodology employed in Simulation Round 2 has the advantage of allowing the incorporation of prior information related to model parameter distributions. Although in this study non-informative “priors” were used for the hypothetical regimens, relevant information regarding likely regimen performance (i.e., parameter values), covariate effects, or random effects can be incorporated using “informative” priors and therefore help to overcome data limitations. This advantage, as well as the multiple technical improvements observed in this study (e.g., better parameter estimation, shorter run times, and improved model stability), has led to the adoption of MCMC as our standard approach for analysis of emerging RMM study data.
In summary, using a simulation-based approach, we demonstrated that alternative RMM study designs produced similar performance in calculating metrics of interest from a model-based analysis. By adjusting key design elements, including mice per treatment duration and the total number of treatment durations evaluated, one proposed study design (Proposed Design 5) was able to reduce the total number of mice by 28% while still maintaining good precision compared to the baseline design. This study design has since been implemented for multiple ongoing (unpublished results) and completed RMM studies, including that reported by Sordello et al. (4–8), successfully balancing improved animal stewardship with the generation of informative data to support the non-clinical evaluation of novel regimens.
MATERIALS AND METHODS
Statistical model
The statistical model used for estimation purposes in this simulation/re-estimation study was an extension of the model of Berg et al. (5) as implemented by Sordello et al. (9–13). The updated model featured an inverse Emax-type structure defined by two key parameters, midpoint (T_50_) and shape (γ). These parameters were estimated separately by regimen to define the typical relapse probability vs. treatment duration profiles for each regimen after adjusting for covariate effects and inter-study variability. The general model, which also includes the effect of inoculum amount (INOC) as a covariate, is described by the following equations:
where:
P_i,j,k_ is the probability of relapse for the i^th^ regimen at the j^th^ month in the k^th^ study;
B is the baseline probability of relapse prior to treatment (fixed to 1, assuming 0% cure with no treatment);
E_max_ is the maximum effect for all regimens (fixed to 1, assuming 100% cure with continued treatment);
T_j_ is the j^th^ treatment duration (months);
T_50,i,k_ is the midpoint of the curve, the treatment duration to achieve 50% relapse or cure probability for the i^th^ treatment in the k^th^ study;
γ_i,k_ is the shape parameter (Hill coefficient) for the i^th^ treatment in the k^th^ study;
INOC_k_ is the inoculum covariate (log_10_ CFU) for the k^th^ study;
the median inoculum value across all studies is 3.65 and 4.01 log_10_ CFU for Simulation Rounds 1 and 2, respectively;
η_1,k_ and η_2,k_ are random effects of the k^th^ study for T_50_ and γ, respectively, assumed to be N(0,σ^2^) with an unstructured covariance matrix; and
y_i,j,k,n_ is the relapse status (1 = relapse, 0 = cure) of the n^th^ mouse in the k^th^ study receiving the i^th^ treatment at month j.
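As an illustration of the structure defined above, the following minimal Python sketch evaluates an inverse Emax-type relapse probability with B and E_max_ fixed to 1. It assumes the common form P = B − E_max_·T^γ / (T_50_^γ + T^γ), which is consistent with the parameter definitions but is an assumption rather than the authors' exact equations. The function name and the example values (T_50_ = 2 months, γ = 3) are hypothetical, and the inoculum covariate and random effects are omitted because their functional form is not stated in the text. The analyses in this study were performed in R, NONMEM, and Stan; Python is used here only for illustration.

```python
import numpy as np

def relapse_probability(t, t50, gamma, baseline=1.0, emax=1.0):
    """Relapse probability at treatment duration t (months).

    Assumed inverse Emax-type form: P = B - Emax * t^gamma / (t50^gamma + t^gamma).
    With B = Emax = 1, P falls from 1 (no treatment) toward 0 (long treatment).
    """
    t = np.asarray(t, dtype=float)
    effect = emax * t**gamma / (t50**gamma + t**gamma)
    return baseline - effect

# Hypothetical regimen: midpoint of 2 months, shape parameter of 3.
durations = np.array([0.0, 1.0, 2.0, 3.0])
probs = relapse_probability(durations, t50=2.0, gamma=3.0)
print(probs)
```

Under this assumed form, the relapse probability equals 1 at zero treatment duration and 0.5 at T = T_50_, matching the interpretation of T_50_ as the curve midpoint.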
Simulations
Simulations were performed to compare the baseline study design relative to proposed alternative designs that were logistically feasible, highly informative, cost-effective, and minimized animal use. Simulations were performed iteratively in two rounds, with the parameters evaluated in the second simulation round informed by results from the first simulation round. Hypothetical regimens (defined by T_50_ and γ values) were simulated to represent a range of plausible cure probability versus treatment duration profiles for de novo regimens. All simulations were performed in R v.4.0.3 as implemented via RStudio Workbench v1.4.1717-3 (12). For each replicate (equivalent to one virtual RMM study), η estimates for T_50_ and γ were generated from the variance-covariance matrix of the associated σ values (0.10 and 0.12, for T_50_ and γ, respectively) using the MASS package (13) (first round) or were sampled with replacement from study-specific (post hoc estimated) values obtained from previous RMM study analyses (second round [unpublished data]; corresponding σ values were 0.16 and 0.12 for T_50_ and γ, respectively). The resulting η values were input to the model equations along with prespecified T_50_ and γ for the selected hypothetical or control regimens and an inoculum covariate value of 4.5 log_10_ CFU (representative of the inoculum used for planned studies). The corresponding study-, covariate-, and regimen-specific relapse probabilities at specific timepoints (treatment durations) were then used to simulate an outcome (relapse status; 0 or 1 for cure or relapse, respectively) for each “virtual” mouse in the study by drawing from a binomial distribution.
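The simulation steps described above can be sketched as follows. This is a minimal Python illustration, not the authors' R code: it assumes that the study-level random effects act multiplicatively on T_50_ and γ (exp(η)), that the reported σ values are standard deviations with zero covariance, and that B and E_max_ are fixed to 1. It omits the inoculum covariate. The regimen names, parameter values, treatment durations, and the figure of 15 mice per arm are hypothetical.

```python
import numpy as np

def simulate_study(regimens, durations, mice_per_arm, omega, rng):
    """Simulate relapse counts for one virtual RMM study.

    regimens: dict mapping regimen name -> (typical T50, typical gamma).
    omega: 2x2 covariance matrix of the study-level random effects (assumed).
    Returns a dict mapping regimen name -> relapse counts per treatment duration.
    """
    # One pair of random effects per study, shared by all arms in that study.
    eta = rng.multivariate_normal(mean=[0.0, 0.0], cov=omega)
    counts = {}
    for name, (t50_typ, gamma_typ) in regimens.items():
        t50 = t50_typ * np.exp(eta[0])      # assumed log-normal study effect
        gamma = gamma_typ * np.exp(eta[1])  # assumed log-normal study effect
        # Inverse Emax form with B = Emax = 1: P = T50^g / (T50^g + T^g).
        p_relapse = t50**gamma / (t50**gamma + durations**gamma)
        counts[name] = rng.binomial(n=mice_per_arm, p=p_relapse)
    return counts

rng = np.random.default_rng(2024)
durations = np.array([1.0, 2.0, 3.0, 4.0])        # months (hypothetical)
regimens = {"hypothetical_A": (1.5, 2.3),          # (T50, gamma), hypothetical
            "hypothetical_B": (2.5, 2.3)}
omega = np.diag([0.10**2, 0.12**2])                # sigma treated as SD (assumption)
relapses = simulate_study(regimens, durations, mice_per_arm=15, omega=omega, rng=rng)
print(relapses)
```

Drawing a single η pair per replicate mirrors the treatment of each replicate as one virtual study, so all arms within a replicate share the same study effect.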
Simulation Round 1
Twelve hypothetical anti-TB regimen designs were used to assess a range of sterilization rates considered plausible for the unknown drug regimens. Two existing drug regimens (BPaMZ and HRZE) were included as control regimens. Refer to Table 2 for the corresponding model parameters used for simulation of the various regimens. Six different study designs were assessed, including a baseline design corresponding to a real RMM study pending execution at the time, an “ultimate” design featuring implausibly high numbers of animals and timepoints (included to estimate the best possible performance characteristics), and four proposed alternative designs (namely Proposed 1 through 4). As preliminary simulations suggested that model convergence upon re-estimation could be low in certain simulation scenarios, a total of 1,000 replicates were simulated to ensure adequate numbers of studies with reliable parameter estimates.
Simulation Round 2
A separate set of 12 hypothetical anti-TB drug regimens was generated to compare the baseline design with two additional proposed alternative study designs (Proposed 5 and 6). BPaMZ and HRZE were retained as control arms, whereas two sets of regimens were duplicated (Regimens 2/3 and Regimens 4/12) to investigate the ability of the model to identify regimens with identical performance within the same study. Refer to Table 4 for the corresponding model parameters used for simulation of the various regimens. For each design, 200 replicates were simulated.
Model re-estimation from simulated data sets
Each simulated replicate data set was combined with available experimental RMM data to generate an estimation-ready data set for each replicate. The experimental data included those described in Berg et al. (5), data from the TB-Platform for the Aggregation of Preclinical Experiments Data (TB-APEX) database (11), and data from unpublished RMM studies. This was done to match how de novo study data are typically analyzed in practice by our group. The model was then re-estimated separately for each simulation iteration following the addition of regimen-specific T_50_ and γ parameters for the hypothetical regimens.
In the first simulation round, re-estimation was performed in NONMEM v7.3, with only models that achieved successful convergence included in the outputs. In the second simulation round, a Bayesian estimation approach (MCMC) was used, implemented in the statistical analysis software Stan via the RStan package in R (14). For MCMC analysis (four chains of N = 10,000 iterations, including 5,000 warm-up iterations each), normal distributions were used as priors for all model parameters. Weak priors were used for hypothetical regimen parameters (T_50_: mean = 1–2 months, standard deviation = 0.5 months; γ: mean = 2.3, standard deviation = 1), with informative priors for all other parameters (i.e., control regimen T_50_ and γ, inoculum covariate effect, and sigma [eta variance] parameters) based on mean and standard deviation estimates from the pooled meta-analysis data set. Goodness-of-fit plots and visual predictive checks were reviewed for all model-based fits, with MCMC-specific diagnostics also reviewed (i.e., appearance of “fuzzy caterpillars” in trace plots for all chains, absence/presence of divergent transitions, Rhat values < 1.01, and effective sample size ratios > 0.8).
For reference, visual predictive checks stratified by RMM study index are provided in Fig. S1 and S2, which show model performance relative to the historical data for the models used in Simulation Rounds 1 and 2, respectively, and demonstrate sufficient performance for estimation and simulation purposes.
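To make the Rhat criterion concrete, the sketch below computes the classic Gelman-Rubin potential scale reduction factor for one parameter from four chains of 5,000 post-warm-up draws. This is an illustrative simplification: Stan reports a refined variant of this diagnostic, so values will not match RStan output exactly. The function name and the synthetic draws are hypothetical.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Classic Gelman-Rubin Rhat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    # Mean of the within-chain variances.
    within = chains.var(axis=1, ddof=1).mean()
    # Between-chain variance, scaled by the number of draws per chain.
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    # Pooled estimate of the posterior variance.
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 5000))          # chains sampling the same distribution
stuck = mixed + np.arange(4)[:, None]       # chains centered at different values
rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(rhat_mixed, rhat_stuck)
```

Chains that explore the same distribution give a value close to 1, whereas chains centered at different values inflate the between-chain variance and push the statistic well above the 1.01 threshold.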
Comparison of designs
Following re-estimation, T_95_ (time to 95% probability of cure) values were calculated for each replicate and each individual regimen, using equation 5:
These metrics were summarized by design and compared to the true values used for simulation to assess bias and overall predictive performance of the design. Bias for each simulation replicate was calculated as shown in equation 6.
where E(T_95_)_i,j_ is the estimated T_95_ value obtained from a given simulation replicate for each regimen, i, and study design, j, with θ_i_ denoting the “true” (input) T_95_ value for each simulated regimen. For each replicate, estimated cure probability versus treatment duration curves were simulated from the model estimates to obtain a distribution for each study design to graphically compare model estimates with the true values. Finally, each replicate’s T_95_ values were ranked, and the resulting order was compared across designs to the order of the values used for simulation to assess the ability of each design to differentiate regimens. All graphical representations were generated in R using the ggplot2 package (15).
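The comparison metrics can be illustrated with a short Python sketch. It assumes the inverse Emax form with B and E_max_ fixed to 1, under which setting the cure probability to 0.95 gives T_95_ = T_50_·19^(1/γ). It also takes bias as the simple difference between estimated and true T_95_. Both are assumptions for illustration rather than the authors' equations 5 and 6, and all parameter values and "estimates" are hypothetical.

```python
import numpy as np

def t95(t50, gamma, cure=0.95):
    """Treatment duration giving the target cure probability.

    Assumes cure probability = t^gamma / (t50^gamma + t^gamma),
    so t = t50 * (cure / (1 - cure))^(1 / gamma).
    """
    odds = cure / (1.0 - cure)
    return t50 * odds ** (1.0 / gamma)

# Hypothetical "true" (input) parameters for three regimens.
true_t50 = np.array([1.0, 1.5, 2.0])
true_gamma = np.array([2.3, 2.3, 2.3])
# Hypothetical estimates from one re-estimated replicate.
est_t50 = np.array([1.1, 1.4, 2.2])
est_gamma = np.array([2.0, 2.5, 2.3])

true_t95 = t95(true_t50, true_gamma)
est_t95 = t95(est_t50, est_gamma)

# Assumed bias definition: estimate minus true value, in months.
bias = est_t95 - true_t95
# Does the replicate rank the regimens in the same order as the true values?
same_order = np.array_equal(np.argsort(est_t95), np.argsort(true_t95))
print(bias, same_order)
```

Repeating this calculation over all replicates of a design would give the distribution of bias and the proportion of replicates that recover the true regimen ranking.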
REFERENCES
- 1. World Health Organization. 2024. Global tuberculosis report 2024.
- 2. Carr W, Kurbatova E, Starks A, Goswami N, Allen L, Winston C. 2022. Interim guidance: 4-month rifapentine-moxifloxacin regimen for the treatment of drug-susceptible pulmonary tuberculosis - United States, 2022. MMWR Morb Mortal Wkly Rep 71:285–289. doi:10.15585/mmwr.mm7108a1. PMID:35202353.
- 3. World Health Organization. 2023. Target regimen profiles for tuberculosis treatment 2023 update. Geneva: World Health Organization. https://www.who.int/publications/i/item/9789240081512.
- 4. Franzblau SG, De Groote MA, Cho SH, Andries K, Nuermberger E, Orme IM, Mdluli K, Angulo-Barturen I, Dick T, Dartois V, Lenaerts AJ. 2012. Comprehensive analysis of methods used for the evaluation of compounds against Mycobacterium tuberculosis. Tuberculosis (Edinb) 92:453–488. doi:10.1016/j.tube.2012.07.003. PMID:22940006.
- 5. Berg A, Clary J, Hanna D, Nuermberger E, Lenaerts A, Ammerman N, Ramey M, Hartley D, Hermann D. 2022. Model-based meta-analysis of relapsing mouse model studies from the critical path to tuberculosis drug regimens initiative database. Antimicrob Agents Chemother 66:e0179321. doi:10.1128/AAC.01793-21. PMID:35099274. PMCID:PMC8923195.
- 6. Lenaerts AJ, Chapman PL, Orme IM. 2004. Statistical limitations to the Cornell model of latent tuberculosis infection for the study of relapse rates. Tuberculosis (Edinb) 84:361–364. doi:10.1016/j.tube.2004.03.002. PMID:15525559.
- 7. Tasneen R, Betoudji F, Tyagi S, Li S-Y, Williams K, Converse PJ, Dartois V, Yang T, Mendel CM, Mdluli KE, Nuermberger EL. 2016. Contribution of oxazolidinones to the efficacy of novel regimens containing bedaquiline and pretomanid in a mouse model of tuberculosis. Antimicrob Agents Chemother 60:270–277. doi:10.1128/AAC.01691-15. PMID:26503656. PMCID:PMC4704221.
- 8. Sordello S, Brock L, Tagliavini A, Federico D, Boulenc X, Pergher M, Claustre EH, Metcalf D, Walter ND, Robertson GT, Clary J, Berg A, Mdluli K, Hermann D, Flood D, Upton AM. 2025. A modeling-based framework to evaluate forgiveness of TB drug combinations in a BALB/c relapsing mouse model. bioRxiv. doi:10.1101/2025.08.07.668704.
