Benefits of group sequential design and sample size re-estimation for randomised controlled trials evaluating the prevention of ventilator-associated pneumonia: a simulation study informed by real world data

Holly Jackson; Julien Sauser; C. H. van Werkhoven; Stephan Harbarth; Marlieke E.A. de Kraker

PMC · DOI:10.1186/s12874-025-02681-4·November 12, 2025

Benefits of group sequential design and sample size re-estimation for randomised controlled trials evaluating the prevention of ventilator-associated pneumonia: a simulation study informed by real world data

Holly Jackson, Julien Sauser, C. H. van Werkhoven, Stephan Harbarth, Marlieke E.A. de Kraker

PDF

Open Access

TL;DR

This study shows that adaptive trial designs can make VAP prevention trials more efficient by reducing the number of patients needed.

Contribution

The paper demonstrates how group sequential and sample size re-estimation designs improve trial efficiency for VAP prevention.

Findings

01

Pocock boundaries reduce expected sample size but increase maximum sample size compared to O’Brien Fleming boundaries.

02

Sample size re-estimation maintains power better when initial prevention effect assumptions are incorrect.

03

Adaptive designs can reduce expected sample size by 9-12% in VAP prevention trials.

Abstract

Ventilator-associated pneumonia (VAP) is an important healthcare acquired infection, which is associated with high morbidity and mortality. Conducting conventional randomised controlled trials (RCTs) on VAP prevention is often challenging, due to low numbers of eligible patients and events per site, especially for pathogen-specific interventions. We explored how group sequential designs (GSD) and sample size re-estimation (SSR) trial designs could improve RCT efficiency in simulated superiority trials to prevent VAP. Simulations were informed using data from the prospective observational Hospital Network Study – Preparation for a Randomised Evaluation of anti-Pneumonia Strategies (HONEST-PREPS). We tested the impact of different GSD and SSR designs on expected sample size (considering early stopping) and maximum sample size (no early stopping). We varied the type of stopping boundary,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases4

infection Pneumonia critically ill VAP

Figures4

Click any figure to enlarge with its caption.

Comparison of Group Sequential Design (GSD) with O’Brien-Fleming boundaries (dashed) against Fixed Randomised Controlled Trial (solid) as timing of interim analysis (IA) placement changes showing (a) expected (Exp) number of events, (b) maximum (Max) number of events, (c) probability (Prob) of stopping for efficacy at the IA, and (d) Prob of stopping for futility at the IA, for five different prevention effects, for power of 80%

Comparison of Group Sequential Design (GSD) with Pocock boundaries (dashed) against Fixed Randomised Controlled Trial (solid) as timing of interim analysis (IA) placement changes showing (a) expected (Exp) number of events, (b) maximum (Max) number of events, (c) probability (Prob) of stopping for efficacy at the IA, and (d) Prob of stopping for futility at the IA, for five different prevention effects, for power of 80%

Comparison of Group Sequential Design (GSD) and Sample Size Re-estimation (SSR, utilising 5,000 simulations and an optimisation prevention effect, HR = 0.68) with O’Brien-Fleming boundaries against Fixed Randomised Controlled Trial (RCT) for (a) expected (Exp) number of events, (b) mean maximum (Max) number of events, (c) probability (Prob) of stopping for efficacy at the interim analysis (IA), (d) Prob of stopping for futility at the IA and (e) power, for a range of assumed HRs, when the true HR = 0.68, for power of 80%, with a single IA 64% through the trial

Funding1

—University of Geneva

Keywords

Randomised clinical trialAdaptive clinical trial designGroup sequential designSample size re-estimationVentilator-associated pneumonia

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNosocomial Infections in ICU · Sepsis Diagnosis and Treatment · Pneumonia and Respiratory Infections

Full text

Introduction

Emerging infectious diseases and antimicrobial resistance are growing threats, which require efficient approaches to evaluate novel preventive and therapeutic strategies. Unfortunately, the conventional randomised controlled trial (RCT) is challenging in these domains, often too few patients can be timely enrolled, resulting in an underpowered RCT and a waste of resources [1]. To increase benefit to patients, more efficient trials, such as adaptive trial designs, are needed. Although the COVID-19 pandemic has increased uptake, adaptive elements are still not routinely considered, especially in the field of infectious diseases. This is partly associated with their complexity and lack of understanding by clinicians, funders, and regulators [2].

Adaptive trial designs are more flexible than the conventional RCT, as they allow trial modifications based on pre-specified decision rules, allowing more efficient use of the collected data [3–5]. Group sequential designs (GSD) and sample size re-estimation (SSR) can be incorporated relatively easily within an RCT. In GSDs, pre-planned interim analyses (IAs) allow a trial to end early for efficacy or futility [6]. When applying a GSD: the number of IAs, their timing, and the stopping boundaries need to be pre-specified [7]. Although there exist investigations in the statistical literature to determine their optimal choices, particularly when the patient outcome is binary or normally distributed [8], there is little empirical data showing what would be optimal choices for time to event outcomes when considering the distribution of events and the accrual intensity, as these can vary depending on the particular case.

SSR is an extension to the GSD, where the sample size of the trial may be re-calculated at IA(s) [9], i.e. there is no fixed maximum sample size. Decisions about changing the sample size depend on pre-specified criteria, and at each IA in theory one of five decisions can be made; stop the RCT for evidence of efficacy, continue the RCT with a decrease in the planned sample size to ensure adequate power to detect a larger than anticipated effect size, continue as planned, continue with an increased sample size to ensure adequate power to detect a smaller than anticipated effect size, or stop the trial for futility if the difference between treatment arms is small and/or the sample size required to find a difference would be too large [10]. This type of design can be particularly helpful when there is uncertainty about the effect size of a novel intervention or uncertainty in the distribution of the outcome in the control arm, as the originally calculated sample size may be incorrect [11]. In addition to the parameters needed for a GSD, the SSR design also needs to pre-specify the timing of SSR and whether and to what extent the sample size can decrease or increase.

For adaptive designs, it is important that the endpoint is recorded quickly, relative to the trial recruitment rate, to allow the IA to make impactful decisions on the trial design [12]. The longer the delay in observing the outcome, relative to recruitment rate, the less efficient the design becomes. Therefore, when these designs are being considered, exploring the best time placement for IA(s) for the particular case (dependent on recruitment rate and time to event) is imperative. Recently, there have been several RCTs looking at the prevention of ventilator-associated pneumonia (VAP) [13–17]. GSD and/or SSR could be beneficial in this setting as VAP normally occurs relatively quickly after the start of the at-risk period (under invasive mechanical ventilation (IMV) for at least 48 h) with a median time to VAP diagnosis of less than 10 days [13–17].

The objective of this manuscript is to evaluate the feasibility and added value of GSD and SSR in RCTs focused on prevention of VAP, informed by real world data about VAP in the intensive care unit (ICU).

Methods

Data

The Hospital Network Study – Preparation for a Randomised Evaluation of anti-Pneumonia Strategies (HONEST-PREPS), provided the required data. It was a prospective cohort study enrolling patients ≥ 18 years old admitted to the ICU at risk of hospital-acquired pneumonia or VAP (expected or documented length of hospitalisation of more than 48 h) in 12 ICUs across 6 European countries from 2020 to 2022 (ClinicalTrials.gov ID: NCT05060718) [18–20]. All patients, or their legal representatives, had to provide informed consent for participation. In HONEST-PREPS, VAP was defined according to Food and Drug Administration (FDA) criteria [21]. For this current study, patients were excluded if they were not ventilated for at least 48 h during their ICU stay.

Simulation study

Setting

We compare the GSD and SSR designs to a conventional fixed RCT, in a parallel group, two-arm, superiority trial to assess the effectiveness of a new investigational intervention (Inv) compared to the standard of care (SoC) to reduce the incidence of VAP, among patients under IMV for at least 48 h. We assume a 1:1 allocation ratio. The performance measures of interest are the expected and maximum sample size, see below. The data from HONEST-PREPS was used to find the average accrual intensity, and fit models for VAP incidence, and distribution of VAP onset time, see the code in additional material 2. The values utilised within the simulation study were calculated from these fitted models.

Design choices

Simulations are separated into three individual parts. We first compared a fixed RCT to a GSD with O’Brien Fleming (OBF) and Pocock alpha- and non-binding beta-spending functions [22], with a single IA placed at various timings throughout the study. We chose non-binding beta-spending functions as they tend to be more common in clinical trial practice [23] and we assume early stopping at the IA for futility, if these boundaries are crossed throughout all our investigations. Pocock boundaries are narrower than OBF boundaries (Figure SM1), which means that the trial is more likely to stop at IAs performed at the same time point. However, a trial utilising the OBF boundaries is more likely to reject the null hypothesis at the final analysis. Secondly, the impact of the number of IAs (one to nine) used in a GSD with OBF boundaries was investigated. Here, we assumed the IAs would be equally spaced throughout the trial. Finally, we compared a fixed RCT to a GSD, SSR only allowing an increase in sample size (SSR-increase) and SSR allowing an increase or decrease in sample size (SSR-both) using OBF boundaries, with one IA, in two settings. Optimal timing of the IA was selected, assessed by finding the minimum number of expected events in a GSD.

In the first comparison, for GSD and fixed RCT designs, we used the “true” prevention effect to calculate their sample sizes. For both SSR designs, we started with a smaller study, powered to detect a realistic prevention effect – that was larger than the smallest clinically meaningful, prevention effect (underpowered), this was then increased appropriately at the IA [24]. We used an absolute prevention effect which was 5% larger than the true prevention effect in the investigational intervention arm, to calculate the initial sample size and then re-calculated the sample size, using the observed prevention effect at IA.

In the second comparison, we conceptualised a frequently occurring design dilemma when the expected prevention effect is larger than the minimal clinically relevant prevention effect. Investigators can initially base the sample size on the relevant effect size, resulting in a larger trial (“pessimistic” design), or on the expected effect size, resulting in a smaller trial that is underpowered if the true effect is equal to the relevant effect (“greedy” design). We contrasted the sample sizes of a fixed RCT, GSD, SSR-increase and SSR-both while exploring the impact of using an initial prevention effect to calculate the sample size of the trial, which was lower and higher than the “true” prevention effect for all four designs. Where the initial prevention effect is higher, this can be conceptualised as a “greedy” design, and the sample size should be re-estimated at IA within the SSR-design considering the smaller “true” effect. When the initial prevention effect is smaller, this would be a “pessimistic” design, and we re-estimate the sample size at IA within the SSR-design considering a larger “true” effect.

Furthermore, we applied the following assumptions to all trial designs: one sided type I error of 2.5%, patients are followed for a maximum of 1 month from enrolment (which is long enough to determine day 28 VAP rate), there are no dropouts, power of 80% (main text) and 90% (additional material 1). The GSDs utilised formal efficacy and non-binding futility stopping boundaries. The SSR designs utilised formal efficacy stopping boundaries, however, we could stop these trials for futility if the increase in sample size at interim analysis was too high or if the prevention effect observed at IA favoured the SoC, described in more detail below.

Sample size calculation

We used time-to-event methodology to compute the sample sizes assuming proportional hazards. The use of time to VAP diagnosis as an endpoint is preferred over a single measure of difference in risks between the treatment arms, as it also includes the time taken for VAP to occur and the potential variation in follow-up time [25], as in the trial [16]. Therefore, we used hazard ratios (HRs) to calculate the expected and maximum number of events needed for each trial design.

For a fixed RCT, Eq. 1 can be used to find the total number of events, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:n$$\end{document}$ , which produces a trial with power $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:(1-\beta\:)$$\end{document}$ and maintains one-sided type I error, α, where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{P}_{k}$$\end{document}$ is the proportion of participants allocated to each treatment arm $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:k$$\end{document}$ : SoC or investigational [26].

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:n=\frac{{({Z}_{\left(1-\alpha\:\right)}+{Z}_{(1-\beta\:)})}^{2}}{{P}_{Inv}\:\bullet\:\:{P}_{SoC}\:\bullet\:\:{\text{ln}\left(HR\right)}^{2}}$$\end{document}

Then the sample size is defined as the ratio between the number of events (i.e. the number of patients who develop VAP) and the probability of observing an event during the trial [25]. To compute the denominator, we assume a piecewise exponential distribution for VAP cumulative incidence, based on the HONEST-PREPS data. All analyses described below were performed using the statistical program ‘R’ version 4.4.2 [27], using the package ‘rpact’ [28] and the code is available in additional material 2.

Scenarios

We investigated these trial designs by utilising a series of five simulated parallel group, two-arm, superiority interventional trials. We explored a range of risk differences (based around a 32% reduction in 28-day VAP cumulative incidence as found in SAATELITE [14] and AMIKINHAL [16]). We derived the HR per scenario, by utilising the risk difference and the 28-day VAP cumulative incidence in the SoC arm estimated from the HONEST-PREPS data, Table 1.

Table 128-day cumulative incidence of ventilator-associated pneumonia (VAP) in the standard of care arm (SoC) and the investigational intervention (Inv) per scenarioScenarioCumulative incidence of VAP in SoC% Reduction in InvCumulative incidence of VAP in InvHazard Ratio 1 15.5%20%12.4%0.79 2 15.5%25%11.6%0.73 3 15.5%30%10.8%0.68 4 15.5%35%10.1%0.63 5 15.5%40%9.3%0.58

Performance measures

We report both the expected and maximum number of events, for each trial design, within each scenario explored. As the fixed RCT includes no IA, these two measures are equal.

The expected number of events, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:E\left[n\right]$$\end{document}$ , within a trial is defined as the average number of events (number of patients who develop VAP), given the probability of stopping the trial at each IA or continuing to final analysis [7].

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:E\left[n\right]={\sum\:}_{i=1}^{I}{p}_{i}\bullet\:{n}_{i}$$\end{document}

Here $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{p}_{i}$$\end{document}$ is the probability of stopping the trial after the $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:i$$\end{document}$ th analysis for all $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:i=\{1,\:2,\:\dots\:,\:I\}$$\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{\sum\:}_{i=1}^{I}{p}_{i}=1$$\end{document}$ and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:I$$\end{document}$ is the total number of planned IAs $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:+1$$\end{document}$ . Furthermore, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:{n}_{i}$$\end{document}$ is the number of events (across the SoC and investigational intervention arms) in all stages, 1 to $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:i$$\end{document}$ , of the trial for all $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\:i=\{1,\:2,\:\dots\:,\:I\}$$\end{document}$ .

In the GSD, the maximum number of events is the total number of events across both treatment arms and all stages of the trial, assuming the trial does not stop at any IAs.

In our setting, we focus on SSR using the inverse normal combination test design, with only one IA. At the IA, the trial can either (1) stop for efficacy using the formal OBF alpha-spending boundaries (2) stop for futility if the observed HR ≥ 1 (3) proceed to the second stage, where the number of events needed is recalculated based on the observed HR from stage 1. In each scenario we must assign a total maximum number of events which could be observed in the trial. This is chosen via an optimisation function utilising a simulation approach and an optimisation effect size, to ensure the conditional power equals the desired power over all simulations and we limit it to be at most double what is needed in a fixed RCT using the assumed HR. The optimisation effect size is set equal to the clinically relevant effect size with the “greedy” design, and to the expected effect size with the “pessimistic” design (see above). Across the simulations the SSR design can increase the maximum number of events in the trial up to the optimised total maximum number of events. The largest maximum number of events across all simulations is known as the total maximum number of events. The average of these maximum number of events is known as the mean maximum number of events and is used as a performance measure for SSR. If the conditional power will not reach the desired power without increasing the total maximum number of events past the limit, then we stop the trial at the IA for futility. Here, the term ‘conditional power’ refers to the proportion of the simulations where the null hypothesis is correctly rejected at any stage (interim or final analysis) given the data observed at the interim analysis, and the simulation concludes, correctly, the superiority of the investigational intervention [29].

Results

HONEST-PREPS recruited a total of 1,096 patients at risk of VAP, i.e. under IMV for at least 48 h, of which 170 patients (cumulative incidence 15.5%) developed VAP by day 28. Median time between start of IMV (+ 48 h) and VAP onset was 3 days (IQR: 1–6). Average accrual intensity was 47 patients/month. There were 5 ventilated patients who were lost to follow-up and censored at their last follow-up date.

Fixed RCT

In a fixed RCT, based on an effect size (HR) of 0.58–0.79, we need 106–566 events, and 747–2710 patients to show superiority of the investigational intervention (Table SM1 in the additional material 1).

Group sequential design

The GSD with one IA, utilising OBF boundaries to allow early stopping at the IA for efficacy and futility, produces a smaller expected number of events than the fixed RCT. When powering a trial at 80%, the optimal placement of the single IA is at 64% of the maximum number of events which is calculated for the GSD trial. This gives 41% and 11% probability of stopping the RCT at the IA for efficacy and futility, respectively, which is true across prevention effects (Figs. 1c-d). This results in an expected 502 events and 2470 patients, saving 64 events (11.3%) and 240 patients (8.9%), compared to the fixed RCT (566 events and 2710 patients) in scenario 1, when prevention effect is low (HR = 0.79). The distribution of required events is shown in Figure SM2a in the additional material 1. For scenario 5 (HR = 0.58), high prevention effect, there is a smaller absolute but comparable relative saving of 12 events (11.3%; Fixed = 106 vs. GSD = 94) and 71 patients (9.5%; Fixed = 747 vs. GSD = 676) (Fig. 1a, Table SM1 in the additional material 1). The distribution of required events is shown in Figure SM2b in the additional material 1.

The savings in expected sample size comes at the cost of an increase in the maximum sample size. If the trial does not stop at the optimal IA, it would require 51 more events (9.0%) and 186 more patients (6.9%), than in the fixed design (scenario 1), or 10 additional events (9.4%) and 60 additional patients (8.0%) (scenario 5) (Fig. 1b, Table SM1 in the additional material 1).

Fig. 1. Comparison of Group Sequential Design (GSD) with O’Brien-Fleming boundaries (dashed) against Fixed Randomised Controlled Trial (solid) as timing of interim analysis (IA) placement changes showing (a) expected (Exp) number of events, (b) maximum (Max) number of events, (c) probability (Prob) of stopping for efficacy at the IA, and (d) Prob of stopping for futility at the IA, for five different prevention effects, for power of 80%

When the power is 90%, optimal IA placement, relative gains in expected sample size and relative increase of maximum sample size are similar (Figure SM3, Table SM2 in the additional material 1).

The optimal placement of the single IA using the Pocock boundaries is 48% of the maximum number of events, when the power is 80%, with a 51% and 12% probability of early stopping for efficacy and futility, respectively. This results in a slightly smaller expected number of events (3.4%) and patients (3.6%) compared to using the OBF boundaries, for each HR evaluated (scenario 1: Pocock = 485 vs. OBF = 502 events and Pocock = 2380 vs. OBF = 2470 patients). Conversely, the OBF boundaries produce a reduction for the maximum number of events (14.3%) and patients (10.9%) (scenario 1: Pocock = 720 vs. OBF = 617 events and Pocock = 3251 vs. OBF = 2896 patients), Fig. 2 and Table SM3 in the additional material 1. The distribution of required events for a GSD with Pocock boundaries is displayed in Figure SM4 in the additional material 1. Similar results are displayed in Figure SM5 and Table SM4 in the additional material 1, when the power is 90%. Due to the possibility of having a larger study (if the trial does not stop at the IA) when utilising the Pocock boundaries, we focus on the OBF boundaries as preferred, throughout the rest of this manuscript.

Fig. 2. Comparison of Group Sequential Design (GSD) with Pocock boundaries (dashed) against Fixed Randomised Controlled Trial (solid) as timing of interim analysis (IA) placement changes showing (a) expected (Exp) number of events, (b) maximum (Max) number of events, (c) probability (Prob) of stopping for efficacy at the IA, and (d) Prob of stopping for futility at the IA, for five different prevention effects, for power of 80%

We further investigate the impact of changing the number of IAs from one to nine in a GSD utilising formal efficacy and non-binding futility OBF boundaries. Increasing the number of IA, increases the probability of early stopping for efficacy (one = 18% vs. nine = 76%) and futility (one = 7% vs. nine = 18%), which decreases the expected number of events (by one = 7.6% vs. nine = 20.1%) and expected number of patients (by one = 7.3% vs. nine = 16.7%) in comparison to the fixed RCT (scenario 1: 566 events and 2737 patients). This increases the maximum number of events (by one = 5.5% vs. nine = 19.8%) and increases the maximum number of patients (by one = 3.2% vs. nine = 13.5%) in comparison to the fixed RCT (Figures SM6-7 in the additional material 1).

Sample size re-estimation

For SSR, we used OBF efficacy stopping boundaries placed at the optimal timing of IA from the GSD, i.e. 64% of the maximum number of events, and aimed for 80% power. Furthermore, the trial would be stopped at IA for futility if the observed HR ≥ 1. We used an assumed prevention effect 5% larger (based on a 5% larger reduction in VAP in the investigational intervention arm) than the true prevention effect used in the simulations, to calculate the number of events and patients needed in the first and second stage of the trial. Then the prevention effect observed in the simulated data at the IA and the total maximum number of events found via the optimisation function were used to recalculate the number of events and patients needed in the second stage of the trial to reach a conditional power of 80% across 10,000 simulations. See Figures SM8-SM9 in the additional material 1 for more details.

Comparison of designs where the assumed prevention effect is correct

We investigated the SSR design for several true HRs (0.58, 0.68, 0.79), where the assumed HR used to calculate the initial sample size is purposely based on a prevention effect 5% larger than the assumed HR (HR 0.53, 0.63, 0.73, respectively) (Table 2). When we compare SSR to the GSD (with optimal IA timing, 64%) and fixed design (both using the true prevention effect), the adaptive trial designs give smaller expected numbers of events, with an increase in maximum numbers of events. For scenarios 3 and 5, the GSD produces the lowest expected number of events, while the fixed RCT produces the highest (Table 2). The adaptive designs reduce the expected number of events by 9.4%−11.3% and expected number of patients by 7.8%−9.5% in comparison to the fixed RCT. However, they increase the mean maximum number of events by 5.7%−9.4% and patients by 5.0-8.2% in comparison to the fixed RCT. For scenario 1, the increase needed in the number of events at the IA, to reach the conditional power for both SSR designs would be larger than the pre-determined limit. Therefore, if the SSR design simulations did not stop for efficacy, we stopped them for futility at the IA, producing the smallest expected and maximum number of events and patients across all designs.

Table 2. Comparison of Group Sequential Design (GSD), Sample Size Re-estimation (SSR)-increase only and SSR-both (utilising 10,000 simulations) with O’Brien-Fleming boundaries against Fixed Randomised Controlled Trial (RCT) with interim analysis (IA) placed at 64% of the trial, for three different prevention effects, for power of 80%. The Fixed RCT and GSD use the true hazard ratio (HR) in their sample size calculation, the SSR-increase only and SSR-both use the assumed HR for their initial sample size calculation and the HR observed in the simulated data to re-calculate the sample size at the IAScenarioAssumed HRTrue HRFixed RCTGSD OBFSSR -increaseSSR - bothExpected no. of events10.730.79566502206206Expected sample size10.730.792710247012261226Mean Maximum no. of events10.730.79NA617206206Mean Maximum sample size10.730.79NA289612261226Total Maximum no. of events10.730.79NANA206206Expected no. of events30.630.68212188192190Expected sample size30.630.681291117011851169Mean Maximum no. of events30.630.68NA231227224Mean Maximum sample size30.630.68NA138913771355Total Maximum no. of events30.630.68NANA287296Expected no. of events50.530.58106949696Expected sample size50.530.58747676686689Mean Maximum no. of events50.530.58NA116113115Mean Maximum sample size50.530.58NA807803808Total Maximum no. of events50.530.58NANA139151

Comparison of designs where the assumed prevention effect is incorrect

Figure 3 shows the situation where the assumed HR initially used for the sample size calculation (for all designs) is not equal to the true HR, and thus, the study is either underpowered, i.e. the effect of the investigational intervention was overestimated, or the effect of the SoC was underestimated or the study is overpowered, i.e. the effect of the investigational intervention was underestimated, or the effect of the SoC was overestimated.

When the assumed HR is further from the null than the true HR, the study is underpowered. The power of the GSD and fixed design is always below the target of 80%. However, the SSR designs allow the second stage sample size to increase enough, such that they attain the target power, 80%. Due to the limit we have chosen for finding the optimal total maximum number of events, the SSR designs only allow an increase in sample size when the assumed HR is relatively similar to the true HR. When the assumed HR is far from the true HR, we have a high probability of stopping the SSR designs at the IA for futility, as the increase in sample size is so large it would be more beneficial to run a completely new study, see also Figure SM10 in the additional material 1. Allowing the sample size to also decrease at the IA, requires an even larger increase in the total maximum number of events. The probability of stopping the trial early for efficacy is similar for the GSD and SSR, due to using similar efficacy stopping values. However, the probability of stopping for futility is very different, due to the different approaches used (GSD: stopping boundary vs. SSR: sample size increase being too large to be beneficial), left side of Fig. 3 and Figure SM10 in the additional material 1.

Fig. 3. Comparison of Group Sequential Design (GSD) and Sample Size Re-estimation (SSR, utilising 5,000 simulations and an optimisation prevention effect, HR = 0.68) with O’Brien-Fleming boundaries against Fixed Randomised Controlled Trial (RCT) for (a) expected (Exp) number of events, (b) mean maximum (Max) number of events, (c) probability (Prob) of stopping for efficacy at the interim analysis (IA), (d) Prob of stopping for futility at the IA and (e) power, for a range of assumed HRs, when the true HR = 0.68, for power of 80%, with a single IA 64% through the trial

When the assumed prevention effect used to initially calculate the sample size is equal to the true prevention effect, there is little difference between the GSD and SSR, shown in Fig. 3 (assumed HR 0.68) and Figures SM 10–11 (assumed HR 0.79 and 0.58 in the additional material 1). They give an expected number of events smaller than the RCT, and a larger mean maximum number of events.

When the assumed HR utilised for the initial sample size calculation is closer to the null than the true HR, the study is overpowered. Here, the power of all trial designs, is always at least as large as the target of 80%. The SSR design allowing a decrease in sample size has a smaller increase in power than the other designs, i.e., is the most efficient approach, see also Figure SM11 in the additional material 1. The probability of stopping the trial early for efficacy and futility is similar for the GSD and both the SSR designs. The expected number of events is similar for the GSD and both SSR designs, and smaller than the fixed RCT. In addition, the GSD, SSR-increase and fixed design produce a similar mean maximum number of events, whereas SSR-both consistently produces the smallest mean maximum number of events, right side of Fig. 3 and Figure SM11 in the additional material 1.

Figure 4 shows the situation where the true HR is 1, and thus, the investigational intervention is no better than the SoC. To find the optimal total maximum number of events with the SSR designs, we again utilised an optimisation prevention effect of HR = 0.68. Here, the type I error (erroneously concluding the investigational intervention to be best) of all trial designs, is at most 0.025, where all four designs have a similar type I error.

The efficacy stopping probability is low for the GSD and both SSR designs and the futility stopping probability is high. The futility stopping is consistently high (87%) for the GSD across all assumed HRs, however, it decreases from 99% to roughly 50% when the assumed HR ≥ 0.63 for both SSR designs.

The GSD consistently produces a smaller expected number of events than the fixed RCT across all assumed HRs. The SSR designs give a larger expected number of events than the fixed RCT, when the assumed HR = 0.63, but a smaller expected number of events for all other assumed HRs investigated, with the SSR-both design producing the smallest. Furthermore, the SSR-both design gives a smaller expected number of events than the GSD when the assumed HR ∈ {0.58, 0.73, 0.79}.

The GSD consistently produces a slightly larger mean maximum number of events than the fixed RCT across all assumed HRs. The SSR designs give a much larger mean maximum number of events than the fixed RCT, when the assumed HR = 0.63. The SSR-increase design gives a mean maximum number of events similar to the fixed RCT for assumed HR ≥ 0.68 and smaller than the fixed RCT for assumed HR = 0.58, while the SSR-both design produces the smallest mean maximum number of events for assumed HR ∈ {0.58, 0.73, 0.79}, Fig. 4.

Fig. 4. Comparison of Group Sequential Design (GSD) and Sample Size Re-estimation (SSR, utilising 5,000 simulations and an optimisation prevention effect, HR = 0.68) with O’Brien-Fleming boundaries against Fixed Randomised Controlled Trial (RCT) for (a) expected (Exp) number of events, (b) mean maximum (Max) number of events, (c) probability (Prob) of stopping for efficacy at the interim analysis (IA), (d) Prob of stopping for futility at the IA and (e) type I error, for a range of assumed HRs, when the true HR = 1, for desired power of 80%, with a single IA 64% of the way through the trial

Figures SM12-15 in the additional material 1 demonstrate similar results when the power of the trial is targeted to be 90%.

Discussion

In this simulation study we have explored the added value and feasibility of the GSD and SSR trial designs in comparison to a conventional fixed RCT in the setting of a trial evaluating the effectiveness of an investigational intervention vs. SoC to prevent VAP in an ICU population. Results indicate that the GSD and SSR designs are likely to decrease the expected number of events and hence, the expected sample size of a trial, reducing trial costs, when compared to the fixed RCT, while not greatly increasing the maximum number of events within the trial. To be noted, OBF boundaries gave better results than Pocock boundaries, specifically when focusing on the maximum number of events, showing the greater feasibility of utilising these boundaries. The relative gain of the adaptive designs are similar across scenarios, and for both target powers explored. SSR designs are most efficient when the true prevention effect is overestimated and hence the trial is underpowered, although, there is little difference between GSDs and SSR when the correct prevention effect is assumed. Based on these results, the application of GSD or SSR should always be considered when planning an RCT in these settings.

The general idea for the GSD is to calculate a (relatively) large sample size and hope to stop the trial early. Conversely, some authors suggest for SSR, we should start small and then increase the sample size appropriately at the IA [30]. Some authors recommend only allowing the sample size to increase at an IA [24]. There is much debate in the literature on which is the best approach GSD vs. SSR [10, 24, 30–32]. Here, we have shown that purposefully overestimating the prevention effect and underpowering a study in the SSR design is detrimental and utilising a GSD using the true prevention effect is more advantageous in the setting (utilising an interim analysis at 64% of the maximum number of events with OBF stopping boundaries) we have explored above (Table 2). Although, the placement of the IA within the SSR designs was not optimised, therefore, a later IA placement may improve their performance.

When little information about the prevention effect is available, SSR is advantageous, in case of under- or overestimation, however, choosing a sensible limit for the maximum increase in number of events is imperative to get maximal benefit from these designs. When the assumed prevention effect was overestimated, the GSD performed poorly. There was a small probability of correctly stopping the trial at IA for efficacy, however, often the trial would not stop and incorrectly fail to reject the null hypothesis at the final analysis, producing a low power. However, SSR could reach the desired power, by increasing the sample size of the second stage, if the assumed HR was not far from the true HR. When the assumed HR was further from the true HR, the trial could be stopped at IA for futility, saving time and patients, as the conditional power of the trial would not reach its target with our capped maximum increase in number of events. In this case, performing a new GSD would be more efficient, utilising the HR observed at IA to calculate the sample size required. In practice, sample size caps are frequently applied to avoid lengthy and costly trials and to ensure the trial is powered for a clinically meaningful effect. When the true prevention effect was underestimated and the trial was overpowered, SSR (allowing a decrease in sample size) was most efficient, as it could reach the desired power without large investments in second stage sample size. It produced the lowest mean maximum number of events and lowest expected number of events. However, it should be considered that trials need a minimum sample size to adequately assess safety outcomes and adverse events, which restricts the placement of the IA, which may limit the potential benefit of GSD and SSR.

When there was no difference between the intervention and SoC, the type I error was controlled at the alpha level for all adaptive designs, where due to the use of non-binding futility bounds, the GSD produces a conservative trial. These results represent the minimum type I error. Sometimes in clinical trial practice, although non-binding futility bounds are used, trials will not always stop early if these boundaries are crossed due to there being reasons other than efficacy, to continue the trial. In this case the type I error is inflated above what is found here, although it will still be controlled at the alpha level [23].

As we have optimised the total maximum number of events for the SSR designs considering the optimisation prevention effect, HR = 0.68, the expected number of events for the SSR-increase design are larger than the GSD when the assumed HR ≥ 0.63, this is due to the smaller probability of futility stopping for the SSR-increase design. Contrarily, the mean maximum number of events of the SSR-increase design are smaller than the GSD when the assumed HR ∈ {0.58, 0.68, 0.73, 0.79}, due to the different futility bounds used within the two trial designs. In addition, the expected and mean maximum number of events for the SSR-both design are smaller than the GSD when the assumed HR ∈ {0.58, 0.73, 0.79}, as the second stage sample size will decrease if it does not stop at the IA for futility. This shows the added benefit of allowing the SSR design to decrease the second stage sample size, when the assumed prevention effect is larger than the optimisation prevention effect.

To inform simulations for GSD or SSR to plan a specific RCT, it would be beneficial to have access to valid estimates of outcome incidence for the SoC, time from randomisation to outcome and recruitment rates from envisioned clinical study sites. This is the underlying paradigm of the perpetual observational studies (POSs), which have been designed within the European clinical research alliance on infectious diseases (ECRAID) [33]. ECRAID has set-up several POSs, which provide a permanent infrastructure of clinical sites practiced in recruiting patients and prospectively collecting high-quality data. This network would allow for efficient set-up of interventional studies, through availability of observational data to inform site selection, sample size calculations and simulations [34]. ECRAID’s POS-VAP is a network of 31 active sites (as of 18th July 2025) across twelve European countries [35], already including 6108 patients and 972 with VAP. These data and this network could help make future trials, like SAATELLITE [14] or AMIKINHAL [16], more efficient.

This study focused specifically on parallel group, two-arm, superiority trials to evaluate the effectiveness of a novel intervention to prevent VAP, inspired by SAATELLITE [14] and AMIKINHAL [16]. SAATELLITE concluded that suvratoxumab resulted in a trend towards a lower incidence of Staphylococcus aureus pneumonia and continued investigation of suvratoxumab in this patient population should be performed [14]. AMIKINHAL concluded that a 3-day course of inhaled amikacin reduced the burden of VAP during 28 days of follow-up, in patients who had been under IMV for at least 3 days [16]. To improve efficiency of future trials in this setting, the combination of the POS-VAP network, and informed GSD and SSR simulations should be considered.

Strengths of this work are that it was based on real-world data from HONEST-PREPS, and several different scenarios were investigated informed by results from two previous trials, SAATELITE [14] and AMIKINHAL [16]. In addition, we explored the use of these adaptive designs in the situation where our assumed prevention effect did not match the true prevention effect, a common issue for many clinical trials.

The present study also has several limitations. First, competing events have not been accounted for in the simulations. Patients at risk of developing VAP may experience a competing event such as death, ICU discharge or extubation, which means they are no longer at risk of VAP. We have minimised this limitation by utilising the cumulative incidence function to estimate the VAP rate using the HONEST-PREPS data, which did account for competing events. Furthermore, we have based this analysis on data from HONEST-PREPS, but other studies have shown differing cumulative incidences of VAP [36–38]. Thus, depending on local epidemiology and VAP definitions, VAP incidence rates and onset times may be different, which could affect the predicted benefit of these designs. Although, this further highlights the need for trials to use adaptive designs, as our prior information about the prevention effect could be incorrect.

Conclusion

In conclusion, this simulation study shows that both GSD and SSR are effective and feasible adaptive designs for RCTs aiming to compare the effectiveness of an investigational intervention against the SoC to prevent VAP. Both designs would be preferred over a fixed RCT, as the expected number of events would be smaller, with a small probability of a slight increase in maximum number of events. Application of these adaptive designs should be considered at the trial design stage, as it would result in a more efficient RCT, reducing risks for patients within the trial, and making results available to patients faster.

Supplementary Information

Additional file 1: Benefits of group sequential design and sample size re-estimation for RCTs evaluating the prevention of ventilator-associated pneumonia: A simulation study informed by real world data. Additional analysis results, for additional simulation scenarios.

Additional file 2: Benefits of group sequential design and sample size re-estimation for RCTs evaluating the prevention of ventilator-associated pneumonia: A simulation study informed by real world data. Document containing the R code which was used to produce the results in this manuscript.

Bibliography8

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1COMBACTE-NET. (February 2023) HONEST-PREPS Trial Story. Accessed 15/02/2024 at https://www.combacte.com/news/honest-preps-trial-story/
2Hospital Network Study – Preparation for a Randomized Evaluation of Anti-Pneumonia Strategies (HONEST-PREPS). Clinical Trials.gov identifier: NCT 05060718. Updated November 29, 2023. Accessed 07/01/2025. https://clinicaltrials.gov/study/NCT 05060718
3Jackson H, Sauser J, Bonten M et al. (2023, September 12–15). Clinical and microbiological epidemiology of hospital-acquired and ventilator-associated pneumonia: prospective study in 12 European IC Us [Poster presentation]. International Conference on Prevention and Infection Control, Geneva, Switzerland. https://aricjournal.biomedcentral.com/articles/supplements/volume-12-supplement-1
4US Food and Drug Administration. (June 2020) Hospital-Acquired bacterial pneumonia and ventilator associated bacterial pneumonia: developing drugs for treatment guidance for industry. FDA, Accessed 20/02/2024, https://www.fda.gov/media/79516/download
5R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/. Accessed February 28, 2024.
6Wassmer G, Pahlke F. (2024). rpact: Confirmatory Adaptive Clinical Trial Design and Analysis. R package version 3.5.1, https://www.rpact.com, https://www.github.com/rpact-com/rpact, https://www.rpact-com.github.io/rpact/ ,https://www.rpact.org Accessed February 28, 2024 .
7European Clinical Research Alliance on Infectious Diseases. (April 2023) The warm base network. Ecraid, Accessed 28/02/2024, https://ecraid.eu/sites/default/files/2023-04/The_warm_base_network_white_paper.pdf
8European Clinical Research Alliance on Infectious Diseases. POS-VAP. Ecraid, Accessed 28/10/2024, https://ecraid.eu/study/pos-vap