Incremental Intervention Effects in Studies with Dropout and Many Timepoints

Kwangho Kim; Edward H. Kennedy; and Ashley I. Naimi

arXiv:1907.04004·stat.ME·March 16, 2022


TL;DR

This paper extends incremental intervention effects to longitudinal studies with dropout and many timepoints, providing new estimators and theoretical guarantees for efficient inference in complex settings.

Contribution

It generalizes incremental intervention effects to handle multiple outcomes and dropout, deriving identifying expressions and efficient estimators with strong theoretical properties.

Findings

01

Estimators converge at fast parametric rates.

02

Incremental effects offer near-exponential gains in precision.

03

Methods are validated through simulations and applied to a study of the effect of low-dose aspirin on pregnancy outcomes.


Tables

Table 1: Integrated bias and RMSE across different baselines and simulation settings.

n	$\widehat{bias}$ $(\times 10^{-3})$				$\widehat{RMSE}$				Average dropouts (%)
	$\hat{\psi}_{inc.pi}$	$\hat{\psi}_{inc.ipw}$	$\hat{\psi}_{inc.nc}$	$\hat{\psi}_{ours}$	$\hat{\psi}_{inc.pi}$	$\hat{\psi}_{inc.ipw}$	$\hat{\psi}_{inc.nc}$	$\hat{\psi}_{ours}$	
1000	14.5	24.1	30.1	9.8	1.59	2.78	2.96	1.37	50.5
1000	12.5	14.7	19.8	8.3	1.31	1.84	2.01	1.14	28.0
1000	10.7	11.3	9.5	7.2	1.17	1.35	1.13	0.99	8.9
2500	10.3	12.0	23.8	7.1	1.15	1.34	1.29	1.05	49.6
2500	10.2	10.9	14.1	6.2	1.06	1.19	1.06	0.95	27.5
2500	7.8	7.5	5.3	4.5	0.94	1.03	0.93	0.91	9.1

Equations

$$Z = (X_{1}, A_{1}, Y_{1}, X_{2}, A_{2}, Y_{2}, \ldots, X_{T}, A_{T}, Y_{T}).$$

$$Z = \big(X_{1}, A_{1}, R_{2}, R_{2}(Y_{1}, X_{2}, A_{2}), \ldots, R_{T}, R_{T}(Y_{T-1}, X_{T}, A_{T}), R_{T+1}, R_{T+1}Y_{T}\big)$$

$$\begin{cases} R_{t}=1 \Rightarrow (R_{1},\ldots,R_{t-1})=\bm{1} \\ R_{t}=0 \Rightarrow (R_{t+1},\ldots,R_{T})=\bm{0}, \end{cases}$$

$$\{R_{t}, R_{t}(Y_{t-1}, X_{t}, A_{t})\}$$

$$\{R_{t}(Y_{t-1}, X_{t}, A_{t}), R_{t+1}\}$$

$$q_{t}(h_{t};\delta,\pi_{t}) = \frac{\delta\pi_{t}(h_{t})}{\delta\pi_{t}(h_{t})+1-\pi_{t}(h_{t})},$$

$$\psi_{t}(\delta) = \mathbb{E}\big(Y_{t}^{\overline{Q}_{t}(\delta)}\big)$$

$$\psi_{t}(\delta)$$

$$q_{s}(a_{s}\mid h_{s},R_{s}=1) = \frac{a_{s}\delta\pi_{s}(h_{s},R_{s}=1)+(1-a_{s})\{1-\pi_{s}(h_{s},R_{s}=1)\}}{\delta\pi_{s}(h_{s},R_{s}=1)+1-\pi_{s}(h_{s},R_{s}=1)}.$$

$$Z = (X, A, R, RY),$$

$$\psi(\delta) = \mathbb{E}\left[\frac{\delta\pi(X)\mu(X,1,1)+\{1-\pi(X)\}\mu(X,0,1)}{\delta\pi(X)+\{1-\pi(X)\}}\right]$$

$$\frac{\partial\psi(\mathbb{P}_{\epsilon})}{\partial\epsilon}\Big|_{\epsilon=0} = \int\phi(z;\mathbb{P})\left(\frac{\partial\log d\mathbb{P}_{\epsilon}(z)}{\partial\epsilon}\right)\Big|_{\epsilon=0}\,d\mathbb{P}(z)$$

$$
\begin{aligned}
&\sum_{s=1}^{t}\left(\frac{1}{\delta A_{s}+1-A_{s}}\right)\Bigg[\frac{\{m_{s}(H_{s},1)-m_{s}(H_{s},0)\}\,\delta\,(A_{s}-\pi_{s}(H_{s}))\,\omega_{s}(H_{s},A_{s})}{\delta\pi_{s}(H_{s})+1-\pi_{s}(H_{s})} \\
&\qquad+\begin{pmatrix}\delta m_{s}(H_{s},1)\{\pi_{s}(H_{s})\omega_{s}(H_{s},A_{s})-A_{s}R_{s+1}\}\\ +\,m_{s}(H_{s},0)\{(1-\pi_{s}(H_{s}))\omega_{s}(H_{s},A_{s})-(1-A_{s})R_{s+1}\}\end{pmatrix}\Bigg] \\
&\qquad\times\prod_{k=1}^{s}\left\{\frac{\delta A_{k}+1-A_{k}}{\delta\pi_{k}(H_{k})+1-\pi_{k}(H_{k})}\cdot\frac{R_{k}}{\omega_{k}(H_{k},A_{k})}\right\} \\
&+\prod_{s=1}^{t}\left\{\frac{\delta A_{s}+1-A_{s}}{\delta\pi_{s}(H_{s})+1-\pi_{s}(H_{s})}\cdot\frac{R_{s}}{\omega_{s}(H_{s},A_{s})}\right\}Y_{t}R_{t+1},
\end{aligned}
$$

$$m_{s}(h_{s},a_{s},R_{s+1}=1) = \int_{\mathcal{R}_{s}}\mu(h_{t},a_{t},R_{t+1}=1)\prod_{k=s+1}^{t}q_{k}(a_{k}\mid h_{k},R_{k}=1)\,d\nu(a_{k})\,d\mathbb{P}(y_{k-1},x_{k}\mid h_{k-1},a_{k-1},R_{k}=1)$$

$$\bm{\eta} = (\bm{\pi},\bm{m},\bm{\omega}) = (\pi_{1},\ldots,\pi_{t},m_{1},\ldots,m_{t},\omega_{1},\ldots,\omega_{t}),$$

$$\hat{\psi}_{inc.pi}(t;\delta) = \mathbb{P}_{n}\{\varphi(Z;\hat{\bm{\eta}},\delta,t)\}$$

$$\hat{\psi}_{inc.ipw}(t;\delta) = \mathbb{P}_{n}\left\{\prod_{s=1}^{t}\left(\frac{\delta A_{s}+1-A_{s}}{\delta\hat{\pi}_{s}(H_{s})+1-\hat{\pi}_{s}(H_{s})}\cdot\frac{\mathbbm{1}(R_{s+1}=1)}{\hat{\omega}_{s}(H_{s},A_{s})}\right)Y_{t}\right\}.$$

$$\hat{\psi}_{t}(\delta) = \mathbb{P}_{n}\{\varphi(Z;\hat{\bm{\eta}}_{-S},\delta,t)\} \equiv \frac{1}{K}\sum_{k=1}^{K}\mathbb{P}_{n}^{(k)}\{\varphi(Z;\hat{\bm{\eta}}_{-k},\delta,t)\}$$

$$\frac{\hat{\psi}_{t}(\delta)-\psi_{t}(\delta)}{\hat{\sigma}(t,\delta)/\sqrt{n}} \rightsquigarrow \mathbb{G}(\delta,t)$$

$$\hat{\psi}_{t}(\delta) \pm z_{1-\alpha/2}\sqrt{\frac{\hat{\sigma}^{2}(\delta,t)}{n}}$$

$$\mathbb{P}\left(\sup_{\delta\in\mathcal{D},\,1\leq s\leq t}\left|\frac{\hat{\psi}_{s}(\delta)-\psi_{s}(\delta)}{\hat{\sigma}(\delta,s)/\sqrt{n}}\right|\leq c_{\alpha}\right) = 1-\alpha+o(1).$$

$$\psi_{at} = \prod_{t=1}^{T}\left(\frac{A_{t}}{p}\right)Y$$

$$\psi_{inc} = \prod_{t=1}^{T}\left(\frac{\delta A_{t}+1-A_{t}}{\delta p+1-p}\right)Y$$

$$C_{T}\left[\left\{\frac{\delta^{2}p^{2}+p(1-p)}{(\delta p+1-p)^{2}}\right\}^{T}-p^{T}\right] \leq \frac{Var(\psi_{inc})}{Var(\psi_{at})} \leq C_{T}\,\zeta(T;p)\left\{\frac{\delta^{2}p^{2}+p(1-p)}{(\delta p+1-p)^{2}}\right\}^{T}$$

$$Var(\psi_{inc}) < Var(\psi_{at})$$

$$\min\left\{T : \left[\frac{\delta^{2}p+1-p}{(\delta p+1-p)^{2}}\right]^{T}-\frac{c_{1}}{p^{T}}+2<0\right\} \quad\text{where}\quad c_{1} = \frac{\mathbb{E}\big[(Y^{\overline{1}_{T}})^{2}\big]}{b_{u}^{2}}.$$

$$X_{t} = (X_{1,t},X_{2,t}) \sim N(0,I)$$

$$\pi_{t}(H_{t}) = expit\Big(2\sum_{s=t-2}^{t-1}\left(A_{s}-1/2\right)\Big)$$

$$\left(Y\,\big|\,\overline{X}_{t},\overline{A}_{t}\right) \sim N\big(\mu(\overline{X}_{t},\overline{A}_{t}),1\big)$$


Code & Models

Repositories

kwangho-joshua-kim/Incremental-dropout (official)


Full text

Incremental Intervention Effects in Studies with Dropout and Many Timepoints

Kwangho Kim, Edward H. Kennedy, Ashley I. Naimi

Department of Statistics & Data Science and Machine Learning Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213. Email: [email protected]; Department of Statistics & Data Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213. Email: [email protected]; Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA, USA. Email: [email protected]

Abstract

Modern longitudinal studies collect feature data at many timepoints, often of the same order of sample size. Such studies are typically affected by dropout and positivity violations. We tackle these problems by generalizing effects of recent incremental interventions (which shift propensity scores rather than set treatment values deterministically) to accommodate multiple outcomes and subject dropout. We give an identifying expression for incremental intervention effects when dropout is conditionally ignorable (without requiring treatment positivity), and derive the nonparametric efficiency bound for estimating such effects. Then we present efficient nonparametric estimators, showing that they converge at fast parametric rates and yield uniform inferential guarantees, even when nuisance functions are estimated flexibly at slower rates. We also study the variance ratio of incremental intervention effects relative to more conventional deterministic effects in a novel infinite time horizon setting, where the number of timepoints can grow with sample size, and show that incremental intervention effects yield near-exponential gains in statistical precision in this setup. Finally we conclude with simulations and apply our methods in a study of the effect of low-dose aspirin on pregnancy outcomes. Implementation of our method is publicly available at https://github.com/kwangho-joshua-kim/Incremental-dropout

Keywords: causal inference, time-varying confounding, right-censoring, longitudinal study, positivity

1 Introduction

Causal inference has long been an important scientific pursuit, and understanding causal relationships is essential across many disciplines. However, for practical and ethical reasons, causal questions cannot always be evaluated via experimental methods (i.e., randomized trials), making observational studies the only viable alternative. Further, when individuals can be exposed to varying treatment levels over time, collecting appropriate longitudinal data is important. To that end, recent technological advancements that facilitate data collection are making longitudinal studies with a very large number of time points (sometimes of the same order of sample size) increasingly common (e.g., Kumar et al., 2013; Eysenbach et al., 2011; Klasnja et al., 2015).

The increase in observational studies with detailed longitudinal data has also introduced numerous statistical challenges that remain unaddressed. For longitudinal causal studies, two analytic frameworks are often invoked: effects of deterministic fixed interventions (Robins, 1986; Robins et al., 2000; Hernán et al., 2000), in which all individuals are assigned to a fixed exposure level over all time-points; and effects of deterministic dynamic interventions (Murphy et al., 2001; Robins, 2004) in which, at each time, treatment is assigned according to a fixed rule that depends on past history. In the real world, fixed deterministic interventions might not be of practical interest since the treatment typically cannot be applied uniformly across a population (Kennedy, 2019).

Generally, deterministic interventions (fixed or dynamic) rely on a positivity assumption, which requires every unit to have a nonzero chance of receiving each of the available treatments at every time point. If the positivity assumption is violated, the causal effects of deterministic (fixed or dynamic) interventions are no longer identifiable. Even under positivity, longitudinal studies are especially prone to the curse of dimensionality, since exponentially many samples are needed to learn about all treatment trajectories. These issues only worsen as the number of timepoints or covariates increases. Thus, due to a lack of sufficiently flexible analytic methods for longitudinal data, researchers are often forced either to rely on strong parametric assumptions or to forgo the estimation of causal effects altogether (e.g., Kumar et al., 2013).

One strategy to address such issues with deterministic interventions is to consider stochastic interventions that depend on the observational treatment process and thus are random at each timepoint (e.g., van der Laan and Petersen, 2007; Young et al., 2014; Díaz and van der Laan, 2012; Haneuse and Rotnitzky, 2013; Moore et al., 2012). Recently, Kennedy (2019) proposed novel incremental intervention effects, which quantify effects of shifting treatment propensities rather than effects of setting treatment to fixed values. Importantly, incremental effect estimators do not require positivity, and can still achieve $\sqrt{n}$ rates with flexible nonparametric methods. Despite these strengths, the method has not yet been adapted to general longitudinal studies with multiple right-censored outcomes, which are common in studies with human subjects. Right-censored outcomes can result in biased estimates of incremental intervention effects unless properly adjusted for. This bias is akin to the well-known concept of confounding bias, and will likely be amplified over time in our case. However, extending incremental intervention effects to the right-censoring setup is not straightforward since, for example, it requires computing new remainder terms to construct the estimators.

In this paper we propose a more comprehensive form of incremental intervention effects that accommodate not only time-varying treatments, but time-varying outcomes subject to right censoring (i.e., dropout). We provide an identifying expression for incremental intervention effects when dropout is conditionally ignorable, still without requiring (treatment) positivity, and derive the nonparametric efficiency bound for estimating such effects. We go on to present efficient nonparametric estimators, showing that they converge at fast rates and give uniform inferential guarantees, even when nuisance functions are estimated at much slower rates with flexible machine learning tools. Importantly, we study the variance ratio of incremental effects to more conventional deterministic effects in a novel infinite time horizon setting, where the number of timepoints can grow with sample size to infinity. We specifically show that incremental intervention effects can reduce the variance near exponentially, thus yielding extraordinary gains in statistical precision in this setup. Finally, we conduct a simulation study and show that our proposed methods can successfully adjust for subject dropout in incremental intervention effects, and apply our methods to a longitudinal study of the effect of low-dose aspirin on pregnancy outcomes.

2 Setup

We consider a study where for each subject we observe covariates $X_{t}\in\mathbb{R}^{d}$, treatment $A_{t}\in\mathbb{R}$, and outcome $Y_{t}\in\mathbb{R}$, with all variables allowed to vary over time $t$, but where subjects can drop out or be lost to follow-up. In particular, we would ideally observe a sample of i.i.d. observations $(Z_{1},...,Z_{n})$ from a probability distribution $\mathbb{P}$ with, for those subjects who remain in the study up to the final timepoint $t=T$,

$$Z = (X_{1}, A_{1}, Y_{1}, X_{2}, A_{2}, Y_{2}, \ldots, X_{T}, A_{T}, Y_{T}).$$

But in general we only get to observe

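$$Z = \big(X_{1}, A_{1}, R_{2}, R_{2}(Y_{1}, X_{2}, A_{2}), \ldots, R_{T}, R_{T}(Y_{T-1}, X_{T}, A_{T}), R_{T+1}, R_{T+1}Y_{T}\big)$$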

where $R_{t}=\mathbbm{1}\{\text{still in the study at time } t\}$ is an indicator for whether the subject contributes data at time $t$. We write $R_{t}(Y_{t-1},X_{t},A_{t})$ as a shorthand for $(R_{t}Y_{t-1},R_{t}X_{t},R_{t}A_{t})$, so in the missingness process that we consider, subjects can drop out at each time after the measurement of covariates/treatment. We focus on this pattern because it is likely the most common type of dropout, since outcomes $Y_{t}$ at time $t$ are often measured together with or just prior to covariates $X_{t+1}$ at time $t+1$. As we consider a monotone dropout (i.e., right-censoring) process, $R_{t}$ is non-increasing in time $t$, i.e.,

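$$\begin{cases} R_{t}=1 \Rightarrow (R_{1},\ldots,R_{t-1})=\bm{1} \\ R_{t}=0 \Rightarrow (R_{t+1},\ldots,R_{T})=\bm{0}, \end{cases}$$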

where $\bm{0},\bm{1}$ are vectors of zeros and ones respectively. Thus our data structure $Z$ is a chain with $t$ -th component

$$\{R_{t}, R_{t}(Y_{t-1}, X_{t}, A_{t})\}$$

for $t=1,...,T+1$ , where $R_{1}=1$ and we do not use $Y_{0}$ or $X_{T+1},A_{T+1}$ . Although we suppose each subject’s dropout will occur before the $t$ -th stage, our data structure also covers the case when the dropout will occur after the $t$ -th stage because in that case we can write

$$\{R_{t}(Y_{t-1}, X_{t}, A_{t}), R_{t+1}\}$$

as the $t$ -th component of our chain.
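The monotone structure described above can be checked mechanically. The following is a minimal sketch, assuming the dropout indicators are stored as an `n × (T+1)` 0/1 array; the helper names `is_monotone_dropout` and `mask_after_dropout` and the toy values are illustrative and do not come from the paper or its repository.

```python
import numpy as np

def is_monotone_dropout(R):
    """Return True if every row of the 0/1 array R is non-increasing in time."""
    R = np.asarray(R)
    return bool(np.all(R[:, 1:] <= R[:, :-1]))

def mask_after_dropout(values, R):
    """Keep entries where R == 1 and replace the rest with NaN (unobserved)."""
    values = np.asarray(values, dtype=float)
    return np.where(np.asarray(R) == 1, values, np.nan)

# Three hypothetical subjects with indicators R_1..R_4 (R_1 = 1 by convention).
R = np.array([[1, 1, 1, 1],   # completes the study
              [1, 1, 0, 0],   # drops out before stage 3
              [1, 0, 0, 0]])  # drops out before stage 2
X = np.arange(12, dtype=float).reshape(3, 4)  # placeholder covariate values
X_obs = mask_after_dropout(X, R)
```

Once a row of `R` hits zero it stays at zero, which is exactly the right-censoring pattern the chain representation encodes.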

For simplicity, we consider binary treatment in this paper, so that the support of each $A_{t}$ is $\mathcal{A}=\{0,1\}$. We use overbars and underbars to denote the past history and future values of a variable, respectively, so that $\overline{X}_{t}=(X_{1},...,X_{t})$ and $\underline{A}_{t}=(A_{t},...,A_{T})$ for example. We also write $H_{t}=(\overline{X}_{t},\overline{A}_{t-1},\overline{Y}_{t-1})$ to denote all the observed past history just prior to receiving treatment at time $t$, with support $\mathcal{H}_{t}$. Finally, we use lower-case letters $a_{t},h_{t},x_{t}$ to represent realized values for $A_{t},H_{t},X_{t}$, unless stated otherwise.

Now that we have defined our data structure we turn to our estimation goal, i.e., which treatment effects we aim to estimate. We use $Y_{t}^{\overline{a}_{t}}$ to denote the potential (counterfactual) outcome at time $t$ that would have been observed under a treatment sequence $\overline{a}_{t}=(a_{1},...,a_{t})$ (note that we have $Y_{t}^{\overline{a}_{T}}=Y_{t}^{\overline{a}_{t}}$ as long as the future cannot cause the past). In longitudinal causal problems it is common to pursue quantities such as $\mathbb{E}(Y_{t}^{\overline{a}_{t}})$ , i.e., the mean outcome at a given time under particular treatment sequences $\overline{a}_{t}$ ; for example one might compare the mean outcome under $\overline{a}_{t}=\bm{1}$ versus $\overline{a}_{t}=\bm{0}$ , which represents how outcomes would change if all versus none were treated at all times. However identifying these effects requires strong positivity assumptions (i.e., that all have some chance at receiving every treatment at every time), and estimating these effects often requires untenable parametric assumptions especially when $t\gg 1$ .

Following Kennedy (2019) we instead consider incremental intervention effects, which represent how mean outcomes would change if the odds of treatment at each time were multiplied by a factor $\delta$ (e.g., $\delta=2$ means odds of treatment are doubled). Incremental interventions shift propensity scores rather than impose treatments themselves; they represent what would happen if treatment were gradually more or less likely to be assigned, relative to the natural/observational treatment, in the population. Since they are ‘population-level’ effects, they are useful for giving an interpretable picture to understand the overall societal effects, but will likely be less useful than classical deterministic effects for making specific recommendations about optimal treatment. Nonetheless, there are a number of benefits of studying incremental intervention effects: for example, positivity assumptions can be entirely and naturally avoided; complex effects under a wide range of intensities can be summarized with a single curve in $\delta$ , no matter how many timepoints $T$ there are; and they more closely align with actual intervention effects than their fixed treatment regime counterparts. We refer to Kennedy (2019) for more discussion and details on the tradeoff between deterministic and incremental intervention effects.

Formally, incremental interventions are dynamic stochastic interventions where treatment is assigned based on new interventional propensity scores defined by

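$$q_{t}(h_{t};\delta,\pi_{t}) = \frac{\delta\pi_{t}(h_{t})}{\delta\pi_{t}(h_{t})+1-\pi_{t}(h_{t})}, \tag{2}$$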

not the observational propensity scores $\pi_{t}(h_{t})=\mathbb{P}(A_{t}=1\mid H_{t}=h_{t})$ . In other words, $q_{t}$ is a shifted version of $\pi_{t}$ obtained by multiplying the odds of receiving treatment by $\delta$ . We denote potential outcomes under the above intervention as $Y_{t}^{\overline{Q}_{t}(\delta)}$ where $\overline{Q}_{t}(\delta)=\{Q_{1}(\delta),...,Q_{t}(\delta)\}$ represents a sequence of draws from the conditional distributions $Q_{s}(\delta)\mid H_{s}=h_{s}\sim\text{Bernoulli}\{q_{s}(h_{s};\delta,\pi_{s})\}$ , $s=1,...,t$ . We often drop $\delta$ and write $Q_{t}=Q_{t}(\delta)$ when the dependence is clear from the context. Note here we use capital letters for the intervention indices since they are random, as opposed to $Y_{t}^{\overline{a}_{t}}$ where the intervention is deterministic. Therefore in this paper, we aim to estimate the mean counterfactual outcome

$$\psi_{t}(\delta) = \mathbb{E}\big(Y_{t}^{\overline{Q}_{t}(\delta)}\big)$$

for any $t\leq T$ . In the next section we describe the necessary conditions for identifying $\psi_{t}(\delta)$ in the presence of dropout.
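To make the shift from $\pi_{t}$ to $q_{t}$ concrete, here is a minimal sketch of the odds-multiplication map. The function name `shift_propensity` and the example propensities are assumptions made for illustration only.

```python
import numpy as np

def shift_propensity(pi, delta):
    """Multiply the odds of treatment by delta: q = delta*pi / (delta*pi + 1 - pi)."""
    pi = np.asarray(pi, dtype=float)
    return delta * pi / (delta * pi + 1.0 - pi)

# Illustrative propensities, including the boundary values 0 and 1.
pi = np.array([0.0, 0.2, 0.5, 1.0])
q = shift_propensity(pi, delta=2.0)

# For interior values the odds q/(1-q) equal delta times pi/(1-pi).
odds_ratio = (q[1:3] / (1 - q[1:3])) / (pi[1:3] / (1 - pi[1:3]))
```

Propensities of exactly 0 or 1 are mapped to themselves, which gives some intuition for why the intervention stays well defined without a positivity condition on treatment.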

Remark 1.

To be precise, the incremental effect $\psi_{t}(\delta)$ compounds two different changes. Consider only the first two timepoints. In this case the propensity score under the incremental intervention at the later timepoint will differ from its observational value for two reasons: 1) the odds of treatment are multiplied by $\delta$, and 2) the history it depends on has itself been changed by the (incremental) intervention at the earlier timepoint. With many timepoints, these effects compound over time and are summarized by a single number, the incremental effect. This nuance stems from the nature of incremental interventions, i.e., the way they depend on the observational treatment process through $q_{t}$.

3 Identification

In this section, we will give assumptions under which the entire marginal distribution of the counterfactual outcome $Y_{t}^{\overline{Q}_{t}(\delta)}$ is identified. Specifically, we require the following assumptions for all $t\leq T$ .

Assumption A1.

$Y_{t}=Y_{t}^{\overline{a}_{t}}$ if $\overline{A}_{t}=\overline{a}_{t}$.

Assumption A2-E.

$A_{t^{\prime}}\perp\!\!\!\perp Y_{t}^{\overline{a}_{t}}\mid H_{t^{\prime}}$, $\forall t^{\prime}\leq t$.

Assumption A2-M.

$R_{t}\perp\!\!\!\perp(\underline{X}_{t},\underline{A}_{t},\underline{Y}_{t-1})\mid H_{t-1},A_{t-1},R_{t-1}=1$.

Assumption A3.

$\mathbb{P}(R_{t}=1\mid H_{t-1},A_{t-1},R_{t-1}=1)>\epsilon_{\omega}$ a.s. for some $\epsilon_{\omega}>0$.

Assumptions (A1) and (A2-E) correspond to consistency and exchangeability (or sequential ignorability) respectively, which are commonly adopted in the literature. Consistency means that the observed outcomes are equal to the corresponding potential outcomes under the observed treatment sequence, and would be violated in settings with interference for example. Exchangeability means that the treatment and counterfactual outcome are independent, conditional on the observed past (if there were no dropout), i.e., that treatment is as good as randomized at each time conditional on the past. Experiments ensure that exchangeability holds by construction.

In our work, we additionally require assumptions (A2-M) and (A3) because of the missingness/dropout. (A2-M) is the standard time-varying missing-at-random (MAR) assumption for monotone missingness, ensuring that dropout is independent of the future conditional on the observed history up to the current time point (e.g., Council et al., 2010; Robins et al., 1995; van der Laan and Robins, 2003). One may think of this type of MAR assumption as a sequentially random dropout process, where the decision to drop out at time $t$ is like the flip of a coin, with probability of ‘heads’ (dropout) depending only on the measurements recorded through time $t-1$ (Council et al., 2010, Chapter 4). This would be a reasonable assumption if we can collect enough data to explain the dropout process, so that those who drop out look like those who do not, given all past observed data. (A3) is a positivity assumption for missingness, meaning that each subject in the study has some non-zero chance of staying in the study at the next timepoint. This would be expected to hold in many studies, but may not if some subjects are ‘doomed’ to drop out based on their specific measured characteristics.
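The coin-flip reading of (A2-M) and the lower bound in (A3) can be illustrated with a small simulation. This is only a sketch under assumed ingredients: the logistic form of the stay probability, its coefficients, the floor `eps`, and the function name `simulate_monotone_dropout` are all hypothetical and are not the simulation design used in the paper.

```python
import numpy as np

def simulate_monotone_dropout(X, rng, eps=0.05):
    """
    X   : (n, T) array holding one scalar covariate per stage (toy input).
    Returns R of shape (n, T + 1); column j holds R_{j+1}, and R_1 = 1.
    """
    n, T = X.shape
    R = np.ones((n, T + 1), dtype=int)
    for t in range(1, T + 1):
        # Probability of staying depends only on data recorded so far (here X_t).
        stay = 1.0 / (1.0 + np.exp(-(1.5 + 0.5 * X[:, t - 1])))
        stay = np.clip(stay, eps, 1.0)       # bounded away from zero, as in (A3)
        coin = rng.random(n) < stay          # sequentially random "coin flip"
        R[:, t] = R[:, t - 1] * coin         # once out, always out (monotone)
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
R = simulate_monotone_dropout(X, rng)
```

Because each flip uses only previously recorded data and the previous indicator, the simulated process is missing at random and monotone by construction.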

Importantly, here we do not require positivity conditions on the propensity scores as we are targeting the effects $\psi_{t}(\delta)$ with the incremental intervention $q_{t}$ defined in (2), not deterministic effects. The next result gives an identifying expression for $\psi_{t}(\delta)$ under the above assumptions.

Theorem 3.1.

Suppose identification assumptions (A1)-(A3) hold. Then for all $t\leq T$, the incremental effect on outcome $Y_{t}$ for a given value of $\delta\in[\delta_{l},\delta_{u}]$, $0<\delta_{l}\leq\delta_{u}<\infty$, equals

[TABLE]

where $\overline{\mathcal{X}}_{t}=\mathcal{X}_{1}\times\cdots\times\mathcal{X}_{t}$, $\overline{\mathcal{A}}_{t}=\mathcal{A}_{1}\times\cdots\times\mathcal{A}_{t}$, $\mu(h_{t},a_{t},R_{t+1}=1)=\mathbb{E}(Y_{t}\mid H_{t}=h_{t},A_{t}=a_{t},R_{t+1}=1)$, and

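$$q_{s}(a_{s}\mid h_{s},R_{s}=1) = \frac{a_{s}\delta\pi_{s}(h_{s},R_{s}=1)+(1-a_{s})\{1-\pi_{s}(h_{s},R_{s}=1)\}}{\delta\pi_{s}(h_{s},R_{s}=1)+1-\pi_{s}(h_{s},R_{s}=1)}.$$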

Here, $\pi_{s}(h_{s},R_{s}=1)=\mathbb{P}(A_{s}=1\mid H_{s}=h_{s},R_{s}=1)$ and $\nu$ is some dominating measure for the distribution of $A_{s}$ .

To derive the identification result in Theorem 3.1 we use, as in Kennedy (2019), the g-formula (Robins, 1986), plugging in the incremental intervention for the treatment distribution and a point mass at $1$ for the right-censoring indicator, and then apply the identification lemma (Lemma F.1 in the appendix) under the additional assumptions (A2-M) and (A3). The next corollary illustrates what this identification result gives in the simple point-exposure study.

Corollary 3.1.

When $T=1$ , the data structure reduces to

$$Z = (X, A, R, RY),$$

thus in this case $R=1$ means the outcome is not missing. Then the identifying expression simplifies to

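$$\psi(\delta) = \mathbb{E}\left[\frac{\delta\pi(X)\mu(X,1,1)+\{1-\pi(X)\}\mu(X,0,1)}{\delta\pi(X)+\{1-\pi(X)\}}\right]$$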

where $\pi(X)=\mathbb{P}(A=1\mid X)$ and $\mu(x,a,1)=\mathbb{E}(Y\mid X=x,A=a,R=1)$ .

Therefore when $T=1$ , the effect $\psi(\delta)$ is simply a weighted average of the two regression functions $\mu(X,1,1)$ , $\mu(X,0,1)$ among those with observed outcomes, with weights depending on the propensity scores and $\delta$ .
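The weighted-average reading of the point-exposure expression can be sketched as follows, assuming the propensity scores and the two regression functions have already been evaluated at each subject's covariates. The function name `psi_point_exposure` and the toy inputs are illustrative assumptions.

```python
import numpy as np

def psi_point_exposure(pi, mu1, mu0, delta):
    """
    Plug-in version of the T = 1 identifying expression.
    pi  : P(A = 1 | X) evaluated at each subject's covariates
    mu1 : E(Y | X, A = 1, R = 1) at the same covariates
    mu0 : E(Y | X, A = 0, R = 1) at the same covariates
    """
    pi, mu1, mu0 = (np.asarray(v, dtype=float) for v in (pi, mu1, mu0))
    num = delta * pi * mu1 + (1.0 - pi) * mu0
    den = delta * pi + (1.0 - pi)
    return float(np.mean(num / den))

# Toy inputs: treatment raises the outcome from 0 to 2 for everyone.
pi = np.array([0.1, 0.5, 0.9])
mu1 = np.array([2.0, 2.0, 2.0])
mu0 = np.array([0.0, 0.0, 0.0])
psi_observational = psi_point_exposure(pi, mu1, mu0, delta=1.0)
psi_doubled_odds = psi_point_exposure(pi, mu1, mu0, delta=2.0)
```

With $\delta=1$ the weights are the observational propensities, so the value is the mean outcome under the observed treatment process; larger $\delta$ moves weight toward $\mu(X,1,1)$.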

4 Efficiency Theory

In the previous section, we showed that the incremental intervention effect adjusted for subject dropout can be identified without requiring any positivity conditions on the treatment process. Our main goal in this section is to develop a nonparametric efficiency theory based on the efficient influence function for $\psi_{t}(\delta)$ .

The efficient influence function plays a crucial role in non/semiparametric efficiency theory because 1) its variance gives an asymptotic efficiency bound, and 2) its form indicates how to do an appropriate bias correction in order to construct estimators that attain such efficiency bound. Mathematically, given a target parameter $\psi$ an influence function $\phi$ acts as the derivative term in a distributional analog of a Taylor expansion, which can be seen to imply

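$$\frac{\partial\psi(\mathbb{P}_{\epsilon})}{\partial\epsilon}\Big|_{\epsilon=0} = \int\phi(z;\mathbb{P})\left(\frac{\partial\log d\mathbb{P}_{\epsilon}(z)}{\partial\epsilon}\right)\Big|_{\epsilon=0}\,d\mathbb{P}(z) \tag{5}$$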

for all smooth parametric submodels $\mathbb{P}_{\epsilon}$ containing the true distribution at $\epsilon=0$, i.e., $\mathbb{P}_{\epsilon=0}=\mathbb{P}$. Among all influence functions, the efficient influence function is the one whose variance equals the greatest of the lower bounds over all parametric submodels $\mathbb{P}_{\epsilon}$, and so gives the efficiency bound for estimating $\psi$. For more details we refer to Section D in the appendix and references therein (Bickel et al., 1998; van der Vaart, 1998; van der Laan and Robins, 2003; Tsiatis, 2006; Kennedy, 2016).

The next theorem gives an expression for the efficient influence function for our incremental effect $\psi_{t}(\delta)$ , under a nonparametric model.

Theorem 4.1.

The (uncentered) efficient influence function for the intervention effect $\psi_{t}(\delta)$ , $\forall t\leq T$ , is given by

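$$
\begin{aligned}
&\sum_{s=1}^{t}\left(\frac{1}{\delta A_{s}+1-A_{s}}\right)\Bigg[\frac{\{m_{s}(H_{s},1)-m_{s}(H_{s},0)\}\,\delta\,(A_{s}-\pi_{s}(H_{s}))\,\omega_{s}(H_{s},A_{s})}{\delta\pi_{s}(H_{s})+1-\pi_{s}(H_{s})} \\
&\qquad+\begin{pmatrix}\delta m_{s}(H_{s},1)\{\pi_{s}(H_{s})\omega_{s}(H_{s},A_{s})-A_{s}R_{s+1}\}\\ +\,m_{s}(H_{s},0)\{(1-\pi_{s}(H_{s}))\omega_{s}(H_{s},A_{s})-(1-A_{s})R_{s+1}\}\end{pmatrix}\Bigg] \\
&\qquad\times\prod_{k=1}^{s}\left\{\frac{\delta A_{k}+1-A_{k}}{\delta\pi_{k}(H_{k})+1-\pi_{k}(H_{k})}\cdot\frac{R_{k}}{\omega_{k}(H_{k},A_{k})}\right\} \\
&+\prod_{s=1}^{t}\left\{\frac{\delta A_{s}+1-A_{s}}{\delta\pi_{s}(H_{s})+1-\pi_{s}(H_{s})}\cdot\frac{R_{s}}{\omega_{s}(H_{s},A_{s})}\right\}Y_{t}R_{t+1},
\end{aligned}
$$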

where $\pi_{s}(h_{s})=\mathbb{P}(A_{s}=1\mid H_{s}=h_{s},R_{s}=1)$ , $\omega_{s}(H_{s},A_{s})=d\mathbb{P}(R_{s+1}=1\mid H_{s},A_{s},R_{s}=1)$ , and

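$$m_{s}(h_{s},a_{s},R_{s+1}=1) = \int_{\mathcal{R}_{s}}\mu(h_{t},a_{t},R_{t+1}=1)\prod_{k=s+1}^{t}q_{k}(a_{k}\mid h_{k},R_{k}=1)\,d\nu(a_{k})\,d\mathbb{P}(y_{k-1},x_{k}\mid h_{k-1},a_{k-1},R_{k}=1)$$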

for all $s\leq t$. Here $\mathcal{R}_{s}=(\overline{\mathcal{X}}_{t}\times\overline{\mathcal{A}}_{t})\setminus(\overline{\mathcal{X}}_{s}\times\overline{\mathcal{A}}_{s})$, $\mu(h_{t},a_{t},R_{t+1}=1)=\mathbb{E}(Y_{t}\mid H_{t}=h_{t},A_{t}=a_{t},R_{t+1}=1)$, and $\nu$ is a dominating measure for the distribution of $A_{k}$.

The proof is given in Appendix F.2. This result will be used to construct an efficient, model-free estimator for our new incremental intervention effects in the next section. In Theorem 4.1 all terms are either estimated via regression tools or obtained directly from the observed data. Note that we have new weighting terms such as $\frac{\mathbbm{1}\left(R_{s}=1\right)}{\omega_{s}(H_{s},A_{s})}$ that are used to adjust for dropout effects at each stage $s\leq t$. As one may expect, should all the data be fully observed (i.e., $\mathbb{P}[R_{t}=1]=1$ a.e. $[\mathbb{P}]$ for all $t\leq T$), both the identifying expression and efficient influence function reduce to the formulas presented in Kennedy (2019).

Remark 2.

Although we derived the above efficient influence function from first principles, based on the pathwise differentiability in (5), it could equivalently be derived using results on mapping complete- to observed-data influence functions under general coarsening at random (e.g., Robins et al., 1994, 1995; van der Laan and Robins, 2003; Tsiatis, 2006). However, in either case, computing error bounds requires deriving the second-order remainder terms in the von Mises expansion, which is new in our work and not immediate from the earlier results.

The above efficient influence function involves three types of nuisance functions: the treatment propensity scores $\pi_{s}(H_{s})$, the missingness/dropout propensity scores $\omega_{s}(H_{s},A_{s})$, and the pseudo-outcome regression functions $m_{s}(H_{s},A_{s},R_{s+1}=1)$, $\forall s\leq t$. As in Kennedy (2019), each $m_{s}$ can be estimated through sequential regressions without resorting to complicated conditional density estimation, since they are marginalized versions of the full regression functions $\mu(h_{s},a_{s},R_{s+1}=1)$ that condition on all of the past. We give the sequential regression formulation for $m_{s}$ in Appendix E.1.
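As a hedged sketch of why sequential regression suffices, one can peel the $k=s+1$ factor off the integral definition of $m_{s}$ in Theorem 4.1. Writing $m_{s}(h_{s},a_{s})$ as shorthand for $m_{s}(h_{s},a_{s},R_{s+1}=1)$, this suggests the backward recursion below. It is an illustration derived from that definition for binary treatment, and the formulation in Appendix E.1 may differ in detail.

```latex
\begin{aligned}
m_{t}(h_{t},a_{t}) &= \mu(h_{t},a_{t},R_{t+1}=1), \\
m_{s}(h_{s},a_{s}) &= \mathbb{E}\Big[\sum_{a\in\{0,1\}} q_{s+1}(a\mid H_{s+1},R_{s+1}=1)\,
      m_{s+1}(H_{s+1},a)\;\Big|\;H_{s}=h_{s},A_{s}=a_{s},R_{s+1}=1\Big],
      \qquad s<t.
\end{aligned}
```

Each step is an ordinary regression of a constructed pseudo-outcome on $(H_{s},A_{s})$ among subjects still in the study, so no conditional density of the covariates needs to be estimated.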

The efficient influence function corresponding to $T=1$ follows a relatively simple and intuitive form, equaling a weighted average of the efficient influence functions for $\mathbb{E}(Y^{1})$ and $\mathbb{E}(Y^{0})$ plus contributions from the propensity scores $\omega_{s},\pi_{s}$ . We give this result in Appendix E.2 as well.

5 Estimation and Inference

5.1 Proposed Estimator

In this section we develop an estimator that can attain fast $\sqrt{n}$ rates, even when the nuisance functions are modeled nonparametrically and estimated at slower rates.

To begin, let $\varphi(Z;\bm{\eta},\delta,t)$ denote the uncentered efficient influence function from Theorem 4.1, which is a function of $Z$ , indexed by a set of nuisance functions

[TABLE]

$\delta$ , and $t\leq T$ , where $\pi_{t},m_{t},\omega_{t}$ are the same nuisance functions defined in Theorem 4.1.

Since $\mathbb{E}[\varphi(Z;\bm{\eta},\delta,t)]=\psi_{t}(\delta)$ , a natural estimator would be the naive plug-in $Z$ -estimator

[TABLE]

where $\hat{\bm{\eta}}$ represents a set of nuisance function estimates and $\mathbb{P}_{n}$ denotes the empirical measure, so that sample averages can be written as $\frac{1}{n}\sum_{i}f(Z_{i})=\mathbb{P}_{n}\{f(Z)\}=\int f(z)d\mathbb{P}_{n}(z)$ .

If $\pi_{t}$ and $\omega_{t}$ can be assumed to follow correctly specified parametric models, then one could use the following simple inverse-probability-weighted (IPW) estimator

[TABLE]

Note that this IPW estimator is a special case of $\hat{\psi}_{inc.pi}$ where $\hat{m}_{s}$ is set to zero for all $s\leq t$ .

However, the above inverse-weighted and plug-in $Z$ -estimators typically require both strong parametric assumptions and empirical process conditions (e.g., Donsker-type or low-entropy conditions) that restrict the flexibility of the nuisance estimators. The latter conditions are needed because the data are used twice (once for estimating the nuisance functions, and again for the bias correction, i.e., the average of the uncentered influence function), which can cause overfitting. To avoid this downside and make our estimator more practically useful, here we use sample splitting (Zheng and Laan, 2010; Chernozhukov et al., 2016; Kennedy, 2019; Robins and Hernán, 2008). As will be seen shortly, sample splitting allows us to avoid complex empirical process conditions even when all the nuisance functions $\bm{\eta}$ are estimated with arbitrarily flexible methods. Further, bias-corrected influence function-based estimators allow us to tolerate slower rates for nuisance estimation while attaining faster rates for estimation of the parameter of interest.

Now we give an algorithm allowing slower than $\sqrt{n}$ rates and non-Donsker complex nuisance estimation as follows. First, we randomly split the observations $(Z_{1},...,Z_{n})$ into $K$ disjoint groups, using a random variable $S_{i}$ , $i=1,...,n$ , drawn independently of the data, where each $S_{i}\in\{1,...,K\}$ denotes the group membership for unit $i$ . Then our proposed estimator is given by

[TABLE]

where we let $\mathbb{P}_{n}^{(k)}$ denote sample averages only over a group $k$ , i.e., $\{i:S_{i}=k\}$ , and let $\hat{\bm{\eta}}_{-k}$ denote the nuisance estimator constructed excluding the group $k$ . We detail exactly how to compute the proposed estimator $\widehat{\psi}_{t}(\delta)$ in Appendix A.

Our methods effectively utilize all the observed samples available at each time, without any need to discard a subset of the observed samples in advance. It is also worth noting that our algorithm is amenable to parallelization due to the sample splitting.
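The sample-splitting scheme above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the algorithm of Appendix A: the names `cross_fit`, `fit_constant_propensity`, and `ipw_phi` are hypothetical, the nuisance "estimator" is a training-fold mean chosen to keep the block self-contained, and the toy $\varphi$ is a single-timepoint IPW-type term without dropout rather than the efficient influence function of Theorem 4.1. The sketch also pools the held-out values into one average, which coincides with averaging the $K$ fold-specific means only when the folds have equal size.

```python
import numpy as np

def cross_fit(Z, fit_nuisance, phi, K=2, seed=0):
    """Pooled cross-fitted average of phi(Z_i; eta_hat fitted without unit i's fold).

    Z            : dict of equal-length numpy arrays (one entry per variable)
    fit_nuisance : callable(train_dict) -> eta_hat (any object)
    phi          : callable(eval_dict, eta_hat) -> array of per-unit values
    """
    n = len(next(iter(Z.values())))
    rng = np.random.default_rng(seed)
    # Fold labels S_i are drawn independently of the data; permuting balanced
    # labels keeps the K groups (nearly) equal in size.
    S = rng.permutation(np.arange(n) % K)
    values = np.empty(n)
    for k in range(K):
        train = {name: v[S != k] for name, v in Z.items()}
        held_out = {name: v[S == k] for name, v in Z.items()}
        eta_hat = fit_nuisance(train)            # uses only units outside fold k
        values[S == k] = phi(held_out, eta_hat)  # evaluated only on fold k
    return values.mean(), values

# --- toy instance (hypothetical): one timepoint, no dropout, constant propensity ---
def fit_constant_propensity(train):
    return {"pi": train["A"].mean()}

def ipw_phi(delta):
    def phi(d, eta):
        w = (delta * d["A"] + 1 - d["A"]) / (delta * eta["pi"] + 1 - eta["pi"])
        return w * d["Y"]
    return phi

rng = np.random.default_rng(1)
n = 2000
A = rng.binomial(1, 0.5, size=n)
Y = 10 + A + rng.standard_normal(n)
psi_hat, phi_values = cross_fit({"A": A, "Y": Y}, fit_constant_propensity, ipw_phi(2.0))
```

In this toy design the target is $10+\delta p/(\delta p+1-p)$ with $\delta=2$ and $p=0.5$ , so `psi_hat` should land near $10.67$ up to sampling error. Because each fold only needs the nuisance fit from the remaining folds, the loop over `k` can be run in parallel.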

5.2 Asymptotic Theory

This subsection characterizes the asymptotic behavior of our proposed estimator, showing that $\widehat{\psi}_{t}(\delta)$ is $\sqrt{n}$ -consistent and asymptotically normal even when the nuisance functions are estimated nonparametrically at much slower than $\sqrt{n}$ rates.

In what follows we denote the $L_{2}(\mathbb{P})$ norm of function $f$ by $\|f\|=\left(\int f(z)^{2}d\mathbb{P}(z)\right)^{1/2}$ , to distinguish it from the ordinary $L_{2}$ norm $\|\cdot\|_{2}$ for a fixed vector. Also note that although we used $m_{s}$ to denote the pseudo-regression function defined in Theorem 4.1, in principle they are indexed by both the time $s$ and increment parameter $\delta$ as in $m_{s,\delta}$ . The next theorem shows uniform convergence of $\hat{\psi}_{t}(\delta)$ , which lays the foundation for subsequent statistical inferential and testing procedures.

Theorem 5.1.

Define the variance function as $\sigma^{2}(\delta,t)=\mathbb{E}\left[\left(\varphi(Z;\bm{\eta},\delta,t)-\psi_{t}(\delta)\right)^{2}\right]$ and let $\hat{\sigma}^{2}(\delta,t)=\mathbb{P}_{n}\left[\left(\varphi(Z;\hat{\bm{\eta}}_{-S},\delta,t)-\hat{\psi}_{t}(\delta)\right)^{2}\right]$ denote its estimator. Assume:

1)

The set $\mathcal{D}=[\delta_{l},\delta_{u}]$ is bounded with $0<\delta_{l}\leq\delta_{u}<\infty$ .

2)

$\mathbb{P}\left[\left|m_{s}(H_{s},A_{s},R_{s+1}=1)\right|\leq C\right]=\mathbb{P}\left[\left|\hat{m}_{s}(H_{s},A_{s},R_{s+1}=1)\right|\leq C\right]=1$ , $\forall s\leq t$ , for some constant $C<\infty$ .

3)

$\sup_{\delta\in\mathcal{D}}\left|\frac{\hat{\sigma}^{2}(\delta,t)}{\sigma^{2}(\delta,t)}-1\right|=o_{\mathbb{P}}(1)$ , and $\left\|\sup_{\delta\in\mathcal{D}}\left|\varphi(Z;\bm{\eta},\delta,t)-\varphi(Z;\hat{\bm{\eta}}_{-S},\delta,t)\right|\right\|=o_{\mathbb{P}}(1)$ .

4)

$\left(\sup_{\delta\in\mathcal{D}}\|m_{s,\delta}-\widehat{m}_{s,\delta}\|+\|\pi_{s}-\widehat{\pi}_{s}\|\right)\Big(\|\widehat{\pi}_{r}-{\pi}_{r}\|+\|\widehat{\omega}_{r}-{\omega}_{r}\|\Big)=o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)$ , $\forall r\leq s\leq t$ .

Then we have

[TABLE]

in $l^{\infty}(\mathcal{D})$ , where $\mathbb{G}$ is a mean-zero Gaussian process with covariance

$\mathbb{E}[\mathbb{G}(\delta_{1},t_{1})\mathbb{G}(\delta_{2},t_{2})]=\mathbb{E}\left[\widetilde{\varphi}(Z;\bm{\eta},\delta_{1},t_{1})\widetilde{\varphi}(Z;\bm{\eta},\delta_{2},t_{2})\right]$ and $\widetilde{\varphi}(Z;\bm{\eta},\delta,t)=\frac{\varphi(Z;\bm{\eta},\delta,t)-\psi_{t}(\delta)}{\sigma(\delta,t)}$ .

A proof of the above theorem can be found in Appendix F.5. We also analyze the second-order remainders for the efficient influence function, keeping the intervention distribution completely general (see Lemmas F.2, F.7, and F.8 in the appendix). Therefore, one may apply our results to studies of other stochastic interventions under missingness/dropout as well.

Assumptions 1), 2) and 3) in Theorem 5.1 are all quite weak. Assumptions 1) and 2) are mild boundedness conditions, where assumption 2) could be further relaxed at the expense of a less simple proof, for example using bounds on $L_{p}$ norms. Assumption 3) is also a mild consistency assumption, with no requirement on the rate of convergence. The main substantive assumption is Assumption 4), which requires that the products of nuisance estimation errors vanish at fast enough rates. One sufficient condition for this is that all the nuisance functions are consistently estimated at $n^{1/4}$ rates or faster, i.e., with $L_{2}(\mathbb{P})$ errors of order $o_{\mathbb{P}}(n^{-1/4})$ , so that each product of two errors is $o_{\mathbb{P}}(n^{-1/2})$ .

Lowering the bar from $\sqrt{n}$ to $n^{1/4}$ indeed allows us to employ a richer set of modern machine learning tools, since such rates are attainable under diverse structural constraints (e.g., Yang et al., 2015; Raskutti et al., 2012; Györfi et al., 2006). In this paper, however, we are agnostic about how such rates should be attained. In practice, we may want to consider using different estimation techniques for each of $\bm{\pi,m,\omega}$ based on our prior knowledge and descriptive information, or use ensemble learners.

Based on the result in Theorem 5.1, we can construct pointwise $1-\alpha$ confidence intervals for $\psi_{t}(\delta)$ as

[TABLE]

where $\hat{\sigma}^{2}(\delta,t)$ is the variance estimator defined in Theorem 5.1. Following Kennedy (2019), one may use the multiplier bootstrap for uniform inference, by replacing the $z_{1-\alpha/2}$ critical value with $c_{\alpha}$ satisfying

[TABLE]

We refer to Kennedy (2019) for details on how to construct $c_{\alpha}$ via the multiplier bootstrap.
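As an illustration of the two kinds of intervals, the sketch below computes pointwise Wald intervals and a multiplier-bootstrap critical value from a matrix of influence-function values. It rests on assumptions that should be checked against Kennedy (2019): the pointwise interval is taken to have the standard form $\hat{\psi}_{t}(\delta)\pm z_{1-\alpha/2}\,\hat{\sigma}(\delta,t)/\sqrt{n}$ , and $c_{\alpha}$ is approximated by the $1-\alpha$ quantile of $\sup_{\delta}|n^{-1/2}\sum_{i}\xi_{i}\widetilde{\varphi}_{i}(\delta)|$ with Rademacher multipliers $\xi_{i}$ . The function names and the synthetic `phi_values` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def pointwise_ci(phi_values, alpha=0.05):
    """phi_values: (n, D) array; column d holds the cross-fitted phi values at delta_d."""
    n = phi_values.shape[0]
    psi_hat = phi_values.mean(axis=0)
    sigma_hat = np.sqrt(((phi_values - psi_hat) ** 2).mean(axis=0))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma_hat / np.sqrt(n)
    return psi_hat, psi_hat - half_width, psi_hat + half_width

def uniform_critical_value(phi_values, alpha=0.05, n_boot=2000, seed=0):
    """Multiplier-bootstrap approximation of the critical value for a uniform band."""
    n = phi_values.shape[0]
    psi_hat = phi_values.mean(axis=0)
    sigma_hat = np.sqrt(((phi_values - psi_hat) ** 2).mean(axis=0))
    phi_std = (phi_values - psi_hat) / sigma_hat      # standardized values, shape (n, D)
    rng = np.random.default_rng(seed)
    xi = rng.choice([-1.0, 1.0], size=(n_boot, n))    # Rademacher multipliers
    sup_stats = np.abs(xi @ phi_std / np.sqrt(n)).max(axis=1)
    return np.quantile(sup_stats, 1 - alpha)

# Synthetic stand-in for influence-function values on a grid of D = 5 deltas.
rng = np.random.default_rng(2)
phi_values = 0.3 + rng.standard_normal((500, 5))
psi_hat, lower, upper = pointwise_ci(phi_values)
c_alpha = uniform_critical_value(phi_values)
```

A uniform band is then obtained by replacing the normal quantile with `c_alpha` in the half-width; since the supremum is taken over several values of $\delta$ , `c_alpha` will generally exceed $z_{1-\alpha/2}$ .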

6 Infinite Time Horizon Analysis

The great majority of the causal inference literature considers a finite time horizon in which the number of timepoints $T$ is small and fixed, or even just equal to one, ruling out a priori most (if not all) longitudinal structure. In practice, however, more and more studies accumulate data across very many timepoints, owing to continuing advances in data collection technology. In fact, in many applications $T$ can be comparable to or even larger than the sample size $n$ . This calls into question most classical methods based on finite time horizons, since their theoretical results have not been validated in regimes where $T$ can grow to infinity. For example, Kumar et al. (2013) describe how new mobile and wearable sensing technologies have revolutionized randomized trials and other health-care studies by providing data at very high sampling rates (10-500 times per second). Klasnja et al. (2015) and Qian et al. (2020) use $210$ timepoints in their study of micro-randomized trials for evaluating just-in-time adaptive interventions via mobile applications. As increasingly fine-grained data are collected, some recent studies explore efficient off-policy estimation techniques in infinite-horizon settings (e.g., Liu et al. (2018) in reinforcement learning). Interestingly, though, there has been no formal analysis for general longitudinal studies.

Therefore in this section we analyze the behavior of the IPW version of our proposed estimator (relative to the standard IPW estimator in classical deterministic settings), in a more realistic regime where $T$ can scale with sample size. To the best of our knowledge, this is one of the first such infinite-horizon analyses in causal inference, outside of some recent similarly specialized examples involving dynamic treatment regimes (Laber et al., 2018; Ertefaie and Strawderman, 2018). Specifically, we study the variance ratio bound, and show how deterministic effects are afflicted by an inflated variance relative to incremental intervention effects as $T$ grows.

We proceed by comparing the variances of estimators of the deterministic effect for the always-treated (receiving treatment at every timepoint) versus the incremental effect for $\delta>1$ . For simplicity and concreteness, in what follows we consider a simple setup where the propensity scores are all equal to $p$ (i.e., $\pi_{t}(H_{t})=p$ for all $t$ ) and there is no dropout (i.e., $\mathbb{P}\{R_{t+1}=1\}=1$ a.e. $[\mathbb{P}]$ for all $t=1,...,T$ ). This makes the pseudo-regression functions $m_{s}$ in Theorem 4.1 equal to zero. In this setup we have unbiased estimators of the always-treated effect $\psi_{at}=\mathbb{E}(Y^{\overline{\bm{1}}_{T}})$ and the incremental effect $\psi_{inc}=\mathbb{E}(Y^{\overline{Q}_{T}(\delta)})$ given by

[TABLE]

and

[TABLE]

respectively, where $Y=Y_{T}$ . In the next theorem, we analyze the variance ratio of the two estimators and show that one can achieve near-exponential precision gains by targeting $\psi_{inc}$ .

Theorem 6.1.

Consider the estimators and conditions defined above. Further assume that $\left|Y\right|\leq{b_{u}}$ for some constant $b_{u}>0$ and $\mathbb{E}\left[\left(Y^{\overline{\bm{1}}_{T}}\right)^{2}\right]>0$ . Then for any $T\geq 1$ ,

[TABLE]

where $C_{T}=\frac{b_{u}^{2}}{\mathbb{E}\left[\left(Y^{\overline{\bm{1}}_{T}}\right)^{2}\right]}$ and $\zeta(T;p)=\left(1+\frac{c\left(\mathbb{E}\left[Y^{\overline{\bm{1}}_{T}}\right]\right)^{2}}{\left(1/p\right)^{T}\mathbb{E}\left[\left(Y^{\overline{\bm{1}}_{T}}\right)^{2}\right]}\right)$ for any fixed value of $c$ such that $\frac{1}{1-p^{T}{\left(\mathbb{E}\left[Y^{\overline{\bm{1}}_{T}}\right]\right)^{2}}\big/{\mathbb{E}\left[\left(Y^{\overline{\bm{1}}_{T}}\right)^{2}\right]}}\leq{c}$ .

The proof of the above theorem is given in Appendix F.3 and follows logic similar to that used in deriving the g-formula (Robins, 1986). Note that we only require two very mild assumptions in the above theorem: the boundedness assumption on $Y$ , and $\mathbb{E}[(Y^{\overline{\bm{1}}_{T}})^{2}]>0$ , which is equivalent to saying that $Y^{\overline{\bm{1}}_{T}}$ is not almost surely zero. In the proof, we give a more general result for any sequence $\overline{a}_{T}\in\overline{\mathcal{A}}_{T}$ as well.

Theorem 6.1 allows us to precisely quantify the relative statistical certainty in estimating the two effects. Specifically, since $\frac{\delta^{2}p^{2}+p(1-p)}{(\delta p+1-p)^{2}}<1$ for $\delta>1$ and $\zeta(T;p)$ is bounded (and converges to one monotonically), the variance ratio decays exponentially in $T$ . This implies that we may reap extraordinary gains in statistical precision from targeting ${\psi}_{inc}$ instead of ${\psi}_{at}$ when the study involves a substantial number of timepoints. The same goes for effects for the never-treated versus the incremental interventions with $\delta<1$ (see Appendix F.3).
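To get a feel for the exponential decay, the short Python sketch below evaluates the per-timepoint factor $\frac{\delta^{2}p^{2}+p(1-p)}{(\delta p+1-p)^{2}}$ and its $T$ -th power. It only tracks this leading factor; the constants $C_{T}$ and $\zeta(T;p)$ in Theorem 6.1 are not included, and the function name `variance_ratio_base` is hypothetical.

```python
def variance_ratio_base(delta, p):
    """Per-timepoint factor (delta^2 p^2 + p(1-p)) / (delta p + 1 - p)^2."""
    return (delta**2 * p**2 + p * (1 - p)) / (delta * p + 1 - p) ** 2

if __name__ == "__main__":
    for delta in (1.5, 2.5, 5.0):
        r = variance_ratio_base(delta, p=0.5)
        powers = [f"{r**T:.2e}" for T in (5, 10, 50)]
        print(f"delta={delta}: factor={r:.4f}, factor^T for T=5,10,50: {powers}")
```

For $\delta=2.5$ and $p=0.5$ the factor is about $0.59$ , so its $T$ -th power shrinks quickly. The factor equals $p$ at $\delta=1$ and approaches one as $\delta\rightarrow\infty$ , in line with the two target parameters becoming identical in that limit.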

Remark 3.

The variance ratio we study in this section is somewhat distinct from the usual relative efficiency, since here we are considering two different (but closely related) target parameters. However, when we are indifferent about the inferential target, the variance ratio can still serve as a useful guide in selecting an estimator. As $\delta\rightarrow\infty$ , the gap between the two effects monotonically shrinks to zero and the two target parameters $\psi_{at}$ and $\psi_{inc}$ eventually become identical, so the variance ratio goes to $1$ . On the other hand, for finite $\delta$ , the two effects are not quite the same, and which of the two is of more interest is debatable ( $\psi_{inc}$ could indeed be preferable if we aim to describe how outcomes would vary with more practical gradual changes in treatment intensity). If we do not have a strong reason to prefer one effect over the other, we could choose the one with smaller variance in favor of improved statistical precision. This issue also arises for local effects under positivity violations, instrumental variables, etc. (e.g., Imbens, 2014; Aronow, 2016; Crump et al., 2009), where an estimand is adaptively chosen on the basis of smaller variance.

In what follows we refine Theorem 6.1 so that one can characterize the minimum number of timepoints to guarantee a smaller variance for ${\psi}_{inc}$ .

Corollary 6.1.

There exists a finite number $T_{min}$ such that

[TABLE]

for every $T>T_{min}$ , where $T_{min}$ is never greater than

[TABLE]

The proof is given in Appendix F.4. It relies upon the fact that $\text{Var}(\widehat{\psi}_{inc})$ can be represented as the variance of a weighted sum of the IPW estimators for $Y^{\overline{a}_{T}}$ , $\forall\overline{a}_{T}\in\overline{\mathcal{A}}_{T}$ (see Lemma F.6 in the appendix).

Remark 4.

It may be possible to further tighten the upper bound for $T_{min}$ , but given the focus of our paper this would be neither very illuminating nor practically meaningful, since the value of $T_{min}$ in the above corollary is already quite small in general. To illustrate, consider $Y\in[0,1]$ , $\delta=2.5,p=0.5$ , and an extreme case of $c_{\bm{1}}=0.05$ (i.e., $Y^{\overline{\bm{1}}_{T}}$ is mostly concentrated around [math]). Then $T_{min}=6$ . If we use $\delta=5,p=0.5$ , then $T_{min}=9$ .

Theorem 6.1 and Corollary 6.1 can be generalized to the case of observational studies where the nuisance functions need to be estimated, but our view is that the simple case captures the main ideas and the general case would only add complexity.

To empirically assess the validity of Theorem 6.1, we conduct two simple simulation studies as below.

Simulation 1 (Randomized Trial). We set $p=0.5$ and let $Y\ \mid\ \overline{A}_{t}\sim N\left(10+{A}_{t},1\right)$ truncated at $\pm$ two standard deviations. Based on this data generation process, given a value of $\delta$ , we generate 100 different datasets for $t=1,...,50$ , $n=500$ , where we make sure the positivity assumption is valid in our simulation (this is done in a spirit similar to Laplace smoothing in naive Bayes). Then we compute the sample variance of each estimator and the corresponding ratio, and present them in Figure 1.
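A stripped-down version of Simulation 1 can be sketched in Python as follows. This is not the code behind Figure 1: it assumes the two IPW estimators take the forms $\mathbb{P}_{n}\{\prod_{s}(A_{s}/p)\,Y\}$ and $\mathbb{P}_{n}\{\prod_{s}\frac{\delta A_{s}+1-A_{s}}{\delta p+1-p}\,Y\}$ , it omits the smoothing step used to guarantee positivity (so it is only meaningful for small $T$ , where always-treated units still occur), and the function names are hypothetical.

```python
import numpy as np

def truncated_std_normal(rng, size, bound=2.0):
    """Standard normal draws conditioned on |z| <= bound (rejection sampling)."""
    z = rng.standard_normal(size)
    bad = np.abs(z) > bound
    while bad.any():
        z[bad] = rng.standard_normal(int(bad.sum()))
        bad = np.abs(z) > bound
    return z

def simulate_variance_ratio(T, delta=2.5, p=0.5, n=500, reps=100, seed=0):
    """Monte Carlo estimate of Var(psi_hat_inc) / Var(psi_hat_at) at horizon T."""
    rng = np.random.default_rng(seed)
    est_at = np.empty(reps)
    est_inc = np.empty(reps)
    for r in range(reps):
        A = rng.binomial(1, p, size=(n, T))              # randomized treatment path
        Y = 10 + A[:, -1] + truncated_std_normal(rng, n)  # outcome depends on A_T
        w_at = np.prod(A / p, axis=1)                     # nonzero only if always treated
        w_inc = np.prod((delta * A + 1 - A) / (delta * p + 1 - p), axis=1)
        est_at[r] = np.mean(w_at * Y)
        est_inc[r] = np.mean(w_inc * Y)
    return est_inc.var(ddof=1) / est_at.var(ddof=1)

ratio_T5 = simulate_variance_ratio(T=5)
ratio_T8 = simulate_variance_ratio(T=8)
```

Under these assumptions both ratios should fall well below one, and the ratio is expected to shrink as `T` grows, up to Monte Carlo noise from the 100 replications.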

Simulation 2 (Observational Study). Although not directly addressed in Theorem 6.1, here we also consider the setting for observational studies. Specifically, we consider a model

[TABLE]

for all $t\leq T$ , where we let $\mu(\overline{X}_{t},\overline{A}_{t})=10+A_{t}+A_{t-1}+\left|\bm{1}^{\top}X_{t}+\bm{1}^{\top}X_{t-1}\right|$ , $\bm{1}=[1,1]^{\top}$ , and let $\mathrm{expit}$ denote the inverse logit function. In this simulation, we assume that a subject is more likely to receive treatment if they have received treatment recently, and less likely otherwise. The rest of the specification remains the same as in Simulation 1. The result is presented in Figure 2.

The simulation results support our theoretical results. Overall, the results in this section provide crucial insight into longitudinal studies with many timepoints, suggesting that massive gains in statistical certainty are possible by studying incremental rather than classical deterministic effects.

7 Experiments

7.1 Simulation Study

In this section we explore finite-sample performance of the proposed estimator $\hat{\psi}_{t}(\delta)$ via synthetic simulation. We consider the following data generation model

[TABLE]

for $s=1,...,t$ , where $r_{d}>0$ is a constant used to control the average amount of dropout. In this setup we assume that, in general, subjects who are more likely to have been treated are also more likely to receive treatment at the next timepoint. Moreover, the dropout probability at each $s\leq t$ is largely driven by the sign of $X_{3,s}$ : the dropout probability will be low (high) if $X_{3,s}>0$ $(<0)$ . Therefore, although each $X_{3,s}$ is designed to have a symmetric, bimodal distribution with a mean of [math], the value of $X_{3,s}$ for surviving subjects will tend to be much greater than [math]. Because $X_{3,s}$ also interacts with the outcome in the above model, discarding all the subjects that have dropped out should lead to an upward-biased estimate of the incremental intervention effects. In Appendix B, we provide auxiliary figures to aid understanding of our simulation. Variables akin to $X_{3}$ that considerably affect both outcome and dropout are commonly found in practice (e.g., side effects).

We estimate the incremental effect at $t=4$ . We compare our proposed estimator ( ${\hat{\psi}_{ours}}$ ) with three baseline methods: the naive Z-estimator ( $\hat{\psi}_{inc.pi}$ ) and the IPW estimator ( $\hat{\psi}_{inc.ipw}$ ), both of which are defined in Section 5.1, and the original incremental-effect estimator ( $\hat{\psi}_{inc.nc}$ ) proposed by Kennedy (2019). Note that for using $\hat{\psi}_{inc.nc}$ we have to discard samples that have ever dropped out, whereas in other estimators surviving subjects are properly re-weighted for dropout adjustment at each timepoint. Since finite-sample properties of $\hat{\psi}_{inc.nc}$ were already extensively explored in Kennedy (2019), here we primarily focus on the effect of dropout in the longitudinal setting.

To estimate nuisance parameters, following Kennedy (2019) we form an ensemble of some widely-used nonparametric models. Specifically, we use the cross-validation-based super learner ensemble algorithm (Van der Laan et al., 2007) via the SuperLearner package in R to combine support vector machines, random forests, and k-nearest neighbor regression. For ${\hat{\psi}_{ours}}$ and $\hat{\psi}_{inc.nc}$ , we use $K=2$ -fold sample splitting.

We repeat the simulation $S=250$ times, drawing $n$ samples in each repetition. We use $D=30$ values of $\delta$ equally spaced on the log-scale within $[0.2,3]$ . As in Kennedy (2019), performance of each estimator is assessed via integrated bias and root-mean-squared error (RMSE) defined by

[TABLE]

where $\hat{\psi}_{s}(t;\delta_{d})$ and ${\psi}(t;\delta_{d})$ are the estimate and true value of the target parameter respectively, for $s$ -th simulation and $\delta_{d}$ . We present the results in Table 1.
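The evaluation grid and the two summary metrics can be sketched in Python as below. The grid follows the description above ( $D=30$ values of $\delta$ equally spaced on the log scale in $[0.2,3]$ ). The metric function is an assumption: it averages over $\delta$ the absolute Monte Carlo bias and the Monte Carlo RMSE, without any additional scaling (for instance by $\sqrt{n}$ ) that the paper's definition may include; `integrated_bias_rmse` is a hypothetical name.

```python
import numpy as np

# D = 30 values of delta, equally spaced in log(delta) between 0.2 and 3.
deltas = np.geomspace(0.2, 3.0, num=30)

def integrated_bias_rmse(estimates, truth):
    """Average over the delta grid of |Monte Carlo bias| and of Monte Carlo RMSE.

    estimates : (S, D) array, row s holds psi_hat_s(t; delta_d) for d = 1..D
    truth     : (D,) array of true values psi(t; delta_d)
    """
    bias = np.mean(np.abs(estimates.mean(axis=0) - truth))
    rmse = np.mean(np.sqrt(np.mean((estimates - truth) ** 2, axis=0)))
    return bias, rmse

# Synthetic check: estimates that are off by exactly 0.1 everywhere.
truth = np.log(deltas)
estimates = np.tile(truth + 0.1, (250, 1))
bias, rmse = integrated_bias_rmse(estimates, truth)
```

With a constant offset and no noise, both summaries equal the offset; with noisy estimates the RMSE summary additionally reflects the Monte Carlo variance at each $\delta_{d}$ .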

As shown in Table 1, when there is a substantial amount of subject dropout, $\hat{\psi}_{inc.nc}$ performs much worse than all the dropout-adjusted estimators, as expected from the design of our data generation model. However, this gap shrinks as dropout rates decrease. Also, in each setting the proposed estimator ${\hat{\psi}_{ours}}$ appears to perform better, and to improve more markedly with sample size, than $\hat{\psi}_{inc.pi}$ and $\hat{\psi}_{inc.ipw}$ . This behavior is consistent with our theory that ${\hat{\psi}_{ours}}$ is not only able to adjust for the dropout process, but is also more efficient than $\hat{\psi}_{inc.pi}$ and $\hat{\psi}_{inc.ipw}$ .

7.2 Application

Here we illustrate the proposed methods in analyzing the Effects of Aspirin on Gestation and Reproduction (EAGeR) data, which evaluates the effect of daily low-dose aspirin on pregnancy outcomes and complications. The EAGeR trial was the first randomized trial to evaluate the effect of pre-conception low-dose aspirin on pregnancy outcomes (Schisterman et al., 2014; Mumford et al., 2016). However, to date this evidence has been limited to intention-to-treat analyses.

The design and protocol used for the EAGeR study have been previously documented (Schisterman et al., 2013). Overall, 1,228 women were recruited into the study (615 aspirin, 613 placebo) and 11% of participants chose to drop out of the study before completion. Roughly 43,000 person-weeks of information were available from daily diaries, as well as study questionnaires, and clinical and telephone evaluations collected at regular intervals over follow-up. The dataset is characterized by a substantial degree of non-compliance (more than 50% at the end of the study), and thereby is susceptible to positivity violation.

We used our incremental propensity score approach to evaluate the effect of aspirin on pregnancy outcomes in the EAGeR trial, accounting for time-varying exposure and dropout. We let each variable become a constant equal to its final value after the time point at which no more actual data are collected on the subject, so we have balanced panel data as described in (1). Here, the study terminates at week 89 ( $T=89$ ). We use 24 baseline covariates (e.g., age, race, income, education, etc.) and 5 time-dependent covariates (compliance, conception, vaginal bleeding, nausea and GI discomfort). $A_{t}$ is a binary treatment variable coded as $1$ if a woman took aspirin at time $t$ and $0$ otherwise. $R_{t}=1$ indicates that the woman is observed in the study at time $t$ . Lastly, $Y_{t}$ is an indicator of having a pregnancy outcome of interest at time $t$ . We are particularly interested in two types of pregnancy outcomes: live birth and pregnancy loss (fetal loss). We perform a separate analysis for each of the two cases.

For comparative purposes, we estimate the simple complete-case effect

[TABLE]

which relies on both non-compliance and drop-out being completely randomized. The value of $\widehat{\psi}_{CC}$ is 0.052 (5.2%) for live birth and 0.012 (1.2%) for pregnancy loss, both of which are close to the intention-to-treat estimates reported in Schisterman et al. (2013, 2014).

We briefly discuss why standard modeling approaches fail here. We found strong evidence of positivity violations in the EAGeR dataset; as shown in Figure 7 in Appendix C, the average propensity score quickly drops to zero as $t$ grows. This suggests that very few patients follow the given protocol of taking aspirin late in the study. Thus, it is unrealistic to use an intervention where all participants would take aspirin at every time, as required in many standard models including the popular marginal structural models (MSMs) (Robins et al., 2000). In fact, when we modeled the effect curve by $\mathbb{E}[Y^{\overline{a}_{T}}]=m(\overline{a}_{T};\beta)=\beta_{0}+\sum_{t=1}^{T}\beta_{1t}a_{t}$ so that the coefficient for exposure can vary with time, the standard inverse-weighted MSM fits failed and no coefficient estimates were found even for moderate values of $T$ , e.g., $T\approx 10$ (see Figure 7-(b) in Appendix C for a closer look). This positivity violation precludes other standard approaches for time-varying treatments as well.

One quick remedy could be to move away from standard ATEs and instead only estimate the mean outcome if no one were treated, comparing it to the observed outcome (not all versus none as in the ATE). Then we can apply some other nonparametric approaches available in the literature for estimating this one-sided counterfactual. When we use the g-computation (plug-in) estimator (Robins, 1986), the result seems to suggest that the mean outcome if no one received aspirin is worse than the observed one (Figure 8 in Appendix C). However, when we use the sequential doubly robust (SDR) estimator (Luedtke et al., 2017), the large overlap between the 95% confidence intervals prevents us from drawing any firm conclusion (Figure 9 in Appendix C).

Now, we estimate the incremental effect curve $\psi_{T}(\delta)$ , which represents the probability of having live birth or pregnancy loss at the end of the study ( $t=T$ ) if the odds of taking aspirin for all women were increased by a factor of $\delta$ at all timepoints, across different values of $\delta$ . Again, we use the cross-validated super learner algorithm (Van der Laan et al., 2007) to combine support vector machines, random forests, and k-nearest neighbor regression to estimate a tuple of nuisance functions $(m_{t},\omega_{t},\pi_{t})$ at each $t\leq T$ . We proceed with sample splitting with $K=2$ splits, and use 10,000 bootstrap replications to compute pointwise and uniform confidence intervals. Results are shown in Figure 3.

The estimated curve in Figure 3 appears to be almost flat for live birth, and to have a slightly negative gradient with respect to $\delta$ (odds ratio) for pregnancy loss. However, at level $\alpha=.05$ we fail to reject the null of no incremental intervention effects in both cases (as both confidence bands contain a horizontal line). This is mainly due to the noncompliance of aspirin takers, which makes the bands too wide for large $\delta$ regimes. Thus, our analysis yielded a similar result to the previous findings of Schisterman et al. (2014), indicating that use of low-dose aspirin was not significantly associated with live birth or pregnancy loss. Nonetheless, the estimated incremental intervention effects provide more detailed and nuanced information, while requiring neither parametric nor positivity assumptions.

Remark 5.

In this analysis we have looked into the intervention that does tell us about the effect of overall increase or decrease in treatment at each $t$ , but not the optimal timing of treatment (e.g., when aspirin should be prescribed since conception). As pointed out by Kennedy (2019), one could address such timing issues by considering $\delta$ depending on time and covariate history, which will bring added complexity. We leave this to our future work.

8 Discussion

Incremental interventions are a novel class of stochastic dynamic intervention where positivity assumptions can be completely avoided. However, they had not been extended to repeated outcomes, and without further assumptions do not give identifiability under dropout, both of which are very common in practice. In this paper we solved this problem by showing how incremental intervention effects are identified and can be estimated when dropout occurs (conditionally) at random. Even in the case of many dropouts, our proposed method efficiently uses all the data without sacrificing robustness. We gave an identifying expression for incremental intervention effects under monotone dropout, without requiring any positivity assumptions. We established general efficiency theory and constructed the efficient influence function, and presented nonparametric estimators which converge at fast rates and yield uniform inferential guarantees, even when all the nuisance functions are estimated with flexible machine learning tools at slower rates. Furthermore, we analyzed the variance ratio of incremental intervention effects to conventional deterministic dynamic intervention effects in a novel infinite time horizon setting in which the number of timepoints can grow with sample size, and showed that incremental intervention effects can yield near-exponential gains in statistical precision. Finally, we showed via a simulation study that the proposed methods can effectively mitigate the bias caused by subject dropout, and applied the methods to a study of the effect of low-dose aspirin on pregnancy outcomes.

There are a number of avenues for future work. The first is application to other substantive problems in medicine and the social sciences. For example, in a forthcoming paper we analyze the effect of aspirin on pregnancy outcomes with more extensive data. It will also be important to consider other types of missingness, such as non-monotone missingness, where the standard time-varying MAR assumption A2-M may not be appropriate (Sun and Tchetgen, 2014; Tchetgen et al., 2016). We expect that our approach can be extended to other important problems in causal inference; for example, one could develop incremental intervention effects for continuous treatments and instruments (Kennedy et al., 2017, 2019), or for mediation in the same spirit as Díaz and Hejazi (2019), but generalized to the longitudinal case with dropout. Developing incremental-based sensitivity analyses for the longitudinal MAR assumption would also be an interesting extension.

Acknowledgement

Edward Kennedy and Ashley Naimi gratefully acknowledge financial support from the NSF (Grant # DMS1810979) and NIH (Grant # R01HD093602) for this research, respectively. This work was also supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, contract numbers HHSN267200603423, HHSN267200603424, and HHSN267200603426. We are also grateful for useful comments from two anonymous referees. This work was completed while Kwangho Kim was a PhD student at Carnegie Mellon University.

Appendix A Algorithm

An algorithm detailing how to compute the proposed estimator (6) at $t\leq T$ is given in Algorithm 1 as below.

Appendix B Auxiliary figures for the simulation study

We provide some auxiliary figures to help readers better understand the simulation setup and result presented in Section 7.1 using a random example. Figures 4 and 5 illustrate how the dropout process may induce a large upward bias in estimation of incremental effects as shown in Figure 6. Figure 6 also shows our methods can successfully adjust for dropout. All the results in this example are measured at $t=4$ with the dropout rate of 52.5%.

Appendix C Alternative approaches for the EAGeR data analysis

Here, we discuss why standard approaches fail for our analysis of the EAGeR dataset in Section 7.2 of the main text. For comparative purposes, we alter our target effect and then apply some other nonparametric approaches available in the literature. Then we compare the result with the one we obtained in Section 7.2.

C.1 Why standard models fail: positivity violation

All the standard models dealing with time-varying treatments, except on very rare occasions, require treatment positivity. However, as will be elaborated below, positivity is likely violated in the EAGeR dataset. Many individuals turned out not to follow the given protocol of taking aspirin, and this non-compliance only worsens over time. To illustrate this, we present the average propensity score over time in Figure 7-(a). As shown there, the average propensity score quickly drops to zero as $t$ grows. In other words, Figure 7-(a) implies that it would be hard to imagine having all of the study participants take aspirin at each time.

Even if positivity is only nearly violated, it can pose a serious problem in attempting to estimate our target causal effect. One of the most widely-used approaches to handle time-varying treatments is marginal structural models (MSMs) (Robins et al., 2000). In practice, MSMs are often estimated via inverse probability weighting (IPW). The following quantity appears in the IPW (also in the doubly robust) moment condition

[TABLE]

for any choice of $h$ (with matching dimensions), where $\widehat{\pi}_{t}(a_{t})=\widehat{\mathbb{P}}(A_{t}=a_{t}\mid H_{t})$ . However, Figure 7-(b) indicates that, on average, the cumulative product of propensity scores drops sharply to zero even for moderate $t$ . This causes standard estimation techniques such as IPW to fail: as $\mathbb{P}_{n}(\prod_{j=1}^{T}\widehat{\pi}_{j})$ approaches zero, the inverse weights built from this product easily blow up.
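To make the instability concrete, the following is a minimal sketch (not the analysis code used for the EAGeR data) of how a cumulative product of propensity scores and the corresponding inverse weights behave as the horizon grows. The function names and the constant propensity of 0.3 are hypothetical choices made only for illustration.

```python
import numpy as np

def cumulative_propensity(pi_hat):
    """Row-wise cumulative product of estimated propensity scores.

    pi_hat has shape (n, T); pi_hat[i, t] is the estimated probability that
    unit i receives its target treatment at timepoint t + 1.
    """
    return np.cumprod(np.asarray(pi_hat, dtype=float), axis=1)

def ipw_weights(pi_hat):
    """Inverse-probability weights 1 / prod_j pi_hat_j at each horizon."""
    return 1.0 / cumulative_propensity(pi_hat)

# Hypothetical example: every unit has propensity 0.3 at all 10 timepoints.
pi_hat = np.full((5, 10), 0.3)
avg_cumprod = cumulative_propensity(pi_hat).mean(axis=0)  # shrinks geometrically
max_weight = ipw_weights(pi_hat).max(axis=0)              # grows geometrically
print(avg_cumprod[-1], max_weight[-1])
```

Even with a moderate propensity, ten timepoints already produce weights in the hundreds of thousands, which is the behavior suggested by Figure 7-(b).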

More specifically, when we parametrically model the effect curve by $\mathbb{E}[Y^{\overline{a}_{T}}]=m(\overline{a}_{T};\beta)=\beta_{0}+\sum_{t=1}^{T}\beta_{1t}a_{t}$ so that the coefficients for exposure can vary with time, an inverse-weighted MSM estimator that is the solution to

[TABLE]

indeed fails, since the above equation yields no coefficient estimates even for moderate values of $T$ , e.g., $T\approx 10$ . Thus, it appears that the positivity violation in our dataset precludes the standard MSM-based approach. We remark that these limitations are not at all unique to the analysis of our EAGeR dataset, but are common to many observational studies based on MSMs or other approaches (e.g., Luedtke et al., 2017).

C.2 Alternative approach

Due to the positivity violation, any estimates obtained through standard approaches remain dubious at best. Therefore, we alter our target contrast from the standard ATE to the mean outcome we would have observed in a population if “observed” versus none (rather than all versus none) were treated, which is defined by

[TABLE]

where $\overline{a^{\text{obs}}}$ denotes an observed history of aspirin consumption. This new estimand tells us how the mean outcome would have changed if no one in the population had taken aspirin throughout the study. In this way, we can avoid estimating the problematic counterfactual $\mathbb{E}\big{[}Y^{\bar{A}_{T}=\bar{\mathbf{1}},\bar{R}_{T}=\bar{\mathbf{1}}}\big{]}$ . However, by construction this solution entails a fundamental limitation, because we have sacrificed the causal effect of original interest.

To estimate our new causal parameter (7), we use the g-computation (plug-in) estimator (Robins, 1986) and the sequential doubly robust (SDR) estimator proposed by Luedtke et al. (2017), which allows right-censored data structures. (Footnote 2: We also tried a weighting estimator but omit the result here, since it gives almost the same result as g-computation, only with wider confidence bands.)

C.3 Estimation and inference

Estimation. First for the g-computation estimator, we estimate the following g-formula

[TABLE]

via plug-in estimators of the pseudo-outcome regression function at each time step. Next, for the SDR estimator, we tailor Algorithm 2 of Luedtke et al. (2017) to our right-censored data structures (everything remains the same except that we add the condition $\overline{R}_{t-1}=\overline{\mathbf{1}}_{t-1}$ to each pseudo-outcome regression function). For both methods, we use the same nonparametric ensemble as in Section 7.2.

Inference. Confidence intervals are estimated by bootstrapping at the 95% level for both estimators. Note that for the SDR estimator, the bootstrap is guaranteed to estimate standard errors consistently (pointwise) due to the following asymptotic property,

[TABLE]

for all $t\leq T$ , where $\phi_{\tau}(t)$ is the influence function of $\widehat{\tau}_{\text{obs}}(t)$ . However, this is no longer guaranteed for the g-computation estimator.
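As an illustration of the bootstrap step, the following is a minimal sketch of a nonparametric bootstrap confidence interval for a generic estimator. The text does not state which bootstrap interval is used, so the percentile interval, the function name `bootstrap_ci`, and its default arguments are assumptions made for this example.

```python
import numpy as np

def bootstrap_ci(data, estimator, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for estimator(data).

    data is resampled along its first axis (one row per subject), and
    estimator maps a resampled dataset to a scalar.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample subjects with replacement
        stats[b] = estimator(data[idx])
    alpha = 1.0 - level
    lower, upper = np.quantile(stats, [alpha / 2, 1.0 - alpha / 2])
    return lower, upper

# Hypothetical usage with the sample mean as the estimator.
sample = np.arange(100.0)
lower, upper = bootstrap_ci(sample, np.mean)
print(lower, upper)
```

For the SDR estimator, `estimator` would rerun the whole fitting procedure on each resample; for the g-computation estimator the same code runs, but without the asymptotic guarantee discussed above.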

C.4 Result

For the sake of completeness, we estimate $\tau_{\text{obs}}(t)$ for all $t=2,\ldots,89$ and present the cumulative effects over time $t$ . The results for the g-computation and the SDR estimators are presented in Figures 8 and 9, respectively.

The result based on the g-computation estimator in Figure 8 shows that the counterfactual mean outcomes for never-takers (individuals who never take aspirin throughout the study) are worse than the observed ones. Specifically, for the never-takers the probability of live birth decreases and the probability of fetal loss increases. The result appears to be statistically significant at $T=89$ .

On the other hand, the result based on the SDR estimator in Figure 9 indicates that, although the mean outcomes for the never-takers still appear to be worse than the observed ones, the differences no longer appear statistically significant. Hence in this case we cannot draw any firm conclusion about the effect of aspirin on pregnancy outcomes.

It might be tempting to take the results from Figure 8, as they seem to deliver a clearer message. However, we do not know whether our variance estimates there are correct in the first place. Given the moderate sample size ( $n$ =1024), the accuracy of our estimate is further impaired by the slow convergence rates of the plug-in regression. These issues can be mitigated by the SDR estimator. Thus, we should rather rely on the results presented in Figure 9, which indicate that the effect of low-dose aspirin is insignificant and remains dubious, based on the causal effect defined in (7).

Finally, it should be noted that, due to the positivity violation, we end up limiting ourselves to a narrower notion of causal effect (i.e., observed versus none), which differs from the ATE-type estimands that are typically of utmost interest to policy makers. The causal effect in (7) might not be practically meaningful for aspirin prescription in pregnancy, since we are in general much more interested in the always-taker group than in the never-taker group.

Appendix D More details on influence functions and efficiency bound

Here we introduce the influence function, a foundational object of statistical theory that allows us to characterize a wide range of estimators with favorable theoretical properties. There are two notions of the influence function: one for estimators and the other for parameters. To distinguish the two, we call the latter, which corresponds to parameters, influence curves, as in, for example, Boos and Stefanski (2013) and Kennedy (2016). (Footnote 3: The terms ‘influence curve’ and ‘influence function’ are, however, used interchangeably in many cases.) The primary sources for this section are Kennedy (2014, 2016, 2020), from which all the terms, definitions and results are directly borrowed.

First, we give a definition of influence curves. They were first introduced by Hampel (1974) as a general way to find an approximation-by-averages representation of a functional statistic. We only consider nonparametric models here.

Suppose that we are given a target functional $\psi$ . For a nonparametric model $\mathbb{P}$ , let $\{\mathbb{P}_{\epsilon},\epsilon\in\mathbb{R}\}$ denote a smooth parametric submodel for $\mathbb{P}$ with $\mathbb{P}_{\epsilon=0}=\mathbb{P}$ . A typical example of such a parametric submodel is $\{\mathbb{P}_{\epsilon}:p_{\epsilon}(z)=p(z)(1+\epsilon s(z))\}$ for some mean-zero, uniformly bounded function $s$ . Then the influence curve for the parameter $\psi(\mathbb{P})$ is defined as any mean-zero, finite-variance function $\phi(\mathbb{P})$ that satisfies the following pathwise differentiability,

[TABLE]

The above pathwise differentiability implies that our target parameter $\psi$ is smooth enough to admit a von Mises expansion: for two distributions $\mathbb{P},\mathbb{Q}$

[TABLE]

where $R_{2}$ is a second-order remainder. Therefore, the influence curve corresponds to the functional derivative in a von Mises expansion of $\psi$ .

One can obtain the classical Cramér-Rao lower bound for each parametric submodel $\mathbb{P}_{\epsilon}$ ; the Cramér-Rao lower bound for $\mathbb{P}_{\epsilon}$ is $\psi^{\prime}(\mathbb{P}_{\epsilon})^{2}/\mathbb{E}(s_{\epsilon}^{2})$ where $\psi^{\prime}(\mathbb{P}_{\epsilon})=\frac{\partial}{\partial\epsilon}\psi(\mathbb{P}_{\epsilon})\big{|}_{\epsilon=0}$ and $s_{\epsilon}=s_{\epsilon}(z)=\frac{\partial}{\partial\epsilon}\log d\mathbb{P}_{\epsilon}\big{|}_{\epsilon=0}$ . The asymptotic variance of any nonparametric estimator is no smaller than the supremum of the Cramér-Rao lower bounds over all parametric submodels, and it is known that under the above pathwise differentiability condition the greatest such lower bound is given by

[TABLE]

Hence, $\mathbb{E}(\phi^{2})=\text{Var}(\phi)$ is the nonparametric analog of the Cramér-Rao lower bound, and we call the influence curve that attains the above bound the efficient influence curve. The efficient influence curve gives the efficiency bound for estimating $\psi$ . In parametric models, more than one influence curve may exist, whereas in a nonparametric model the influence curve is unique. In either case, the efficient influence curve is always unique.

Once the efficient influence curve is known, no estimator can be more efficient than $\hat{\psi}(\mathbb{P})$ such that

[TABLE]

as $\text{Var}(\phi)$ serves as our nonparametric efficiency bound. In (10), we call $\phi$ the (efficient) influence function for the estimator $\hat{\psi}$ . (Footnote 4: In fact, influence curves themselves are the putative influence functions.) For each nonparametric estimator, the efficient influence function, if it exists, is almost surely unique, so in this sense the influence function contains all information about an estimator’s asymptotic behavior. In other words, if we know the influence function for an estimator, we know its asymptotic distribution and can easily construct confidence intervals and hypothesis tests.

Characterizing the influence curves is crucial not only to give the efficiency bound for estimating $\psi$ , thus providing a benchmark against which estimators can be compared, but, probably more importantly, to construct estimators with very favorable properties, such as double robustness or general second-order bias. One can find an (asymptotically linear) estimator that satisfies (10) by solving appropriate estimating equations using the influence curves. Section F.2 of the appendix contains an example of developing an efficient, model-free estimator based on the efficient influence curve of the target parameter.

Finally, we remark that for complicated functionals, pretending that $Z$ takes values in a discrete space can make it easier to characterize influence curves. For example, assuming that our unit space is discrete, the influence curve $\phi(\mathbb{P})$ for the functional $\psi(\mathbb{P})$ can be defined by
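The last point can be sketched in code. Assuming an estimator is asymptotically linear with influence function $\phi$ , a Wald-type interval follows from the empirical variance of the estimated influence values. The helper name `wald_ci` and the sample-mean example (for which $\phi(Z)=Z-\psi$ ) are illustrative assumptions, not part of the original text.

```python
import numpy as np
from statistics import NormalDist

def wald_ci(estimate, influence_values, level=0.95):
    """Wald interval: estimate +/- z * sqrt(Var(phi) / n).

    influence_values holds the estimated influence function evaluated at
    each of the n observations.
    """
    influence_values = np.asarray(influence_values, dtype=float)
    n = influence_values.shape[0]
    std_error = np.sqrt(np.var(influence_values, ddof=1) / n)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # standard normal quantile
    return estimate - z * std_error, estimate + z * std_error

# Sample mean: the influence function is phi(Z) = Z - psi.
z_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
psi_hat = z_obs.mean()
ci_low, ci_high = wald_ci(psi_hat, z_obs - psi_hat)
print(ci_low, ci_high)
```

The same helper applies to any estimator whose influence values can be evaluated, which is what makes the influence function convenient for inference.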

[TABLE]

where we let $\delta_{z}$ be the Dirac measure at $Z=z$ . This definition is equivalent to the Gâteaux derivative of $\psi$ at $\mathbb{P}$ in the direction of the point mass $(\delta_{z}-\mathbb{P})$ (see, for example, Chapter 5 in Boos and Stefanski (2013)).

For more details on nonparametric efficiency theory and influence functions, we refer to Kennedy (2014, 2016, 2020); van der Laan and Robins (2003); Tsiatis (2006).
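The discrete-space recipe can be checked numerically. The sketch below approximates the Gâteaux derivative in the direction $(\delta_{z}-\mathbb{P})$ by a central finite difference and compares it with the known influence curve of the variance functional, $(z-\mu)^{2}-\sigma^{2}$ . The three-point support, the probabilities, and the function names are hypothetical choices for illustration.

```python
import numpy as np

def gateaux_influence(psi, p, z_index, eps=1e-6):
    """Finite-difference derivative of psi along P + eps * (delta_z - P)."""
    p = np.asarray(p, dtype=float)
    point_mass = np.zeros_like(p)
    point_mass[z_index] = 1.0
    direction = point_mass - p
    return (psi(p + eps * direction) - psi(p - eps * direction)) / (2.0 * eps)

support = np.array([0.0, 1.0, 2.0])   # hypothetical discrete unit space

def variance_functional(p):
    """psi(P) = Var_P(Z) for Z taking values in `support`."""
    mean = np.sum(p * support)
    return np.sum(p * support**2) - mean**2

p = np.array([0.2, 0.5, 0.3])
numeric = np.array([gateaux_influence(variance_functional, p, k) for k in range(3)])

mu = np.sum(p * support)
analytic = (support - mu) ** 2 - variance_functional(p)
print(numeric, analytic)
```

The numerical and analytic curves agree up to floating-point error, and both have mean zero under $\mathbb{P}$ , as an influence curve must.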

Appendix E Additional Technical Results

E.1 Sequential regression formulation

The efficient influence function derived in the previous subsection involves pseudo-regression functions $m_{s}$ . To avoid complicated conditional density estimation, as suggested by Kennedy (2019), one may formulate a series of sequential regressions for $m_{s}$ , as described in the subsequent remark.

Remark 6.

From the definition of $m_{s}$ , it immediately follows that

[TABLE]

Hence, we can find an equivalent form of the functions $m_{s}(\cdot)$ in Theorem 4.1 as the following recursive regression:

[TABLE]

for $s=1,...,t-1$ , where we use the shorthand notation $m_{s+1}(H_{s+1},a_{s+1},1)=m_{s+1}(H_{s+1},A_{s+1}=a_{s+1},R_{s+2}=1)$ and $m_{t}(H_{t},A_{t},1)=\mu(H_{t},A_{t},R_{t+1}=1)$ .

The above sequential regression form is practically useful since it allows us to bypass all conditional density estimation and instead use regression methods that are more readily available in statistical software.
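The recursion can be sketched as a backward loop over timepoints. The code below is a simplified illustration, not the authors' implementation: it assumes that each step regresses the $Q$-averaged next-step function on $(H_{s},A_{s})$ among subjects who have not dropped out, that the incremental shift takes the form $q(1\mid h)=\delta\pi/(\delta\pi+1-\pi)$ with known $\pi$ , and it uses ordinary least squares as a stand-in for a flexible regression method. All function and argument names are hypothetical.

```python
import numpy as np

def shifted_propensity(pi, delta):
    """Assumed incremental shift: q(1 | h) = delta*pi / (delta*pi + 1 - pi)."""
    return delta * pi / (delta * pi + 1.0 - pi)

def ols_fit(features, target):
    """Least-squares fit with intercept (a stand-in for any regression method)."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return lambda new: np.column_stack([np.ones(len(new)), new]) @ coef

def sequential_regression(X, A, R_next, Y, pi, delta):
    """Backward recursion for m_t, ..., m_1, then average to estimate m_0.

    X, A   : (n, t) covariates and binary treatments (column s is time s + 1).
    R_next : (n, t); R_next[:, s] = 1 if the subject is observed after time s + 1.
    Y      : (n,) outcome at time t (any finite filler where unobserved).
    pi     : (n, t); pi[:, s] = P(A = 1 | history) at time s + 1.
    """
    n, t = A.shape
    q_one = shifted_propensity(pi, delta)
    pseudo = np.asarray(Y, dtype=float)
    for s in range(t - 1, -1, -1):
        history = np.column_stack([X[:, : s + 1], A[:, :s]])
        keep = R_next[:, s] == 1            # regress among non-dropouts only
        predict = ols_fit(np.column_stack([history, A[:, s]])[keep], pseudo[keep])
        m_treated = predict(np.column_stack([history, np.ones(n)]))
        m_control = predict(np.column_stack([history, np.zeros(n)]))
        # Average the fitted regression over the shifted treatment distribution.
        pseudo = q_one[:, s] * m_treated + (1.0 - q_one[:, s]) * m_control
    return pseudo.mean()

# Toy check: full factorial in (X1, A1, X2, A2), Y = A1 + A2, no dropout.
grid = np.array([[x1, a1, x2, a2] for x1 in (0, 1) for a1 in (0, 1)
                 for x2 in (0, 1) for a2 in (0, 1)], dtype=float)
X, A = grid[:, [0, 2]], grid[:, [1, 3]]
Y = A.sum(axis=1)
estimate = sequential_regression(X, A, np.ones_like(A), Y,
                                 np.full(A.shape, 0.5), delta=3.0)
print(estimate)
```

In this toy design the target equals $2\delta/(\delta+1)=1.5$ for $\delta=3$ , which the linear fit recovers up to floating-point error. With real data, the propensity scores would themselves be estimated and the regression replaced by a flexible learner.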

E.2 EIF for $T=1$

In the next corollary we provide the efficient influence function for the incremental effect for a single timepoint study ( $T=1$ ) whose identifying expression is given in Corollary 3.1.

Corollary E.1.

When $T=1$ , the efficient influence function for $\psi(\delta)$ in Corollary 3.1 is given by

[TABLE]

where

[TABLE]

and

[TABLE]

which is the uncentered efficient influence function for $\mathbb{E}[\mu(X,a,1)]$ .

The efficient influence function for the point exposure case has a simpler and more intuitive form. As stated in Corollary E.1, it is a weighted average of the two efficient influence functions $\phi_{0,R=1},\phi_{1,R=1}$ , plus a contribution term due to unknown propensity scores.

Appendix F Proofs

F.1 Lemma for the identifying expression in Theorem 3.1

Without assumptions (A2-M) and (A3), our target parameter $\psi_{t}(\delta)=\mathbb{E}\left(Y_{t}^{\overline{Q}_{t}(\delta)}\right)$ would not be identified. The following lemma extends Theorem 1 in Kennedy (2019) to our setting.

Lemma F.1.

Under (A2-M) and (A3), and for all $t\leq T$ , we have the following identities:

a.

$d\mathbb{P}(A_{t}|H_{t})=d\mathbb{P}(A_{t}|H_{t},R_{t}=1)$

b.

$d\mathbb{P}(Y_{t-1},X_{t}|A_{t-1},H_{t-1})=d\mathbb{P}(Y_{t-1},X_{t}|A_{t-1},H_{t-1},R_{t}=1)$

c.

$\mathbb{E}[Y_{t}|{H}_{t},{A}_{t}]=\mathbb{E}[Y_{t}|{H}_{t},{A}_{t},R_{t+1}=1]$

Proof.

a.

$\bm{d\mathbb{P}(A_{t}|H_{t})=d\mathbb{P}(A_{t}|H_{t},R_{t}=1)}$ By abuse of notation, for $s<t$ , here we let $\underline{X}_{s}$ , $\underline{A}_{s}$ represent $(X_{s},...,X_{t})$ , $(A_{s},...,A_{t})$ respectively, and $\underline{Y}_{s-1}$ represent $(Y_{s-1},...,Y_{t-1})$ . First note that

[TABLE]

where the first equality follows by definition, the second by definition of conditional probability, the third by assumption (A2-M), the fourth again by definition of conditional probability, the fifth by assumption (A2-M), and the sixth by repeating the same step $t-1$ times. The last expression is obtained by simply rearranging terms using the definition of conditional probability.

Now we let

[TABLE]

so we can write $d\mathbb{P}(A_{t},H_{t})=d\mathbb{P}(\overline{X}_{t},\overline{A}_{t},\overline{Y}_{t-1},R_{t}=1){{\bm{\Pi}}}_{\mathbb{P}}(t-1)$ .

Then, similarly we have

[TABLE]

Hence, finally we obtain

[TABLE]

where the second equality comes from the above results. The proof naturally leads to $\bm{dQ_{t}(A_{t}|H_{t})=dQ_{t}(A_{t}|H_{t},R_{t}=1)}$ .

b.

$\bm{d\mathbb{P}(Y_{t-1},X_{t}|A_{t-1},H_{t-1})=d\mathbb{P}(Y_{t-1},X_{t}|A_{t-1},H_{t-1},R_{t}=1)}$

By definition $d\mathbb{P}(Y_{t-1},X_{t}|A_{t-1},H_{t-1})=d\mathbb{P}(H_{t})/d\mathbb{P}(A_{t-1},H_{t-1})$ , and from part a) it immediately follows that

[TABLE]

Hence, we have

[TABLE]

which yields the desired result.

c.

$\bm{\mathbb{E}[Y_{t}|{H}_{t},{A}_{t}]=\mathbb{E}[Y_{t}|{H}_{t},{A}_{t},R_{t+1}=1]}$

By definition $\mathbb{E}[Y_{t}|{H}_{t},{A}_{t}]=\int yd\mathbb{P}(Y_{t}=y|{H}_{t},{A}_{t}),$ and thereby it suffices to show that $d\mathbb{P}(Y_{t}|{H}_{t},{A}_{t})=d\mathbb{P}(Y_{t}|{H}_{t},{A}_{t},R_{t+1}=1)$ .

By the same logic we used for the first proof, we have

[TABLE]

and also

[TABLE]

Hence, by Assumption (A2-M) we have that

[TABLE]

∎

Following the exact same logic used in the proof of Kennedy (2019, Theorem 1), under Assumptions A1 and A2-E, for all $s<t$ we have the recursion formula

[TABLE]

Applying the above $t$ times leads to

[TABLE]

Finally, Assumption A1 and Lemma F.1 give

[TABLE]

F.2 Proof of Theorem 4.1

F.2.1 Identifying expression for the efficient influence function

In the next lemma, we provide an identifying expression for the efficient influence function for our incremental effect $\psi_{t}(\delta)$ under a nonparametric model, which allows the data-generating process $\mathbb{P}$ to be infinite-dimensional.

Lemma F.2.

Define

[TABLE]

for $s=0,...,{t}-1$ , $\forall t\leq T$ , where we write $\mathcal{R}_{s}=(\overline{\mathcal{X}}_{t}\times\overline{\mathcal{A}}_{t})\setminus(\overline{\mathcal{X}}_{s}\times\overline{\mathcal{A}}_{s})$ and $\mu(h_{t},a_{t},R_{t+1}=1)=\mathbb{E}(Y_{t}\mid H_{t}=h_{t},A_{t}=a_{t},R_{t+1}=1)$ . For $s=t$ and $s=t+1$ , we set $m_{t}(\cdot)=\mu(h_{t},a_{t},R_{t+1}=1)$ and $m_{t+1}(\cdot)=Y_{t}$ . Moreover, let $\frac{\mathbbm{1}(H_{s}=h_{s},R_{s}=1)}{d\mathbb{P}(h_{s},R_{s}=1)}\phi_{s}(H_{s},A_{s},R_{s}=1;a_{s})$ denote the efficient influence function for $dQ_{s}(a_{s}|h_{s},R_{s}=1)$ .

Then, the efficient influence function for $m_{0}=\psi_{t}(\delta)$ is given by

[TABLE]

where we define $dQ_{t+1}=1$ , $m_{t+1}(\cdot)=Y_{t}$ , and $dQ_{0}(a_{0}|h_{0})/d\mathbb{P}(a_{0}|h_{0})=1$ , and $\nu$ is a dominating measure for the distribution of $A_{s}$ .

The proof of Lemma F.2 involves deriving the efficient influence function for more general stochastic interventions that depend on both the observational propensity scores and the right-censoring process. We begin by presenting the following three additional lemmas.

Lemma F.3 (Kennedy (2019)).

For all $t$ , the efficient influence function for

[TABLE]

which is defined in (2) is given by $\frac{\mathbbm{1}(H_{t}=h_{t},R_{t}=1)}{d\mathbb{P}(h_{t},R_{t}=1)}\phi_{t}(H_{t},A_{t},R_{t}=1;a_{t})$ , where $\phi_{t}(H_{t},A_{t},R_{t}=1;a_{t})$ equals

[TABLE]

where $\pi_{t}(h_{t})=\mathbb{P}(A_{t}=1\mid H_{t}=h_{t},R_{t}=1)$ .

Lemma F.4.

Suppose $\overline{Q}_{T}$ does not depend on $\mathbb{P}$ . Recall that for all $t\leq T$ ,

[TABLE]

for $s=0,...,{t}-1$ , where we write $\mathcal{R}_{s}=(\overline{\mathcal{X}}_{t}\times\overline{\mathcal{A}}_{t})\setminus(\overline{\mathcal{X}}_{s}\times\overline{\mathcal{A}}_{s})$ and $\mu(h_{t},a_{t},R_{t+1}=1)=\mathbb{E}(Y_{t}\mid H_{t}=h_{t},A_{t}=a_{t},R_{t+1}=1)$ . Note that from the definition of $m_{s}$ it immediately follows that $m_{s}=\int_{\mathcal{X}_{s}\times\mathcal{A}_{s}}m_{s+1}dQ_{s+1}(a_{s+1}\mid h_{s+1},R_{s+1}=1)d\mathbb{P}(x_{s+1}|h_{s},a_{s},R_{s+1}=1)$ .

Now the efficient influence function for $\psi^{*}(\overline{Q}_{t})=m_{0}$ is

[TABLE]

where we define $dQ_{t+1}=1$ , $m_{t+1}(\cdot)=Y_{t}$ , and $dQ_{0}(a_{0}|h_{0})/d\mathbb{P}(a_{0}|h_{0})=1$ .

Lemma F.5.

Suppose $\overline{Q}_{T}$ depends on $\mathbb{P}$ and let $\frac{\mathbbm{1}(H_{t}=h_{t},R_{t}=1)}{d\mathbb{P}(h_{t},R_{t}=1)}\phi_{t}(H_{t},A_{t},R_{t}=1;a_{t})$ denote the efficient influence function for $dQ_{t}(a_{t}|h_{t},R_{t}=1)$ defined in Lemma F.3 for all $t$ . Then the efficient influence function for $\psi_{t}(\delta)$ is given as

[TABLE]

where $\varphi^{*}(\overline{Q}_{t})$ is the efficient influence function from Lemma F.4 and $\nu$ is a dominating measure for the distribution of $A_{s}$ .

The proofs of Lemmas F.3, F.4 and F.5 are essentially a series of chain rules, after specifying efficient influence functions for terms that appear repeatedly. We provide a brief sketch of the proofs of Lemmas F.4 and F.5 below, which can easily be extended to full proofs. This sketch could also be useful for developing other results for more general stochastic interventions.

Proof of Lemma F.4 and Lemma F.5

Let $\mathcal{IF}:\psi\rightarrow\phi$ denote a map to the efficient influence function $\phi$ for a functional $\psi$ . First, without proof, we specify the efficient influence functions for the mean and the conditional mean, which serve as the two basic ingredients for our proof. For the mean value of a random variable $Z$ , we have

[TABLE]

and for conditional mean with a pair of random variables $(X,Y)\sim\mathbb{P}$ when $X$ is discrete, we have

[TABLE]

These results can be obtained by applying (8) or (11).
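For intuition, the two ingredients can be evaluated empirically. The sketch below assumes they take the standard forms $Z-\mathbb{E}[Z]$ for the mean and $\frac{\mathbbm{1}(X=x)}{\mathbb{P}(X=x)}\{Y-\mathbb{E}[Y\mid X=x]\}$ for the conditional mean with discrete $X$ , consistent with how they are used in part A) of the proof below. The function names and the toy data are hypothetical.

```python
import numpy as np

def if_mean(z):
    """Empirical influence values for the mean: Z - E[Z]."""
    z = np.asarray(z, dtype=float)
    return z - z.mean()

def if_conditional_mean(x, y, x0):
    """Empirical influence values for E[Y | X = x0] with discrete X:
    1(X = x0) / P(X = x0) * (Y - E[Y | X = x0])."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    in_cell = (x == x0)
    p_x0 = in_cell.mean()          # empirical P(X = x0)
    mu_x0 = y[in_cell].mean()      # empirical E[Y | X = x0]
    return in_cell / p_x0 * (y - mu_x0)

# Toy data: two observations in each cell of a binary X.
values = if_conditional_mean([0, 0, 1, 1], [1.0, 3.0, 5.0, 9.0], x0=0)
print(values, if_mean([1.0, 2.0, 6.0]))
```

Both sets of influence values average to zero over the sample, reflecting the mean-zero property required of an influence function.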

Proof.

It is sufficient to prove the result for $t=2$ , since it is straightforward to extend the proof to arbitrary $t\leq T$ by induction. For $t=2$ , it is enough to compute the following four terms.

A)

$\begin{aligned} &\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mathcal{IF}\Big{(}\mu(h_{2},a_{2},R_{3}=1)\Big{)}\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)d\mathbb{P}(y_{s-1},x_{s}|h_{s-1},a_{s-1},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\frac{\mathbbm{1}\{(H_{2},A_{2},R_{3})=(h_{2},a_{2},1)\}}{d\mathbb{P}(h_{2},a_{2},R_{3}=1)}\Big{\{}Y-\mu(h_{2},a_{2},R_{3}=1)\Big{\}}\\ &\qquad\qquad\ \times\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)d\mathbb{P}(y_{s-1},x_{s}|h_{s-1},a_{s-1},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mathbbm{1}\big{\{}(H_{2},A_{2},R_{3})=(h_{2},a_{2},1)\big{\}}\big{\{}Y-\mu(h_{2},a_{2},R_{3}=1)\big{\}}\\ &\qquad\qquad\ \times\prod_{s=1}^{2}\frac{dQ_{s}(a_{s}\mid h_{s},R_{s}=1)}{d\mathbb{P}(a_{s}\mid h_{s},R_{s}=1)}\frac{1}{d\mathbb{P}(R_{s+1}=1\mid h_{s},a_{s},R_{s}=1)}\\ &=\{Y-\mu(H_{2},A_{2},R_{3}=1)\}\mathbbm{1}(R_{3}=1)\prod_{s=1}^{2}\frac{dQ_{t}(A_{s}\mid H_{s},R_{s}=1)}{d\mathbb{P}(A_{s}\mid H_{s},R_{s}=1)}\frac{1}{d\mathbb{P}(R_{s+1}=1\mid H_{s},A_{s},R_{s}=1)}\end{aligned}$

B)

$\begin{aligned} &\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)\mathcal{IF}\Big{(}d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\Big{)}d\mathbb{P}(h_{1})\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)\frac{\mathbbm{1}\big{\{}(H_{1},A_{1},R_{2})=(h_{1},a_{1},1)\big{\}}}{d\mathbb{P}(h_{1},a_{1},R_{2}=1)}\\ &\qquad\qquad\ \times\Big{\{}\mathbbm{1}(Y_{1}=y_{1},X_{2}=x_{2})-d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\Big{\}}d\mathbb{P}(h_{1})\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)\\ &\qquad\qquad\ \times\frac{\mathbbm{1}\big{\{}(H_{1},A_{1},R_{2})=(h_{1},a_{1},1)\big{\}}\big{\{}\mathbbm{1}(Y_{1}=y_{1},X_{2}=x_{2})-d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\big{\}}}{d\mathbb{P}(R_{2}=1|h_{1},a_{1})d\mathbb{P}(a_{1}|h_{1})d\mathbb{P}(h_{1})}\\ &\qquad\qquad\ \times d\mathbb{P}(h_{1})\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid h_{2},R_{2}=1)\mathbbm{1}\big{\{}(H_{1},A_{1},R_{2})=(h_{1},a_{1},1)\big{\}}\\ &\qquad\qquad\ \times\big{\{}\mathbbm{1}(Y_{1}=y_{1},X_{2}=x_{2})-d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\big{\}}\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\\ &=\Bigg{\{}\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{2}}\mu(H_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid H_{2},R_{2}=1)\\ &\qquad-\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{1}\times\mathcal{A}_{1}}\mu(h_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid h_{2},R_{2}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\Bigg{\}}\\ &\qquad\times\mathbbm{1}(R_{2}=1)\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\\ &=\Bigg{\{}\int_{\mathcal{A}_{2}}\mu(H_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid 
H_{2},R_{2}=1)-m_{1}(h_{1},a_{1},R_{2}=1)\Bigg{\}}\\ &\qquad\times\mathbbm{1}(R_{2}=1)\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\\ \end{aligned}$

C)

$\begin{aligned} &\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\mathcal{IF}\Big{(}d\mathbb{P}(h_{1})\Big{)}\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\big{\{}\mathbbm{1}(X_{1}=x_{1})-d\mathbb{P}(x_{1})\big{\}}\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{1}}\mu(h_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid h_{2},R_{2}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)dQ_{1}(a_{1}|h_{1})-m_{0}\\ &=\int_{\mathcal{A}_{1}}m_{1}(h_{1},a_{1},R_{2}=1)dQ_{1}(a_{1}|h_{1})-m_{0}\\ \end{aligned}$

D)

Let $\frac{\mathbbm{1}(H_{t}=h_{t},R_{t}=1)}{d\mathbb{P}(h_{t},R_{t}=1)}\phi_{t}(H_{t},A_{t},R_{t}=1;a_{t})$ denote the efficient influence function for $dQ_{t}(a_{t}|h_{t},R_{t}=1)$ as given in Lemma F.3. Then we have

$\begin{aligned} &\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(h_{1})d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\mathcal{IF}\Big{(}dQ_{1}(a_{1}|h_{1})dQ_{2}(a_{2}\mid h_{2},R_{2}=1)\Big{)}\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(h_{1})d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\frac{\mathbbm{1}\big{\{}(H_{2},R_{2})=(h_{2},1)\big{\}}}{d\mathbb{P}(h_{2},R_{2}=1)}\phi_{2}dQ_{1}(a_{1}|h_{1})\\ &\quad+\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(h_{1})d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\frac{\mathbbm{1}\big{\{}(H_{1}=h_{1})\big{\}}}{d\mathbb{P}(h_{1})}\phi_{1}dQ_{2}(a_{2}\mid h_{2},R_{2}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)\frac{\mathbbm{1}\big{\{}(H_{2},R_{2})=(h_{2},1)\big{\}}d\mathbb{P}(h_{1})d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)dQ_{1}(a_{1}|h_{1})}{d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)d\mathbb{P}(R_{2}=1|h_{1},a_{1})d\mathbb{P}(a_{1}|h_{1})d\mathbb{P}(h_{1})}\phi_{2}\\ &\quad+\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\mathbbm{1}\big{\{}(H_{1}=h_{1})\big{\}}\phi_{1}dQ_{2}(a_{2}\mid h_{2},R_{2}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{2}}\mu(H_{2},a_{2},R_{3}=1)\mathbbm{1}(R_{2}=1)\phi_{2}\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\\ &\quad+\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{1}}\mu(h_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid h_{2},R_{2}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\phi_{1}\\ &=\left\{\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\right\}\int_{\mathcal{A}_{2}}\mu(H_{2},a_{2},R_{3}=1)\phi_{2}d\nu(a_{2})\mathbbm{1}(R_{2}=1)\\ &\quad+\int_{\mathcal{A}_{1}}m_{1}(h_{1},a_{1},R_{2}=1)\phi_{1}d\nu(a_{1})\end{aligned}$

Note that we have set $dQ_{0}(a_{0}|h_{0})/d\mathbb{P}(a_{0}|h_{0})=1$ , and that we have $d\mathbb{P}(R_{1}=1)=1$ and $\mathbbm{1}(R_{1}=1)=1$ by construction. Hence, putting parts A), B), and C) together proves Lemma F.4, and part D) proves Lemma F.5.

Note that to formally verify that the expressions given in Lemmas F.4 and F.5 are the efficient influence functions, we would need to check that the pathwise differentiability formula (8) holds. This essentially follows if the remainder terms are second-order, which will be verified in Lemmas F.7 and F.8 later. ∎

Finally, we are ready to give a proof of Theorem 4.1. In fact, it is nothing but rearranging terms in the given efficient influence function.

F.2.2 Proof of Theorem 4.1

Proof.

First, we define the following shorthand notation for the proof: for all $s\leq t$

[TABLE]

With these notations we can rewrite the result of Lemma F.4 as below.

[TABLE]

Now, by the results of Lemmas F.4 and F.5, we can represent the efficient influence function for $\psi_{t}(\delta)$ as

[TABLE]

On the other hand, we have

[TABLE]

which leads to

[TABLE]

After some rearrangement, we finally obtain an equivalent form of the efficient influence function for $\psi_{t}(\delta)$ by

[TABLE]

Note that we use the convention that $dQ_{0}=d\mathbb{P}_{0}=d\omega_{0}=1$ and $R_{1}=1$ . ∎

F.3 Proof of Theorem 6.1

Let $\widehat{\psi}_{c.ipw}(\overline{a^{\prime}}_{T})$ denote the standard IPW estimator of a classical deterministic intervention effect $\mathbb{E}\left[Y^{\overline{a^{\prime}}_{T}}\right]$ under the i.i.d. assumption, i.e.,

[TABLE]

Hence $\widehat{\psi}_{c.ipw}(\overline{\bm{1}})$ is equivalent to $\widehat{\psi}_{at}$ in the main text. Now by definition we have

[TABLE]

where $\mathbb{V}_{c.ipw.1}(\overline{a^{\prime}}_{T})$ and $\mathbb{V}_{c.ipw.2}(\overline{a^{\prime}}_{T})$ are simply the first and second term in the first line of the expansion respectively.
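For reference, the classical estimator can be sketched as follows, assuming it takes the usual form $\mathbb{P}_{n}\big[\prod_{t=1}^{T}\frac{\mathbbm{1}(A_{t}=a^{\prime}_{t})}{\pi_{t}(a^{\prime}_{t}\mid H_{t})}Y\big]$ with known propensity scores. The function name, argument layout, and toy data are hypothetical.

```python
import numpy as np

def ipw_deterministic(A, Y, pi_one, target):
    """IPW estimate of E[Y^{a'}] for a fixed treatment sequence a'.

    A       : (n, T) observed binary treatments.
    Y       : (n,) outcomes.
    pi_one  : (n, T) propensity scores P(A_t = 1 | H_t).
    target  : length-T sequence of 0/1 values (the sequence a').
    """
    A = np.asarray(A)
    Y = np.asarray(Y, dtype=float)
    pi_one = np.asarray(pi_one, dtype=float)
    target = np.asarray(target)
    # pi_t(a'_t | H_t): probability of the target treatment at each time.
    prob_target = np.where(target == 1, pi_one, 1.0 - pi_one)
    follows = np.all(A == target, axis=1)       # 1(A_1 = a'_1, ..., A_T = a'_T)
    weights = follows / prob_target.prod(axis=1)
    return np.mean(weights * Y)

# Toy data with T = 2 and constant propensity p = 0.5.
A_obs = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
Y_obs = np.array([4.0, 3.0, 2.0, 1.0])
always_treated = ipw_deterministic(A_obs, Y_obs, np.full((4, 2), 0.5), [1, 1])
print(always_treated)
```

Only units that follow the target sequence at every timepoint contribute, each with weight $\prod_{t}1/\pi_{t}$ , which is the source of the $(1/p)^{T}$ growth analyzed in the remainder of the proof.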

By the same procedure used to derive the g-formula (Robins, 1986), it is easy to see that

[TABLE]

where $\mathcal{X}=\mathcal{X}_{1}\times\cdots\times\mathcal{X}_{T}$ . The above result follows by iterated expectation conditioning on $\overline{X}_{t}$ , then another iterated expectation conditioning on $H_{t}$ , followed by the fact that $\mathbb{E}\left[\frac{\mathbbm{1}\left({A}_{t}={a^{\prime}}_{t}\right)}{\pi_{t}(a^{\prime}_{t}|H_{t})}\big{|}H_{t}\right]=1$ for all $t$ . We repeat this process $T$ times, starting from $t=T$ all the way through $t=1$ .

Likewise, for $\widehat{\psi}_{inc}$ we have

[TABLE]

For the first term $\mathbb{V}_{inc.1}$ , observe that

[TABLE]

where we apply the law of total expectation in the first equality and the law of total probability in the second.

After repeating the same process $T-1$ times, for $t=T-1,...,1$ , we obtain $2^{T}$ terms in the end, each of which corresponds to a distinct treatment sequence $\overline{A}_{T}=\overline{a}_{T}$ . Hence, we eventually have

[TABLE]

Recall that we assume $\pi_{t}(H_{t})=p$ for all $t$ as stated in Theorem 6.1. Hence we can write $\pi_{t}(a_{t}\mid H_{t})$ as $\pi_{t}(a_{t})=\mathbbm{1}\left({a}_{t}=1\right)p+\mathbbm{1}\left({a}_{t}=0\right)\{1-p\}$ .

We want to find an upper bound on the variance ratio $\text{VR}(\widehat{\psi}_{c.ipw}(\overline{a}_{T}),\widehat{\psi}_{inc})\coloneqq\frac{\mathbb{V}_{inc.1}-\mathbb{V}_{inc.2}}{\mathbb{V}_{c.ipw.1}(\overline{a}_{T})-\mathbb{V}_{c.ipw.2}(\overline{a}_{T})}$ for the always-treated unit (i.e., $\overline{a}_{T}=\overline{\bm{1}}$ ). This can be done by computing the quantity

[TABLE]

since $0<\mathbb{V}_{inc.2}<\mathbb{V}_{inc.1}$ by Jensen’s inequality.

Note that we have

[TABLE]

and, under the given boundedness assumption, the ratio of the second term to the first becomes negligible quickly (at least exponentially fast) as $t$ increases. Hence we can write

[TABLE]

for some constant $c$ such that $\frac{1}{1-\mathbb{V}_{c.ipw.2}(\overline{\bm{1}})/\mathbb{V}_{c.ipw.1}(\overline{\bm{1}})}=\frac{1}{1-p^{T}{\left(\mathbb{E}\left[Y^{\overline{\bm{1}}}\right]\right)^{2}}\big{/}{\mathbb{E}\left[\left(Y^{\overline{\bm{1}}}\right)^{2}\right]}}\leq{c}$ . Note that in our setting in which we have an infinitely large value of $T$ , $c$ can be almost any constant greater than one.

Putting the above ingredients together, for sufficiently large $t$ it follows that

[TABLE]

where we have

[TABLE]

where the first equality follows from the fact that $\mathbb{V}_{inc.1}=\sum_{\overline{a}_{T}\in\overline{\mathcal{A}}_{T}}w(\overline{a}_{T};\delta,p)\mathbb{V}_{c.ipw.1}(\overline{a}_{T})$ derived in the proof of the first part, the second equality from the fact that $\mathbb{V}_{c.ipw.1}(\overline{a}_{T})=\prod_{t=1}^{T}\frac{1}{\pi_{t}(a_{t})}\mathbb{E}\left[\left(Y^{2}\right)^{\overline{a}_{T}}\right]$ , the first inequality from the definition of $w(\overline{a}_{T};\delta,p)$ and the given boundedness assumption, and the last equality from the binomial theorem. Therefore we obtain the upper bound as

[TABLE]
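To get a feel for the rate, the sketch below evaluates the per-timepoint factor implied by the two facts used in this step, ignoring the constant $C_{T}$ and the correction term. Summing $w(\overline{a}_{T};\delta,p)\prod_{t}1/\pi_{t}(a_{t})$ over all sequences and dividing by $(1/p)^{T}$ gives $\big[(\delta^{2}p^{2}+p(1-p))/(\delta p+1-p)^{2}\big]^{T}$ ; this closed form is our own reading of the derivation and should be treated as an assumption, as should the function names.

```python
def variance_ratio_factor(delta, p):
    """Per-timepoint factor (delta^2 p^2 + p(1 - p)) / (delta p + 1 - p)^2."""
    return (delta**2 * p**2 + p * (1.0 - p)) / (delta * p + 1.0 - p) ** 2

def variance_ratio_rate(delta, p, T):
    """Leading-order rate of the upper bound on the variance ratio after T timepoints."""
    return variance_ratio_factor(delta, p) ** T

# Hypothetical values: delta = 2, p = 0.5.
for horizon in (1, 10, 50):
    print(horizon, variance_ratio_rate(2.0, 0.5, horizon))
```

When $\delta=1$ the factor reduces to $p$ , and for $\delta\geq 1$ it stays below one, so the bound on the variance ratio shrinks geometrically in $T$ .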

Next for the lower bound, first we note that

[TABLE]

where the first equality follows by definition, the second equality by exactly the same process used to find the expression for $\mathbb{V}_{inc.1}$ , the first inequality by the boundedness assumption, and the third equality by the binomial theorem.

However, we already know that

[TABLE]

Hence putting these together we conclude

[TABLE]

At this point, we have obtained upper and lower bounds for $\text{VR}(\widehat{\psi}_{c.ipw}(\overline{\bm{1}}),\widehat{\psi}_{inc})$ , which yields the result of part $ii)$ with $C_{T}=\frac{b_{u}^{2}}{\mathbb{E}\left[\left(Y^{\overline{\bm{1}}}\right)^{2}\right]}$ .

The proof for the case of $\overline{a^{\prime}}_{T}=\overline{\bm{0}}$ (never-treated unit) follows almost the same steps as the case of $\overline{a^{\prime}}_{T}=\overline{\bm{1}}$ , except for the rearrangement of terms due to replacing $\left(\frac{1}{p}\right)^{T}$ by $\left(\frac{1}{1-p}\right)^{T}$ and so on. In fact, due to the generality of our proof structure, the exact same logic used for $\widehat{\psi}_{c.ipw}(\overline{\bm{1}})$ also applies to $\widehat{\psi}_{c.ipw}(\overline{\bm{0}})$ (and to $\widehat{\psi}_{c.ipw}(\overline{a^{\prime}}_{T})$ for all $\overline{a^{\prime}}_{T}\in\overline{\mathcal{A}_{T}}$ ). We present the result without proof below.

[TABLE]

where we define $C_{T}^{\prime}=\frac{b_{u}^{2}}{\mathbb{E}\left[\left(Y^{2}\right)^{\overline{\bm{0}}}\right]}$ and $\zeta^{\prime}(T;p)=\left(1+\frac{c\left(\mathbb{E}\left[Y^{\overline{\bm{1}}}\right]\right)^{2}}{\left(1/(1-p)\right)^{T}\mathbb{E}\left[\left(Y^{\overline{\bm{1}}}\right)^{2}\right]}\right)$ .

F.4 Proof of Corollary 6.1

We first state Lemma F.6, which is the key to proving Corollary 6.1.

Lemma F.6.

Assume that $\pi_{t}(H_{t})=p$ for all $1\leq t\leq T$ , where $0<p<1$ . Then we have the following variance decomposition:

[TABLE]

where, for every $\overline{a}_{T}\in\overline{\mathcal{A}}_{T}$ , the weight $w$ is defined by

[TABLE]

Proof.

From the last display for $\mathbb{V}_{inc.1}$ , we have that

[TABLE]

where we let weight $w(\overline{a}_{T};\delta,p)$ denote the product term $\prod_{t=1}^{T}\frac{\pi_{t}(a_{t})\left(\mathbbm{1}\left({a}_{t}=1\right)\delta^{2}p+\mathbbm{1}\left({a}_{t}=0\right)\{1-p\}\right)}{(\delta{\pi}_{t}(H_{t})+1-{\pi}_{t}(H_{t}))^{2}}$ .

Next, we observe that

[TABLE]

where we have decomposed $\mathbb{V}_{inc.2}$ into $2^{T}\times 2^{T}$ terms by defining $v_{inc.2}(\overline{A}_{T};\overline{a}_{T})$ by

[TABLE]

Then, for fixed $\overline{a}_{T}$ , it is straightforward to see that

[TABLE]

Putting this together, we obtain

[TABLE]

For the second term in the last display, note that

[TABLE]

where the last equality follows by the fact that

[TABLE]

Hence we finally conclude that

[TABLE]

∎

Note that in Lemma F.6 the weight $w(\overline{a}_{T};\delta,p)$ decays to zero exponentially and monotonically in $T$ for every $\overline{a}_{T}\in\overline{\mathcal{A}}_{T}$ .
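The behaviour of the weight can be checked numerically. The following is a minimal sketch, not part of the original proof: it evaluates $w(\overline{a}_{T};\delta,p)$ from its product form under the constant-propensity assumption $\pi_{t}(H_{t})=p$, and confirms that every weight is below one, that $\sum_{\overline{a}_{T}}\sqrt{w(\overline{a}_{T};\delta,p)}=1$, and that $\sum_{\overline{a}_{T}}w(\overline{a}_{T};\delta,p)\prod_{t}1/\pi_{t}(a_{t})$ equals $\left[\frac{\delta^{2}p+1-p}{(\delta p+1-p)^{2}}\right]^{T}$. The function names and the values of `delta`, `p`, and `T` are illustrative assumptions.

```python
import itertools
import math


def weight(a_bar, delta, p):
    """w(a_bar; delta, p) in product form, assuming pi_t(H_t) = p for all t."""
    denom = (delta * p + 1 - p) ** 2
    w = 1.0
    for a in a_bar:
        pi_a = p if a == 1 else 1 - p              # pi_t(a_t)
        num = delta**2 * p if a == 1 else 1 - p    # indicator term in the numerator
        w *= pi_a * num / denom
    return w


def inverse_propensity(a_bar, p):
    """prod_t 1 / pi_t(a_t), the factor appearing in V_{c.ipw.1}(a_bar)."""
    return math.prod(1 / (p if a == 1 else 1 - p) for a in a_bar)


# Illustrative values only (assumptions, not taken from the paper).
delta, p, T = 2.0, 0.5, 6
seqs = list(itertools.product([0, 1], repeat=T))

max_w = max(weight(a, delta, p) for a in seqs)
sum_sqrt_w = sum(math.sqrt(weight(a, delta, p)) for a in seqs)
sum_w_ipw = sum(weight(a, delta, p) * inverse_propensity(a, p) for a in seqs)
closed_form = ((delta**2 * p + 1 - p) / (delta * p + 1 - p) ** 2) ** T

print(max_w, sum_sqrt_w, sum_w_ipw, closed_form, (1 / p) ** T)
```

Each per-timepoint factor of $w$ is either $(\delta p)^{2}/(\delta p+1-p)^{2}$ or $(1-p)^{2}/(\delta p+1-p)^{2}$, both strictly less than one, which is why the product shrinks geometrically as $T$ grows.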

Now we show that there always exists $T_{min}$ such that $\text{Var}(\widehat{\psi}_{inc})<\text{Var}(\widehat{\psi}_{c.ipw}(\overline{\bm{1}}))$ for all $T\geq T_{min}$ . Let $\overline{\bm{1}}=[1,...,1]$ . From Lemma F.6 it follows that

[TABLE]

where $c_{\bm{1}}=\frac{\mathbb{E}\left[\left(Y^{\overline{\bm{1}}}\right)^{2}\right]}{b^{2}_{u}}$ , $A(\delta,p)=\sum_{\overline{a}_{T},\overline{a^{\prime}}_{T}\in\overline{\mathcal{A}}_{T}}\sqrt{w(\overline{a}_{T};\delta,p)w(\overline{a^{\prime}}_{T};\delta,p)}\frac{\mathbb{E}\left[Y^{\overline{a}_{T}}\right]}{b_{u}}\frac{\mathbb{E}\big{[}Y^{\overline{a^{\prime}}_{T}}\big{]}}{b_{u}}$ , and $B=\frac{\left(\mathbb{E}\big{[}Y^{{\overline{\bm{1}}}}\big{]}\right)^{2}}{b^{2}_{u}}$ . We note that $|A(\delta,p)|\leq 1$ , $0\leq B\leq 1$ , and $c^{1/T}_{\bm{1}}\rightarrow 1$ as $T\rightarrow\infty$ .

For $\delta>1$ , we have $\frac{\delta^{2}p+1-p}{(\delta p+1-p)^{2}}<\frac{1}{p}$ . Hence, based on the above observations, the last display is strictly less than zero for sufficiently large $T$ . Consequently, we conclude that $\text{Var}(\widehat{\psi}_{inc})-\text{Var}(\widehat{\psi}_{c.ipw}(\overline{\bm{1}}))<0$ for all $T\geq T_{min}$ , which is the result of part $i)$ . Likewise, we reach the same conclusion for $\overline{\bm{0}}_{T}=[0,...,0]$ , namely $\text{Var}(\widehat{\psi}_{inc})-\text{Var}(\widehat{\psi}_{c.ipw}(\overline{\bm{0}}_{T}))<0$ .
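The inequality $\frac{\delta^{2}p+1-p}{(\delta p+1-p)^{2}}<\frac{1}{p}$ can be verified directly. The short check below is added for completeness and uses only $0<p<1$ and $\delta>1$: cross-multiplying and expanding gives

$$
\begin{aligned}
(\delta p+1-p)^{2}-p\left(\delta^{2}p+1-p\right)
&=\delta^{2}p^{2}+2\delta p(1-p)+(1-p)^{2}-\delta^{2}p^{2}-p(1-p)\\
&=(1-p)\left\{1+2p(\delta-1)\right\}>0 ,
\end{aligned}
$$

and dividing both sides by $p(\delta p+1-p)^{2}>0$ yields the claim.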

The value of $T_{min}$ is determined by $\delta$ , $p$ , and the distribution of the counterfactual outcome $Y^{\overline{a}_{T}}$ . One rough upper bound on $T_{min}$ is

[TABLE]

which can be obtained from the last display above and is always finite because $c_{\bm{1}}>0$ by the assumption in the theorem. $T_{min}$ should not be very large for moderately large values of $\delta$ unless $c_{\bm{1}}$ is unreasonably small, since the difference $\frac{1}{p^{T}}-\left[\frac{\delta^{2}p+1-p}{(\delta p+1-p)^{2}}\right]^{T}$ also grows exponentially.
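To give a feel for how quickly this difference grows, the following sketch tabulates $\frac{1}{p^{T}}-\left[\frac{\delta^{2}p+1-p}{(\delta p+1-p)^{2}}\right]^{T}$ over a few horizons. The function names and the chosen values of `delta`, `p`, and `horizons` are illustrative assumptions, and the sketch does not compute the bound on $T_{min}$ itself.

```python
def ipw_base(p):
    """Per-timepoint growth factor 1/p of the always-treated IPW term."""
    return 1.0 / p


def inc_base(delta, p):
    """Per-timepoint factor (delta^2 p + 1 - p) / (delta p + 1 - p)^2."""
    return (delta**2 * p + 1 - p) / (delta * p + 1 - p) ** 2


# Illustrative values only (assumptions, not taken from the paper).
delta, p = 2.0, 0.5
horizons = [1, 5, 10, 20]

gaps = [ipw_base(p) ** T - inc_base(delta, p) ** T for T in horizons]
for T, gap in zip(horizons, gaps):
    print(f"T={T:2d}  gap={gap:.4g}")
```

With these values the incremental factor is $10/9$ per timepoint against $2$ for the always-treated term, so the gap is driven almost entirely by $2^{T}$.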

F.5 Proof of Theorem 5.1

We first define the following notation:

[TABLE]

where $\mathcal{T}=\{1,...,T\}$ , $\mathbb{G}_{n}$ denotes the empirical process on the full sample as usual, $\widetilde{\varphi}(Z;\bm{\eta},\delta,t)=\{\varphi(Z;\bm{\eta},\delta,t)-\psi(t;\delta)\}/\sigma(\delta;t)$ , and $\mathbb{G}$ is a mean-zero Gaussian process with covariance $\mathbb{E}[\mathbb{G}(\delta_{1};t_{1})\mathbb{G}(\delta_{2};t_{2})]=\mathbb{E}\left[\widetilde{\varphi}(Z;\bm{\eta},\delta_{1},t_{1})\widetilde{\varphi}(Z;\bm{\eta},\delta_{2},t_{2})\right]$ , as defined in Theorem 5.1 in the main text.

The proof consists of two parts: in the first part we show $\Psi_{n}(\cdot)\leadsto\mathbb{G}(\cdot)$ in $l^{\infty}(\mathcal{D},\mathcal{T})$ , and in the second we show $\|\widehat{\Psi}_{n}-\Psi_{n}\|_{\mathcal{D},\mathcal{T}}=o_{\mathbb{P}}(1)$ .

Part 1. The first statement follows immediately from the proof of Theorem 3 in Kennedy (2019), who showed that the function class $\mathcal{F}_{\bar{\bm{\eta}}}=\{\varphi(\cdot;\bar{\bm{\eta}},\delta):\delta\in\mathcal{D}\}$ is Lipschitz and thus has a finite bracketing integral for any fixed set of nuisance functions. Theorem 2.5.6 in Van Der Vaart and Wellner (1996) then gives the result. In our case, the function class $\mathcal{F}_{\bar{\bm{\eta}}}=\{\varphi(\cdot;\bar{\bm{\eta}},\delta,t):\delta\in\mathcal{D},t\leq T\}$ is still Lipschitz, since for every $t\in\{1,...,T\}$ we have

[TABLE]

where we use assumptions 1) and 2) in the theorem, and the identification assumption (A3) that there exists a constant $\epsilon_{\omega}$ such that $0<\epsilon_{\omega}<\omega_{t}(h_{t},a_{t})\leq 1$ and thus $\frac{1}{\omega_{t}(h_{t},a_{t})}\leq\frac{1}{\epsilon_{\omega}}$ a.e. [ $\mathbb{P}$ ]. Therefore, every $\varphi(\cdot;\bar{\bm{\eta}},\delta,t)$ is a finite sum of products of Lipschitz functions with bounded $\mathcal{D}$ , and we conclude that $\mathcal{F}_{\bar{\bm{\eta}}}$ is Lipschitz. Hence our function class still has a finite bracketing integral for fixed $\bar{\bm{\eta}}$ and $t$ , which completes the first part of the proof.

Part 2. Let $N=n/K$ be the sample size in any group $k=1,...,K$ , and denote the empirical process over the units in group $k$ by $\mathbb{G}^{k}_{n}=\sqrt{N}(\mathbb{P}^{k}_{n}-\mathbb{P})$ . From the result of Part 1 and the proof of Theorem 3 in Kennedy (2019), we have

[TABLE]

We now analyze the two pieces $B_{n,1}(\delta;t)$ and $B_{n,2}(\delta;t)$ in the last display. That $B_{n,1}(\delta;t)=o_{\mathbb{P}}(1)$ follows by exactly the same steps as in Kennedy (2019). However, the analysis of $B_{n,2}(\delta;t)$ requires extra work.

To analyze $B_{n,2}(\delta;t)$ , we use the same notation as in Kennedy (2019). First, let $\psi(\mathbb{P};Q)$ denote the mean outcome under intervention $Q$ for a population corresponding to the observed data distribution $\mathbb{P}$ . Next, let $\varphi^{*}(z;{\bm{\eta},t})$ denote its centered efficient influence function when $Q$ does not depend on $\mathbb{P}$ , as given in Lemma F.4, and let $\zeta^{*}(z;{\bm{\eta}},t)$ denote the contribution to the efficient influence function $\varphi^{*}(z;{\bm{\eta},t})$ due to estimating $Q$ when it depends on $\mathbb{P}$ , as given in Lemma F.5. Now, by definition,

[TABLE]

and after some rearrangement we obtain

[TABLE]

Although one can relate $\overline{\bm{\eta}}$ to $\widehat{\bm{\eta}}_{-k}$ in the above equation, $\overline{\bm{\eta}}$ can be any set of nuisance functions associated with a new $\overline{\mathbb{P}}$ and $\overline{Q}$ .

Hence, by analyzing the second-order remainder terms of the von Mises expansions for the efficient influence functions given in Lemmas F.4 and F.5, we can evaluate the convergence rate of $B_{n,2}(\delta;t)$ . The following two lemmas analyze those second-order remainder terms in the presence of a dropout process.

Lemma F.7.

Let $\psi({\mathbb{P}};Q)$ be the mean outcome under intervention $Q$ for a population corresponding to the observed data distribution $\mathbb{P}$ , and let $\varphi^{*}(z;{\bm{\eta}},t)$ denote its efficient influence function when $Q$ does not depend on $\mathbb{P}$ for given $t$ , as given in Lemma F.4. For another data distribution $\overline{\mathbb{P}}$ , let $\overline{\bm{\eta}}$ denote the corresponding nuisance functions. Then we have the 1st-order von Mises expansion

[TABLE]

where we define

[TABLE]

Proof.

From Lemma F.4, we have

[TABLE]

where the first equality follows by definition and linearity of expectation; the second by iterated expectation and the equivalence between $\mathbbm{1}(R_{s+1}=1)$ and $\mathbbm{1}(R_{s+1}=1,R_{s}=1)$ (for all $s\leq t$ , the event $\{R_{s}=1\}$ implies $\{R_{s^{\prime}}=1$ for all $s^{\prime}\leq s\}$ by construction); the third by the law of total probability for conditional expectations (for random variables $X,Y,Z$ with $Z$ discrete, $\mathbb{E}[X|Y]=\sum_{z}\mathbb{E}[X|Y,Z=z]\mathbb{P}(Z=z|Y)$ ); and the fourth by the result of Lemma F.1 (i.e., $d\mathbb{P}_{s+1}=d\mathbb{P}(X_{s+1}\mid H_{s},A_{s},R_{s+1}=1)$ ). To obtain the last equality, we first apply iterated expectation conditioning on $(H_{s},R_{s})$ , then apply iterated expectation again conditioning on $(H_{s-1},A_{s-1},R_{s-1})$ , followed by the same steps as in the second, third, and fourth equalities, and repeat this process for $s-2,...,1$ .
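The law of total probability for conditional expectations used in the third equality, $\mathbb{E}[X|Y]=\sum_{z}\mathbb{E}[X|Y,Z=z]\mathbb{P}(Z=z|Y)$ , can be confirmed on a small discrete example. The sketch below is purely illustrative: the joint distribution is randomly generated, and the variables `X`, `Y`, `Z` are generic stand-ins rather than the quantities in the proof.

```python
import itertools
import random

random.seed(0)

# Hypothetical joint pmf over (x, y, z); the values are arbitrary.
support = list(itertools.product([0, 1, 2], [0, 1], [0, 1]))
raw = [random.random() for _ in support]
total = sum(raw)
pmf = {s: r / total for s, r in zip(support, raw)}


def cond_mean_x(y, z=None):
    """E[X | Y=y] when z is None, otherwise E[X | Y=y, Z=z]."""
    keep = [(s, q) for s, q in pmf.items()
            if s[1] == y and (z is None or s[2] == z)]
    mass = sum(q for _, q in keep)
    return sum(s[0] * q for s, q in keep) / mass


def prob_z_given_y(z, y):
    """P(Z=z | Y=y)."""
    num = sum(q for s, q in pmf.items() if s[1] == y and s[2] == z)
    den = sum(q for s, q in pmf.items() if s[1] == y)
    return num / den


# Largest discrepancy between the two sides of the identity over y.
max_gap = max(
    abs(cond_mean_x(y)
        - sum(cond_mean_x(y, z) * prob_z_given_y(z, y) for z in (0, 1)))
    for y in (0, 1)
)
print(max_gap)
```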

From the last expression, we now have

[TABLE]

Note that we use the convention from earlier lemmas that all quantities with negative time indices (e.g., $dQ_{-1}$ ) are set to one. After applying the above process $t-1$ times to the second-to-last term in the last display, we obtain

[TABLE]

By Lemma 5 in Kennedy (2019), it follows that

[TABLE]

Putting all these together, we have

[TABLE]

which yields the formula in Lemma F.7. ∎

Lemma F.8.

Let $\zeta^{*}(z;\overline{\bm{\eta}},t)$ denote the contribution to the efficient influence function $\varphi^{*}(z;{\bm{\eta}},t)$ due to the dependence between $\mathbb{P}$ and $Q$ , as given in Lemma F.5. Then, for two different intervention distributions $Q$ and $\overline{Q}$ whose corresponding densities are $dQ_{s}$ and $d\overline{Q}_{s}$ , respectively, with respect to some dominating measure for $s=1,...,t$ , we have the 1st-order von Mises expansion

[TABLE]

where all the notation is defined as in Lemma F.7.

Proof.

From Lemma 6 in Kennedy (2019) and by Lemma F.1, we have

[TABLE]

Next, for the expected contribution to the influence function due to estimating $Q$ when it depends on $\mathbb{P}$ , we have that

[TABLE]

where the first equality follows by definition, the second by iterated expectation conditioning on $(H_{s},R_{s})$ and averaging over $A_{s}$ , the third by iterated expectation conditioning on $(H_{s-1},A_{s-1},R_{s-1})$ and the law of total probability, and the fifth by repeating the process $t$ times.

We further expand the last expression as

[TABLE]

where the first equality follows by adding and subtracting the second term, and the second by the same steps used in Lemma F.7.

For the last term in the last expression above, it follows that

[TABLE]

Putting all of these together, we finally have

[TABLE]

∎

Finally, the next lemma completes the proof of Theorem 5.1.

Lemma F.9.

The remainders of the von Mises expansions in Lemmas F.7 and F.8 both diminish at a rate of $n^{-\frac{1}{2}}$ uniformly in $\delta$ if

[TABLE]

for all $r\leq s\leq t$ .

Proof.

The remainder term of the von Mises expansion from Lemma F.7 equals

[TABLE]

where we obtain the first inequality simply by adding and subtracting $m_{s}$ .

For the remainder term from Lemma F.8, we first note that, by Lemma F.1 and Lemma 6 of Kennedy (2019),

[TABLE]

Hence, it immediately follows that the remainder term in Lemma F.8 can be bounded by

[TABLE]

Therefore, if we have

[TABLE]

then both remainders from Lemmas F.7 and F.8 diminish at a rate of $n^{-\frac{1}{2}}$ uniformly in $\delta$ . ∎

