TL;DR
This paper extends incremental intervention effects to longitudinal studies with dropout and many timepoints, providing new estimators and theoretical guarantees for efficient inference in complex settings.
Contribution
It generalizes incremental intervention effects to handle multiple outcomes and dropout, deriving identifying expressions and efficient estimators with strong theoretical properties.
Findings
Estimators converge at fast parametric rates.
Incremental effects offer near-exponential gains in precision.
Methods are validated through simulations and applied to a study of low-dose aspirin and pregnancy outcomes.
| n | | | | | | | | | Average Dropouts (%) |
|---|---|---|---|---|---|---|---|---|---|
| | 14.5 | 24.1 | 30.1 | 9.8 | 1.59 | 2.78 | 2.96 | 1.37 | 50.5 |
| 1000 | 12.5 | 14.7 | 19.8 | 8.3 | 1.31 | 1.84 | 2.01 | 1.14 | 28.0 |
| | 10.7 | 11.3 | 9.5 | 7.2 | 1.17 | 1.35 | 1.13 | 0.99 | 8.9 |
| | 10.3 | 12.0 | 23.8 | 7.1 | 1.15 | 1.34 | 1.29 | 1.05 | 49.6 |
| 2500 | 10.2 | 10.9 | 14.1 | 6.2 | 1.06 | 1.19 | 1.06 | 0.95 | 27.5 |
| | 7.8 | 7.5 | 5.3 | 4.5 | 0.94 | 1.03 | 0.93 | 0.91 | 9.1 |
Incremental Intervention Effects in Studies with Dropout and Many Timepoints
Kwangho Kim, Edward H. Kennedy, Ashley I. Naimi
Department of Statistics & Data Science, Machine Learning Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213. Email: [email protected]
Department of Statistics & Data Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213. Email: [email protected]
Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA, USA. Email: [email protected]
Abstract
Modern longitudinal studies collect feature data at many timepoints, often of the same order of sample size. Such studies are typically affected by dropout and positivity violations. We tackle these problems by generalizing effects of recent incremental interventions (which shift propensity scores rather than set treatment values deterministically) to accommodate multiple outcomes and subject dropout. We give an identifying expression for incremental intervention effects when dropout is conditionally ignorable (without requiring treatment positivity), and derive the nonparametric efficiency bound for estimating such effects. Then we present efficient nonparametric estimators, showing that they converge at fast parametric rates and yield uniform inferential guarantees, even when nuisance functions are estimated flexibly at slower rates. We also study the variance ratio of incremental intervention effects relative to more conventional deterministic effects in a novel infinite time horizon setting, where the number of timepoints can grow with sample size, and show that incremental intervention effects yield near-exponential gains in statistical precision in this setup. Finally we conclude with simulations and apply our methods in a study of the effect of low-dose aspirin on pregnancy outcomes.
Implementation of our method is publicly available at https://github.com/kwangho-joshua-kim/Incremental-dropout
Keywords: causal inference, time-varying confounding, right-censoring, longitudinal study, positivity
1 Introduction
Causal inference has long been an important scientific pursuit, and understanding causal relationships is essential across many disciplines. However, for practical and ethical reasons, causal questions cannot always be evaluated via experimental methods (i.e., randomized trials), making observational studies the only viable alternative. Further, when individuals can be exposed to varying treatment levels over time, collecting appropriate longitudinal data is important. To that end, recent technological advancements that facilitate data collection are making longitudinal studies with a very large number of time points (sometimes of the same order of sample size) increasingly common (e.g., Kumar et al., 2013; Eysenbach et al., 2011; Klasnja et al., 2015).
The increase in observational studies with detailed longitudinal data has also introduced numerous statistical challenges that remain unaddressed. For longitudinal causal studies, two analytic frameworks are often invoked: effects of deterministic fixed interventions (Robins, 1986; Robins et al., 2000; Hernán et al., 2000), in which all individuals are assigned to a fixed exposure level over all time-points; and effects of deterministic dynamic interventions (Murphy et al., 2001; Robins, 2004) in which, at each time, treatment is assigned according to a fixed rule that depends on past history. In the real world, fixed deterministic interventions might not be of practical interest since the treatment typically cannot be applied uniformly across a population (Kennedy, 2019).
Generally, deterministic interventions (fixed or dynamic) rely on a positivity assumption, which requires every unit to have a nonzero chance of receiving each of the available treatments at every timepoint. If the positivity assumption is violated, the causal effects of deterministic (fixed or dynamic) interventions are no longer identifiable. Even under positivity, longitudinal studies are especially prone to the curse of dimensionality, since exponentially many samples are needed to learn about all treatment trajectories. These issues only worsen when the number of timepoints or covariates increases. Thus, due to a lack of sufficiently flexible analytic methods for longitudinal data, researchers are often forced to either rely on strong parametric assumptions or forgo the estimation of causal effects altogether (e.g., Kumar et al., 2013).
One strategy to address such issues with deterministic interventions is to consider stochastic interventions that depend on the observational treatment process and thus are random at each timepoint (e.g., van der Laan and Petersen, 2007; Young et al., 2014; Díaz and van der Laan, 2012; Haneuse and Rotnitzky, 2013; Moore et al., 2012). Recently, Kennedy (2019) proposed novel incremental intervention effects, which quantify effects of shifting treatment propensities rather than effects of setting treatment to fixed values. Importantly, incremental effect estimators do not require positivity, and can still achieve $\sqrt{n}$ rates with flexible nonparametric methods. Despite these strengths, the method has not yet been adapted to general longitudinal studies with multiple right-censored outcomes, which are common in studies with human subjects. Right-censored outcomes can result in biased estimates of incremental intervention effects unless properly adjusted for. This bias is akin to the well-known concept of confounding bias, and will likely be amplified over time in our case. However, extending incremental intervention effects to the right-censoring setup is not straightforward since, for example, it requires computing new remainder terms to construct the estimators.
In this paper we propose a more comprehensive form of incremental intervention effects that accommodates not only time-varying treatments, but also time-varying outcomes subject to right censoring (i.e., dropout). We provide an identifying expression for incremental intervention effects when dropout is conditionally ignorable, still without requiring (treatment) positivity, and derive the nonparametric efficiency bound for estimating such effects. We go on to present efficient nonparametric estimators, showing that they converge at fast $\sqrt{n}$ rates and give uniform inferential guarantees, even when nuisance functions are estimated at much slower rates with flexible machine learning tools. Importantly, we study the variance ratio of incremental effects to more conventional deterministic effects in a novel infinite time horizon setting, where the number of timepoints can grow to infinity with sample size. We specifically show that incremental intervention effects can reduce the variance near-exponentially, thus yielding extraordinary gains in statistical precision in this setup. Finally, we conduct a simulation study showing that our proposed methods can successfully adjust for subject dropout in incremental intervention effects, and apply our methods to a longitudinal study of the effect of low-dose aspirin on pregnancy outcomes.
2 Setup
We consider a study where for each subject we observe covariates $X_t$, treatment $A_t$, and outcome $Y_t$, with all variables allowed to vary over time $t$, but where subjects can drop out or be lost to follow-up. In particular, we consider the case where we want to observe a sample $(Z_1, \dots, Z_n)$ of i.i.d. observations from a probability distribution $\mathbb{P}$ with, for those subjects who remain in the study up to the final timepoint $t = T$,
$$Z = (X_1, A_1, Y_1, X_2, A_2, Y_2, \dots, X_T, A_T, Y_T).$$
But in general we only get to observe
[TABLE]
where is an indicator for whether the subject contributes data at time . We write as a shorthand for , so in the missingness process that we consider, subjects can drop out at each time after the measurement of covariates/treatment. This is motivated by the fact that this is likely the most common type of dropout, since outcomes at time are often measured together with or just prior to covariates at time . As we consider a monotone dropout (i.e., right-censoring) process, is non-increasing in time , i.e.,
[TABLE]
where are vectors of zeros and ones respectively. Thus our data structure is a chain with -th component
[TABLE]
for , where and we do not use or . Although we suppose each subject’s dropout will occur before the -th stage, our data structure also covers the case when the dropout will occur after the -th stage because in that case we can write
[TABLE]
as the -th component of our chain.
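As a small illustration of the monotone dropout structure described above, the sketch below checks that a subject's dropout indicators never return to 1 after a 0. The function names and the array layout (one row per subject, one column per timepoint) are assumptions made for this example and are not taken from the authors' implementation.

```python
import numpy as np

def is_monotone_dropout(R):
    """Return True if every row of the 0/1 array R is non-increasing in time."""
    R = np.asarray(R)
    # A row is monotone when no entry exceeds the entry just before it.
    return bool(np.all(np.diff(R, axis=1) <= 0))

def observed_mask(R):
    """Entry (i, t) is True if subject i has not dropped out by column t."""
    R = np.asarray(R)
    # The running product turns to 0 at the first dropout and stays there.
    return np.cumprod(R, axis=1).astype(bool)

# Hypothetical indicators for two subjects over four timepoints.
R_example = [[1, 1, 0, 0],
             [1, 1, 1, 1]]
print(is_monotone_dropout(R_example))
```

The running product in `observed_mask` enforces right-censoring even if the raw indicators were recorded inconsistently, which is one simple way to prepare data of this shape.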
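For simplicity, we consider binary treatment in this paper, so that the support of each $A_t$ is $\mathcal{A} = \{0, 1\}$. We use overbars and underbars to denote the past history and the future of a variable respectively, so that $\overline{A}_t = (A_1, \dots, A_t)$ and $\underline{A}_t = (A_t, \dots, A_T)$ for example. We also write $H_t = (\overline{X}_t, \overline{A}_{t-1}, \overline{Y}_{t-1})$ to denote all the observed past history just prior to receiving treatment at time $t$, with support $\mathcal{H}_t$. Finally, we use lower-case letters $a_t, h_t, x_t, y_t$ to represent realized values for $A_t, H_t, X_t, Y_t$, unless stated otherwise.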
Now that we have defined our data structure we turn to our estimation goal, i.e., which treatment effects we aim to estimate. We use $Y_t^{\overline{a}_t}$ to denote the potential (counterfactual) outcome at time $t$ that would have been observed under a treatment sequence $\overline{a}_t = (a_1, \dots, a_t)$ (note that we have $Y_t^{\overline{a}_T} = Y_t^{\overline{a}_t}$ as long as the future cannot cause the past). In longitudinal causal problems it is common to pursue quantities such as $\mathbb{E}(Y_t^{\overline{a}_t})$, i.e., the mean outcome at a given time under particular treatment sequences $\overline{a}_t$; for example one might compare the mean outcome under $\overline{a}_t = \overline{\bm{1}}_t$ versus $\overline{a}_t = \overline{\bm{0}}_t$, which represents how outcomes would change if all versus none were treated at all times. However, identifying these effects requires strong positivity assumptions (i.e., that all have some chance at receiving every treatment at every time), and estimating these effects often requires untenable parametric assumptions, especially when the number of timepoints $T$ is large.
Following Kennedy (2019) we instead consider incremental intervention effects, which represent how mean outcomes would change if the odds of treatment at each time were multiplied by a factor $\delta$ (e.g., $\delta = 2$ means odds of treatment are doubled). Incremental interventions shift propensity scores rather than impose treatments themselves; they represent what would happen if treatment were gradually more or less likely to be assigned, relative to the natural/observational treatment, in the population. Since they are ‘population-level’ effects, they are useful for giving an interpretable picture of overall societal effects, but will likely be less useful than classical deterministic effects for making specific recommendations about optimal treatment. Nonetheless, there are a number of benefits of studying incremental intervention effects: for example, positivity assumptions can be entirely and naturally avoided; complex effects under a wide range of intensities can be summarized with a single curve in $\delta$, no matter how many timepoints there are; and they more closely align with actual intervention effects than their fixed treatment regime counterparts. We refer to Kennedy (2019) for more discussion and details on the tradeoff between deterministic and incremental intervention effects.
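Formally, incremental interventions are dynamic stochastic interventions where treatment is assigned based on new interventional propensity scores defined by
$$q_t(H_t; \delta, \pi_t) = \frac{\delta \pi_t(H_t)}{\delta \pi_t(H_t) + 1 - \pi_t(H_t)} \tag{2}$$
not the observational propensity scores $\pi_t(H_t) = \mathbb{P}(A_t = 1 \mid H_t)$. In other words, $q_t$ is a shifted version of $\pi_t$ obtained by multiplying the odds of receiving treatment by $\delta$. We denote potential outcomes under the above intervention as $Y_t^{\overline{Q}_t(\delta)}$, where $\overline{Q}_t(\delta) = \{Q_1(\delta), \dots, Q_t(\delta)\}$ represents a sequence of draws from the conditional distributions $Q_s(\delta) \mid H_s = h_s \sim \text{Bernoulli}\{q_s(h_s; \delta, \pi_s)\}$, $s = 1, \dots, t$. We often drop $\delta$ and write $\overline{Q}_t$ when the dependence is clear from the context. Note here we use capital letters for the intervention indices since they are random, as opposed to $Y_t^{\overline{a}_t}$ where the intervention is deterministic. Therefore in this paper, we aim to estimate the mean counterfactual outcome
$$\psi_t(\delta) = \mathbb{E}\left(Y_t^{\overline{Q}_t(\delta)}\right)$$
for any $t \le T$. In the next section we describe the necessary conditions for identifying $\psi_t(\delta)$ in the presence of dropout.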
Remark 1.
To be precise, the incremental effect compounds two different changes. Consider only the first two timepoints. In this case the propensity score under the incremental intervention at the later timepoint will differ from its observational value for two reasons: 1) the odds of treatment are multiplied by $\delta$, and 2) the history it conditions on includes covariates that have already been changed by the (incremental) intervention at the earlier timepoint. With many timepoints, these effects compound over time and manifest as a single number, the incremental effect. This nuance stems from the nature of incremental interventions, i.e., the way they depend on the observational treatment process through $\pi_t$.
3 Identification
In this section, we give assumptions under which the entire marginal distribution of the counterfactual outcome $Y_t^{\overline{Q}_t(\delta)}$ is identified. Specifically, we require the following assumptions for all $t \le T$.
Assumption A1.
$Y_t = Y_t^{\overline{a}_t}$ if $\overline{A}_t = \overline{a}_t$.
Assumption A2-E.
$A_{t'} \perp\!\!\!\perp Y_t^{\overline{a}_t} \mid H_{t'}$ for all $t' \le t$.
Assumption A2-M.
$R_t \perp\!\!\!\perp (\underline{X}_t, \underline{A}_t, \underline{Y}_{t-1}) \mid H_{t-1}, A_{t-1}, R_{t-1} = 1$.
Assumption A3.
$\mathbb{P}(R_{t+1} = 1 \mid H_t, A_t, R_t = 1) \ge \epsilon$ for some $\epsilon > 0$ a.s.
Assumptions (A1) and (A2-E) correspond to consistency and exchangeability (or sequential ignorability) respectively, which are commonly adopted in the literature. Consistency means that the observed outcomes are equal to the corresponding potential outcomes under the observed treatment sequence, and would be violated in settings with interference for example. Exchangeability means that the treatment and counterfactual outcome are independent, conditional on the observed past (if there were no dropout), i.e., that treatment is as good as randomized at each time conditional on the past. Experiments ensure that exchangeability holds by construction.
In our work, we additionally require assumptions (A2-M) and (A3) because of the missingness/dropout. (A2-M) is the standard time-varying missing-at-random (MAR) assumption for monotone missingness, ensuring that dropout is independent of the future conditional on the observed history up to the current timepoint (e.g., Council et al., 2010; Robins et al., 1995; van der Laan and Robins, 2003). One may think of this type of MAR assumption as a sequentially random dropout process, where the decision to drop out at time $t$ is like the flip of a coin, with probability of ‘heads’ (dropout) depending only on the measurements recorded through time $t$ (Council et al., 2010, Chapter 4). This would be a reasonable assumption if we can collect enough data to explain the dropout process, so that those who drop out look like those who do not, given all past observed data. (A3) is a positivity assumption for missingness, meaning that each subject in the study has some non-zero chance of staying in the study at the next timepoint. This would be expected to hold in many studies, but may not if some subjects are ‘doomed’ to drop out based on their specific measured characteristics.
Importantly, here we do not require positivity conditions on the propensity scores, as we are targeting effects of the incremental intervention defined in (2), not deterministic effects. The next result gives an identifying expression for $\psi_t(\delta)$ under the above assumptions.
Theorem 3.1.
Suppose identification assumptions (A1) - (A3) hold. Then for all , the incremental effect on outcome with given value of , , equals
[TABLE]
*where , ,
, and*
[TABLE]
Here, and is some dominating measure for the distribution of .
To derive the identification result in Theorem 3.1, we follow Kennedy (2019) and use the g-formula (Robins, 1986), plugging in the incremental intervention for the treatment distribution and a point mass at $R_t = 1$ for the right-censoring indicator, and then apply the identification lemma (Lemma F.1 in the appendix) under the additional assumptions (A2-M) and (A3). The next corollary illustrates what this identification result gives in the simple point-exposure study.
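Corollary 3.1.
When $T = 1$, the data structure reduces to
$$Z = (X, A, R, RY),$$
thus in this case $R = 1$ means the outcome is not missing. Then the identifying expression simplifies to
$$\psi(\delta) = \mathbb{E}\left[\frac{\delta \pi(X) \mu(X, 1) + \{1 - \pi(X)\} \mu(X, 0)}{\delta \pi(X) + 1 - \pi(X)}\right]$$
where $\mu(x, a) = \mathbb{E}(Y \mid X = x, A = a, R = 1)$ and $\pi(x) = \mathbb{P}(A = 1 \mid X = x)$.
Therefore when $T = 1$, the effect $\psi(\delta)$ is simply a weighted average of the two regression functions $\mu(X, 0)$ and $\mu(X, 1)$, among those with observed outcomes, with weights depending on the propensity scores and $\delta$.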
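The weighted-average reading of the point-exposure case can be made concrete with a short plug-in sketch. It assumes the weights are the shifted propensity score $\delta\pi/(\delta\pi + 1 - \pi)$ and its complement, as in Kennedy (2019), and that the two regression functions are fitted among subjects whose outcome is observed. The function name and the numeric inputs are hypothetical.

```python
import numpy as np

def incremental_effect_point(pi, mu1, mu0, delta):
    """Plug-in value: average of q * mu1 + (1 - q) * mu0 with q the shifted propensity.

    pi  : propensity scores P(A = 1 | X) evaluated at each subject's covariates
    mu1 : outcome regression under treatment, among observed outcomes
    mu0 : outcome regression under control, among observed outcomes
    """
    pi, mu1, mu0 = (np.asarray(v, dtype=float) for v in (pi, mu1, mu0))
    numerator = delta * pi * mu1 + (1.0 - pi) * mu0
    denominator = delta * pi + 1.0 - pi
    return float(np.mean(numerator / denominator))

# Hypothetical nuisance values at three covariate points.
pi = np.array([0.2, 0.5, 0.8])
mu1 = np.array([1.0, 2.0, 3.0])
mu0 = np.array([0.0, 1.0, 2.0])

for delta in (0.5, 1.0, 2.0):
    print(delta, incremental_effect_point(pi, mu1, mu0, delta))
```

With $\delta = 1$ the value is the mean outcome under the observational treatment process, and as $\delta$ grows the value moves toward the average of `mu1`, which matches the interpretation of $\delta$ as a treatment-intensity dial.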
4 Efficiency Theory
In the previous section, we showed that the incremental intervention effect adjusted for subject dropout can be identified without requiring any positivity conditions on the treatment process. Our main goal in this section is to develop a nonparametric efficiency theory based on the efficient influence function for $\psi_t(\delta)$.
The efficient influence function plays a crucial role in non/semiparametric efficiency theory because 1) its variance gives an asymptotic efficiency bound, and 2) its form indicates how to do an appropriate bias correction in order to construct estimators that attain this efficiency bound. Mathematically, given a target parameter $\psi = \psi(\mathbb{P})$, an influence function $\phi$ acts as the derivative term in a distributional analog of a Taylor expansion, which can be seen to imply
$$\frac{\partial \psi(\mathbb{P}_\epsilon)}{\partial \epsilon}\Big|_{\epsilon = 0} = \int \phi(z; \mathbb{P}) \left(\frac{\partial \log d\mathbb{P}_\epsilon}{\partial \epsilon}\right)\Big|_{\epsilon = 0} d\mathbb{P} \tag{5}$$
for all smooth parametric submodels $\mathbb{P}_\epsilon$ containing the true distribution at $\epsilon = 0$, i.e., $\mathbb{P}_{\epsilon = 0} = \mathbb{P}$. Of all the influence functions, the efficient influence function is defined as the one whose variance equals the greatest of the lower bounds across all parametric submodels $\mathbb{P}_\epsilon$, thereby giving the efficiency bound for estimating $\psi$. For more details we refer to Section D in the appendix and references therein (Bickel et al., 1998; van der Vaart, 1998; van der Laan and Robins, 2003; Tsiatis, 2006; Kennedy, 2016).
The next theorem gives an expression for the efficient influence function for our incremental effect $\psi_t(\delta)$, under a nonparametric model.
Theorem 4.1.
The (uncentered) efficient influence function for the intervention effect , , is given by
[TABLE]
where , , and
[TABLE]
for . Here , , and is a dominating measure for the distribution of .
The proof is given in Appendix F.2. This result will be used to construct an efficient, model-free estimator for our new incremental intervention effects in the next section. In Theorem 4.1 all terms are either estimated via regression tools or obtained directly from the observed data. Note that we have new weighting terms, involving the dropout propensity scores $\omega_s$, that are used to adjust for dropout effects at each stage $s$. As one may expect, should all the data be fully observed (i.e., $R_t = 1$ a.e. for all $t$), both the identifying expression and efficient influence function reduce to the formulas presented in Kennedy (2019).
Remark 2.
Although we derived the above efficient influence function from first principles, based on the pathwise differentiability in (5), it could equivalently be derived using results on mapping complete- to observed-data influence functions under general coarsening at random (e.g., Robins et al., 1994, 1995; van der Laan and Robins, 2003; Tsiatis, 2006). However, in either case, computing error bounds requires deriving the second-order remainder terms in the von Mises expansion, which is new in our work and not immediate from the earlier results.
The above efficient influence function involves three types of nuisance functions: the treatment propensity scores $\pi_s$, the missingness/dropout propensity scores $\omega_s$, and the pseudo-outcome regression functions $m_s$. As in Kennedy (2019), each $m_s$ can be estimated through sequential regressions without resorting to complicated conditional density estimation, since they are marginalized versions of the full regression functions that condition on all of the past. We give the sequential regression formulation for $m_s$ in Appendix E.1.
The efficient influence function corresponding to $T = 1$ follows a relatively simple and intuitive form, equaling a weighted average of the efficient influence functions for $\mathbb{E}(Y^1)$ and $\mathbb{E}(Y^0)$ plus contributions from the propensity scores. We give this result in Appendix E.2 as well.
5 Estimation and Inference
5.1 Proposed Estimator
In this section we develop an estimator that can attain fast $\sqrt{n}$ rates, even when the nuisance functions are modeled nonparametrically and estimated at slower rates.
To begin, let $\varphi(Z; \eta, \delta, t)$ denote the uncentered efficient influence function from Theorem 4.1, which is a function of the observed data $Z$, indexed by a set of nuisance functions
[TABLE]
, and , where are the same nuisance functions defined in Theorem 4.1.
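Since $\psi_t(\delta) = \mathbb{E}\{\varphi(Z; \eta, \delta, t)\}$, a natural estimator would be the naive plug-in Z-estimator
$$\mathbb{P}_n\{\varphi(Z; \widehat{\eta}, \delta, t)\}$$
where $\widehat{\eta}$ represents a set of nuisance function estimates and $\mathbb{P}_n$ denotes the empirical measure, so that sample averages can be written as $\frac{1}{n}\sum_{i=1}^{n} f(Z_i) = \mathbb{P}_n\{f(Z)\}$.
If we assume $\pi_s$ and $\omega_s$ were correctly parametrically modeled, then one could use the following simple inverse-probability-weighted (IPW) estimator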
[TABLE]
Note that this IPW estimator is a special case of the plug-in estimator above in which $\widehat{m}_s$ is set to zero for all $s$.
However, the above inverse-weighted or plug-in Z-estimators typically require both strong parametric assumptions and empirical process conditions (e.g., Donsker-type or low entropy conditions) that restrict the flexibility of the nuisance estimators. The latter requirement arises from using the data twice (once for estimating the nuisance functions, and again for estimating the bias, i.e., the average of the uncentered influence function), which can cause overfitting. To avoid this downside and make our estimator more practically useful, here we use sample splitting (Zheng and van der Laan, 2010; Chernozhukov et al., 2016; Kennedy, 2019; Robins and Hernán, 2008). As will be seen shortly, sample splitting allows us to avoid complex empirical process conditions even when all the nuisance functions are estimated with arbitrarily flexible methods. Further, bias-corrected influence function-based estimators allow us to tolerate slower rates for nuisance estimation while attaining faster rates for estimation of the parameter of interest.
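Now we give an algorithm that allows slower than $\sqrt{n}$ rates and complex, non-Donsker nuisance estimation. First, we randomly split the observations $(Z_1, \dots, Z_n)$ into $K$ disjoint groups, using a random variable $S_i \in \{1, \dots, K\}$, $i = 1, \dots, n$, drawn independently of the data, where each $S_i$ denotes the group membership for unit $i$. Then our proposed estimator is given by
$$\widehat{\psi}_t(\delta) = \frac{1}{K}\sum_{k=1}^{K} \mathbb{P}_n^k\{\varphi(Z; \widehat{\eta}_{-k}, \delta, t)\}$$
where we let $\mathbb{P}_n^k$ denote sample averages only over group $k$, i.e., $\mathbb{P}_n^k\{f(Z)\} = \sum_{i=1}^{n} f(Z_i)\mathbb{1}(S_i = k) \big/ \sum_{i=1}^{n} \mathbb{1}(S_i = k)$, and let $\widehat{\eta}_{-k}$ denote the nuisance estimator constructed excluding group $k$. We detail exactly how to compute the proposed estimator in Appendix A.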
Our methods effectively utilize all the observed samples available at each time, without any need for discarding a subset of observed sample in advance. It is also worth noting that our algorithm is amenable to parallelization due to the sample splitting.
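The sample-splitting recipe can be illustrated with a deliberately simplified sketch. It is not the authors' estimator: it uses the IPW form rather than the full influence function, a single timepoint with one binary covariate, and stratum means as nuisance estimators so that it needs only NumPy. The data-generating process and all function names are hypothetical.

```python
import numpy as np

def stratum_means(values, keys, n_levels):
    """Mean of `values` within each integer stratum 0..n_levels-1."""
    sums = np.bincount(keys, weights=values, minlength=n_levels)
    counts = np.bincount(keys, minlength=n_levels)
    return sums / np.maximum(counts, 1)

def crossfit_ipw(X, A, R, Y, delta, K=5, seed=0):
    """Cross-fitted IPW estimate of the incremental effect for a single timepoint.

    X, A, R are 0/1 integer arrays; Y is only used where R == 1.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    S = rng.integers(0, K, size=n)      # fold labels, drawn independently of the data
    cell = 2 * X + A                    # joint stratum of (X, A)
    terms = np.zeros(n)
    for k in range(K):
        train, test = S != k, S == k
        # Nuisance estimates use only the folds other than k.
        pi_hat = stratum_means(A[train].astype(float), X[train], 2)     # P(A = 1 | X)
        om_hat = stratum_means(R[train].astype(float), cell[train], 4)  # P(R = 1 | X, A)
        pi_k = pi_hat[X[test]]
        om_k = om_hat[cell[test]]
        w_trt = (delta * A[test] + 1 - A[test]) / (delta * pi_k + 1 - pi_k)
        y_obs = np.where(R[test] == 1, Y[test], 0.0)   # missing outcomes never enter
        terms[test] = w_trt * R[test] / om_k * y_obs
    # Average the fold-specific sample means.
    return float(np.mean([terms[S == k].mean() for k in range(K)]))

# Hypothetical simulation with dropout that depends on X and A.
rng = np.random.default_rng(1)
n = 40000
X = rng.integers(0, 2, size=n)
A = rng.binomial(1, 0.3 + 0.4 * X)
R = rng.binomial(1, 0.9 - 0.2 * A - 0.1 * X)
Y = A + X + rng.normal(size=n)

psi_hat = crossfit_ipw(X, A, R, Y, delta=2.0)
print(psi_hat)
```

In this toy setup the target under $\delta = 2$ is the average shifted propensity plus $\mathbb{E}(X)$, so the printed value can be compared against a hand calculation. Because each fold's nuisance fit is independent of the others, the loop over `k` could be run in parallel.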
5.2 Asymptotic Theory
This subsection characterizes the asymptotic behavior of our proposed estimator, which is $\sqrt{n}$-consistent and asymptotically normal even when the nuisance functions are estimated nonparametrically at rates much slower than $\sqrt{n}$.
In what follows we denote the $L_2(\mathbb{P})$ norm of a function $f$ by $\|f\| = \left(\int f(z)^2 \, d\mathbb{P}(z)\right)^{1/2}$, to distinguish it from the ordinary $L_2$ norm $\|\cdot\|_2$ for a fixed vector. Also note that although we used $m_s$ to denote the pseudo-regression function defined in Theorem 4.1, in principle they are indexed by both the time $s$ and the increment parameter $\delta$, as in $m_{\delta,s}$. The next theorem shows uniform convergence of $\widehat{\psi}_t(\delta)$, which lays the foundation for subsequent statistical inferential and testing procedures.
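Theorem 5.1.
Define the variance function as $\sigma^2(\delta, t) = \mathbb{E}\left[\{\varphi(Z; \eta, \delta, t) - \psi_t(\delta)\}^2\right]$ and let $\widehat{\sigma}^2(\delta, t)$ denote its estimator. Assume:
- 1)
The set $\mathcal{D} = [\delta_l, \delta_u]$ is bounded with $0 < \delta_l \le \delta_u < \infty$.
- 2)
$\mathbb{P}(|m_s| \le C) = 1$, $\mathbb{P}(|\widehat{m}_s| \le C) = 1$, for some constant $C < \infty$.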
- 3)
$\sup_{\delta \in \mathcal{D}} \left| \frac{\widehat{\sigma}^2(\delta, t)}{\sigma^2(\delta, t)} - 1 \right| = o_{\mathbb{P}}(1)$, and $\left\| \sup_{\delta \in \mathcal{D}} |\varphi(Z; \widehat{\eta}, \delta, t) - \varphi(Z; \eta, \delta, t)| \right\| = o_{\mathbb{P}}(1)$.
- 4)
$\left( \sup_{\delta \in \mathcal{D}} \|m_{\delta,s} - \widehat{m}_{\delta,s}\| + \|\pi_s - \widehat{\pi}_s\| \right) \left( \|\widehat{\pi}_r - \pi_r\| + \|\widehat{\omega}_r - \omega_r\| \right) = o_{\mathbb{P}}\left(\frac{1}{\sqrt{n}}\right)$, for $r \le s \le t$.
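Then we have
$$\frac{\widehat{\psi}_t(\delta) - \psi_t(\delta)}{\widehat{\sigma}(\delta, t) / \sqrt{n}} \rightsquigarrow \mathbb{G}(\delta, t)$$
in $\ell^{\infty}(\mathcal{D})$, where $\mathbb{G}$ is a mean-zero Gaussian process with covariance $\mathbb{E}[\mathbb{G}(\delta_1, t)\mathbb{G}(\delta_2, t)] = \mathbb{E}[\widetilde{\varphi}(Z; \eta, \delta_1, t)\widetilde{\varphi}(Z; \eta, \delta_2, t)]$ and $\widetilde{\varphi}(Z; \eta, \delta, t) = \{\varphi(Z; \eta, \delta, t) - \psi_t(\delta)\} / \sigma(\delta, t)$.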
A proof of the above theorem can be found in Appendix F.5. We also analyze the second-order remainders for the efficient influence function, and keep the intervention distribution completely general (see Lemmas F.2, F.7, and F.8 in the appendix). Therefore, one may apply our results to studies of other stochastic interventions under missingness/dropout as well.
Assumptions 1), 2) and 3) in Theorem 5.1 are all quite weak. Assumptions 1) and 2) are mild boundedness conditions, where assumption 2) could be further relaxed at the expense of a less simple proof, for example using bounds on $L_p$ norms. Assumption 3) is also a mild consistency assumption, with no requirement on the rate of convergence. The main substantive assumption is Assumption 4), which requires that the product of nuisance estimation errors vanish at a fast enough rate. One sufficient condition for this is that all the nuisance functions are consistently estimated at a rate of $n^{-1/4}$ or faster.
Lowering the bar from $n^{-1/2}$ to $n^{-1/4}$ indeed allows us to employ a richer set of modern machine learning tools, since such rates are attainable under diverse structural constraints (e.g., Yang et al., 2015; Raskutti et al., 2012; Györfi et al., 2006). In this paper, however, we are agnostic about how such rates should be attained. In practice, we may want to consider using different estimation techniques for each of $m_s$, $\pi_s$, and $\omega_s$ based on our prior knowledge and descriptive information, or use ensemble learners.
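Based on the result in Theorem 5.1, we can construct pointwise $1 - \alpha$ confidence intervals for $\psi_t(\delta)$ as
$$\widehat{\psi}_t(\delta) \pm z_{1 - \alpha/2} \frac{\widehat{\sigma}(\delta, t)}{\sqrt{n}}$$
where $\widehat{\sigma}^2(\delta, t)$ is the variance estimator defined in Theorem 5.1. Following Kennedy (2019), one may use the multiplier bootstrap for uniform inference, by replacing the $z_{1 - \alpha/2}$ critical value with $\widehat{c}_\alpha$ satisfying
$$\mathbb{P}\left(\sup_{\delta \in \mathcal{D}} \left|\frac{\widehat{\psi}_t(\delta) - \psi_t(\delta)}{\widehat{\sigma}(\delta, t) / \sqrt{n}}\right| \le \widehat{c}_\alpha\right) = 1 - \alpha + o(1).$$
We refer to Kennedy (2019) for details on how to construct $\widehat{c}_\alpha$ via the multiplier bootstrap.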
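For the pointwise case, the interval is an ordinary Wald interval and can be computed with the standard library alone. The sketch below assumes a point estimate and an estimated standard deviation of the influence function are already available; the function name and the numbers are hypothetical, and the uniform (multiplier bootstrap) band is not covered.

```python
import math
from statistics import NormalDist

def pointwise_ci(psi_hat, sigma_hat, n, alpha=0.05):
    """Wald interval psi_hat +/- z_{1 - alpha/2} * sigma_hat / sqrt(n)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sigma_hat / math.sqrt(n)
    return psi_hat - half_width, psi_hat + half_width

# Hypothetical inputs: estimate 1.14, influence-function sd 2.0, n = 10000.
lo, hi = pointwise_ci(psi_hat=1.14, sigma_hat=2.0, n=10000)
print(lo, hi)
```

Repeating this over a grid of $\delta$ values gives pointwise intervals along the effect curve; a uniform band would use a larger critical value in place of `z`.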
6 Infinite Time Horizon Analysis
The great majority of the causal inference literature considers a finite time horizon where the number of timepoints $T$ is small and fixed, or even just equal to one, a priori ruling out much significant (if any) longitudinal structure. However, in practice more and more studies accumulate data across very many timepoints, due to ever increasing advances in data collection technology. In fact, in many applications $T$ can even be comparable to or larger than the sample size $n$. This renders most classical methods based on finite time horizons inadequate, as their theoretical results have not been validated in regimes where $T$ can grow to infinity. For example, Kumar et al. (2013) describe how new mobile and wearable sensing technologies have revolutionized randomized trials and other health-care studies by providing data at very high sampling rates (10-500 times per second). Klasnja et al. (2015) and Qian et al. (2020) consider a large number of timepoints in their study of micro-randomized trials for evaluating just-in-time adaptive interventions via mobile applications. As more granular and fine-grained data are collected, some recent studies explore efficient off-policy estimation techniques in infinite-time horizon settings (e.g., Liu et al. (2018) in reinforcement learning). Interestingly, though, there has been no formal analysis for general longitudinal studies.
Therefore in this section we analyze the behavior of the IPW version of our proposed estimator (relative to the standard IPW estimator in classical deterministic settings), in a more realistic regime where $T$ can scale with sample size. To the best of our knowledge, this is one of the first such infinite-horizon analyses in causal inference, outside of some recent similarly specialized examples involving dynamic treatment regimes (Laber et al., 2018; Ertefaie and Strawderman, 2018). Specifically, we study the variance ratio bound, and show how deterministic effects are afflicted by an inflated variance relative to incremental intervention effects as $T$ grows.
We proceed with comparing the variances of estimators of the deterministic effect for the always-treated (receiving treatment at every timepoint), $\psi_{at} = \mathbb{E}[Y^{\overline{\bm{1}}_T}]$, versus the incremental effect $\psi_{inc} = \mathbb{E}[Y^{\overline{Q}_T(\delta)}]$ for $\delta > 1$. For simplicity and concreteness, in what follows we consider a simple setup where the propensity scores are all equal to $p$ (i.e., $\pi_t = p$ for all $t$) and there is no dropout (i.e., $R_t = 1$ a.e. for all $t$). This makes the pseudo-regression functions in Theorem 4.1 equal to zero. In this setup we have unbiased estimators of the always-treated effect and the incremental effect given by
$$\widehat{\psi}_{at} = \mathbb{P}_n\left\{\prod_{t=1}^{T} \frac{A_t}{p} \, Y\right\}$$
and
$$\widehat{\psi}_{inc} = \mathbb{P}_n\left\{\prod_{t=1}^{T} \frac{\delta A_t + 1 - A_t}{\delta p + 1 - p} \, Y\right\}$$
respectively, where $Y = Y_T$. In the next theorem, we analyze the variance ratio of the two estimators and show that one can achieve near-exponential precision gains by targeting $\psi_{inc}$.
Theorem 6.1.
Consider the estimators and conditions defined above. Further assume that for some constant and . Then for any ,
[TABLE]
where and for any fixed value of such that $\frac{1}{1 - p^{T} \left(\mathbb{E}\left[Y^{\overline{\bm{1}}_{T}}\right]\right)^{2} \big/ \mathbb{E}\left[\left(Y^{2}\right)^{\overline{\bm{1}}_{T}}\right]} \le c$.
The proof of the above theorem is given in Appendix F.3 and is based on the similar logic used in deriving the g-formula (Robins, 1986). Note that we only require two very mild assumptions in the above theorem: the boundedness assumption on , and , which is equivalent to saying is a non-degenerate random variable. In the proof, we give a more general result for any sequence as well.
Theorem 6.1 allows us to precisely quantify the relative statistical certainty in estimating the two effects. Specifically, since for and is bounded (and converging to one monotonically), the variance ratio decays exponentially in . This implies that we may reap extraordinary gains in statistical precision from targeting instead of when a study incorporates a substantial number of timepoints. The same goes for effects for the never-treated versus the incremental interventions with (see Appendix F.3).
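To make the exponential rate concrete, the following sketch computes the second moments of the two inverse-probability weights in closed form under the simplified setup above (constant propensity score, no dropout, treatments drawn independently at each timepoint). It assumes the usual weight forms, $\prod_t A_t/p$ for the always-treated effect and $\prod_t (\delta A_t + 1 - A_t)/(\delta p + 1 - p)$ for the incremental effect as in Kennedy (2019), and it ignores the outcome entirely, so it is not the bound of Theorem 6.1 itself. The function names and the values of `p` and `delta` are illustrative assumptions.

```python
def second_moment_always_treated(p: float, T: int) -> float:
    """E[(prod_t A_t / p)^2] when A_t ~ Bernoulli(p) independently over t."""
    # Each squared factor (A_t / p)^2 has mean p * (1 / p)^2 = 1 / p.
    return (1.0 / p) ** T


def second_moment_incremental(p: float, delta: float, T: int) -> float:
    """E[(prod_t (delta*A_t + 1 - A_t) / (delta*p + 1 - p))^2]."""
    # Each squared factor has mean (delta^2 * p + 1 - p) / (delta*p + 1 - p)^2.
    per_step = (delta**2 * p + 1.0 - p) / (delta * p + 1.0 - p) ** 2
    return per_step**T


if __name__ == "__main__":
    p, delta = 0.5, 2.0  # illustrative values only
    for T in (1, 5, 10, 20):
        ratio = second_moment_always_treated(p, T) / second_moment_incremental(p, delta, T)
        print(f"T={T:2d}  second-moment ratio = {ratio:.3g}")
```

Both weights have mean one, so each variance is the second moment minus one. For these illustrative values the per-timepoint factor is 2 for the always-treated weight and 10/9 for the incremental weight, so the ratio of second moments grows geometrically in the number of timepoints.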
Remark 3.
The variance ratio we study in this section is somewhat distinct from usual relative efficiency, since here we are considering two different (but closely related) target parameters. However, when we are indifferent about the inferential target, the variance ratio can still serve as useful guidance in selecting an estimator. As , the gap between the two effects monotonically shrinks to zero and the two target parameters and eventually become identical, so the variance ratio goes to . On the other hand, for finite , the two effects are not quite the same, but which of the two is of more interest is debatable ( could indeed be preferable if we aim to describe how outcomes would vary with more practical gradual changes in treatment intensity). If we do not have a strong reason to prefer one effect over the other, we could choose the one with smaller variance in favor of improved statistical precision. This issue also arises for local effects under positivity violations, instrumental variables, etc. (e.g., Imbens, 2014; Aronow, 2016; Crump et al., 2009), where an estimand is adaptively chosen on the basis of smaller variance.
In what follows we refine Theorem 6.1 so that one can characterize the minimum number of timepoints to guarantee a smaller variance for .
Corollary 6.1.
There exists a finite number such that
[TABLE]
for every , where is never greater than
[TABLE]
The proof, given in Appendix F.4, relies upon the fact that can be represented as a variance of the weighted sum of the IPW estimators for , (see Lemma F.6 in the appendix).
Remark 4.
It may be possible to further tighten the upper bound for , but given the focus of our paper this would be neither very illuminating nor practically meaningful, since the value of in the above corollary is already quite small in general. To illustrate, consider , , and an extreme case of (i.e., is mostly concentrated around [math]). Then . If we use , then .
Theorem 6.1 and Corollary 6.1 can be generalized to the case of observational studies where the nuisance functions need to be estimated, but our view is that the simple case captures the main ideas and the general case would only add complexity.
To empirically assess the validity of Theorem 6.1, we conduct two simple simulation studies as below.
Simulation 1 (Randomized Trial). We set and let truncated at two standard deviations. Based on this data generation process, given a value of , we generate 100 different datasets for , , where we make sure the positivity assumption is valid in our simulation (this is done in a spirit similar to Laplace smoothing in naive Bayes). Then we compute the sample variance of each estimator and their ratio, and present them in Figure 1.
Simulation 2 (Observational Study). Although not directly addressed in Theorem 6.1, here we also consider the setting for observational studies. Specifically, we consider a model
[TABLE]
[TABLE]
[TABLE]
for all , where we let , and let denote the inverse logit function. In this simulation, we assume that it is more (less) likely to receive a treatment if a subject has (not) received treatments recently. The rest of the specification remains the same as Simulation 1. The result is presented in Figure 2.
The simulation results support our theoretical results. Overall, the results in this section provide crucial insight into longitudinal studies with many timepoints, suggesting that massive gains in statistical certainty are possible by studying incremental rather than classical deterministic effects.
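A simulation in the spirit of Simulation 1 can be sketched in a few lines. The version below is only an assumption-laden illustration, not the paper's exact design: it uses a constant propensity `p`, a hypothetical outcome model (one plus the fraction of treated timepoints plus Gaussian noise), and the IPW weight forms from Kennedy (2019). It then compares the sample variances of the always-treated and incremental IPW estimates across repeated datasets.

```python
import random
import statistics


def ipw_estimates(T, n, p, delta, rng):
    """Return (always-treated, incremental) IPW estimates from one simulated trial."""
    det_sum = 0.0
    inc_sum = 0.0
    for _ in range(n):
        treatments = [1 if rng.random() < p else 0 for _ in range(T)]
        # Hypothetical outcome model: depends on the fraction of treated timepoints.
        y = 1.0 + sum(treatments) / T + rng.gauss(0.0, 0.5)
        w_det = 1.0
        w_inc = 1.0
        for a in treatments:
            w_det *= a / p  # zero unless treated at every timepoint
            w_inc *= (delta * a + 1 - a) / (delta * p + 1 - p)
        det_sum += w_det * y
        inc_sum += w_inc * y
    return det_sum / n, inc_sum / n


def simulate_variance_ratio(T=8, n=500, p=0.5, delta=2.0, reps=200, seed=0):
    """Sample variances of both estimators over `reps` datasets, and their ratio."""
    rng = random.Random(seed)
    pairs = [ipw_estimates(T, n, p, delta, rng) for _ in range(reps)]
    var_det = statistics.variance([d for d, _ in pairs])
    var_inc = statistics.variance([i for _, i in pairs])
    return var_det, var_inc, var_det / var_inc


if __name__ == "__main__":
    for T in (2, 4, 8):
        var_det, var_inc, ratio = simulate_variance_ratio(T=T)
        print(f"T={T}: var(always-treated)={var_det:.4f}, "
              f"var(incremental)={var_inc:.4f}, ratio={ratio:.1f}")
```

Under these assumed settings the always-treated weight is nonzero only for subjects treated at every timepoint, which becomes rare as the number of timepoints increases, so its variance inflates much faster than that of the incremental weight.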
7 Experiments
7.1 Simulation Study
In this section we explore finite-sample performance of the proposed estimator via synthetic simulation. We consider the following data generation model
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
for . is a constant used to control the average amount of dropout. In this setup we assume that the more likely subjects have been treated, the more likely they will receive the treatment in the next timepoint in general. Moreover, the dropout probability at each is largely driven by the sign of : the dropout probability will be low (high) if . Therefore, although each is designed to have a symmetric, bimodal distribution with a mean of [math], the value of for surviving subjects will tend to be much greater than [math]. Due to the way also interacts with the outcome in the above model, discarding all the subjects that have dropped out should lead to an upward-biased estimate of the incremental intervention effects. In Appendix B, we provide auxiliary figures to aid understanding of our simulation. Variables akin to that considerably affect both outcome and dropout are commonly found in practice (e.g., side effects).
We estimate the incremental effect at . We compare our proposed estimator () with three baseline methods: the naive Z-estimator () and the IPW estimator (), both of which are defined in Section 5.1, and the original incremental-effect estimator () proposed by Kennedy (2019). Note that for using we have to discard samples that have ever dropped out, whereas in other estimators surviving subjects are properly re-weighted for dropout adjustment at each timepoint. Since finite-sample properties of were already extensively explored in Kennedy (2019), here we primarily focus on the effect of dropout in the longitudinal setting.
To estimate nuisance parameters, following Kennedy (2019) we form an ensemble of some widely-used nonparametric models. Specifically, we use the cross-validation-based Super Learner ensemble algorithm (Van der Laan et al., 2007) via the SuperLearner package in R to combine support vector machines, random forests, and k-nearest neighbor regression. For and , we use -fold sample splitting.
We repeat simulation times in which we draw samples each simulation. We use values of equally spaced on the log-scale within . As in Kennedy (2019), performance of each estimator is assessed via integrated bias and root-mean-squared error (RMSE) defined by
[TABLE]
where and are the estimate and true value of the target parameter respectively, for -th simulation and . We present the results in Table 1.
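The two performance measures can be computed as in the sketch below. It assumes the definitions used in Kennedy (2019): the integrated bias averages, over the grid of increment values, the absolute value of the mean error across simulations, and the integrated RMSE averages the per-grid-point root-mean-squared error and scales it by the square root of the sample size. The exact scaling conventions, the function names, and the grid endpoints in the usage example are assumptions for illustration.

```python
import math


def log_spaced(lo, hi, k):
    """k values equally spaced on the log scale between lo and hi (inclusive)."""
    step = (math.log(hi) - math.log(lo)) / (k - 1)
    return [math.exp(math.log(lo) + i * step) for i in range(k)]


def integrated_bias_rmse(estimates, truth, n):
    """estimates[j][i]: estimate at the i-th grid point in the j-th simulation.

    truth[i]: true value at the i-th grid point; n: sample size per simulation.
    Returns (integrated bias, sqrt(n)-scaled integrated RMSE).
    """
    J = len(estimates)
    I = len(truth)
    bias = 0.0
    rmse = 0.0
    for i in range(I):
        errors = [estimates[j][i] - truth[i] for j in range(J)]
        bias += abs(sum(errors) / J)
        rmse += math.sqrt(sum(e * e for e in errors) / J)
    return bias / I, math.sqrt(n) * rmse / I


if __name__ == "__main__":
    grid = log_spaced(0.1, 10.0, 5)  # hypothetical endpoints
    print([round(d, 3) for d in grid])
    print(integrated_bias_rmse([[1.0, 2.0], [3.0, 2.0]], [2.0, 2.0], n=4))
```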
As shown in Table 1, when there is a substantial amount of subject dropout, shows much worse performance than all the other dropout-adjusted estimators, which is expected by the design of our data generation model. However, this gap shrinks as dropout rates decrease. Also in each setting, the proposed estimator appears to perform better and more markedly improve with sample size than and . This behavior is indicative of the validity of our theory that is not only able to adjust for the dropout process, but more efficient than and .
7.2 Application
Here we illustrate the proposed methods in analyzing the Effects of Aspirin on Gestation and Reproduction (EAGeR) data, which evaluates the effect of daily low-dose aspirin on pregnancy outcomes and complications. The EAGeR trial was the first randomized trial to evaluate the effect of pre-conception low-dose aspirin on pregnancy outcomes (Schisterman et al., 2014; Mumford et al., 2016). However, to date this evidence has been limited to intention-to-treat analyses.
The design and protocol used for the EAGeR study have been previously documented (Schisterman et al., 2013). Overall, 1,228 women were recruited into the study (615 aspirin, 613 placebo) and 11% of participants chose to drop out of the study before completion. Roughly 43,000 person-weeks of information were available from daily diaries, as well as study questionnaires, and clinical and telephone evaluations collected at regular intervals over follow-up. The dataset is characterized by a substantial degree of non-compliance (more than 50% at the end of the study), and thereby is susceptible to positivity violation.
We used our incremental propensity score approach to evaluate the effect of aspirin on pregnancy outcomes in the EAGeR trial, accounting for time-varying exposure and dropout. We let each variable become a constant equal to its final value after the time point at which no more actual data is collected on the subject, so we have balanced panel data as described in (1). Here, the study terminates at week 89 (). We use 24 baseline covariates (e.g., age, race, income, education, etc.) and 5 time-dependent covariates (compliance, conception, vaginal bleeding, nausea and GI discomfort). is a binary treatment variable coded as if a woman took aspirin at time and [math] otherwise. indicates that the woman is observed in the study at time . Lastly, is an indicator of having a pregnancy outcome of interest at time . We are particularly interested in two types of pregnancy outcomes: live birth and pregnancy loss (fetal loss). We perform separate analysis for each of the two cases.
For comparative purposes, we estimate the simple complete-case effect
[TABLE]
which relies on both non-compliance and drop-out being completely randomized. The value of is 0.052 (5.2%) for live birth and 0.012 (1.2%) for pregnancy loss, both of which are close to the intention-to-treat estimates reported in Schisterman et al. (2013, 2014).
We give a brief discussion on why standard modeling approaches fail here. We found strong evidence of positivity violations in the EAGeR dataset; as shown in Figure 7 in Appendix C, the average propensity score quickly drops to zero as grows. This suggests that very few patients follow the given protocol of taking aspirin late in the study. Thus, it is unrealistic to use an intervention where all participants would take aspirin at every time, as required in many standard models including the popular marginal structural models (MSMs) (Robins et al., 2000). In fact, when we modeled the effect curve by so that the coefficient for exposure can vary with time, then the standard inverse-weighted MSM fits failed and no coefficient estimates were found even for moderate values of (see Figure 7-(b) in Appendix C for a closer look). This positivity violation precludes other standard approaches for time-varying treatments as well.
One quick remedy could be to move away from standard ATEs and instead only estimate the mean outcome if no one were treated, comparing to the observed outcome (not all versus none as in the ATE). Then we can apply some other nonparametric approaches available in the literature for estimating this one-sided counterfactual. When we use the g-computation (plug-in) estimator (Robins, 1986), the result seems to suggest that the mean outcome if no one received aspirin is worse than the observed (Figure 8 in Appendix C). However, when we use the sequential doubly robust (SDR) estimator (Luedtke et al., 2017), the huge overlap between the 95% confidence intervals prevents us from drawing any firm conclusion (Figure 9 in Appendix C).
Now, we estimate the incremental effect curve , which represents the probability of having live birth or pregnancy loss at the end of the study () if the odds of taking aspirin for all women were increased by a factor of at all timepoints, across different values of . Again, we use the cross-validated Super Learner algorithm (Van der Laan et al., 2007) to combine support vector machines, random forests, and k-nearest neighbor regression to estimate a tuple of nuisance functions at each . We proceed with sample splitting with splits, and use 10,000 bootstrap replications to compute pointwise and uniform confidence intervals. Results are shown in Figure 3.
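The intervention described above, multiplying the odds of treatment by a fixed factor, corresponds to a simple transformation of the propensity score. The sketch below illustrates that transformation; the function names and the numerical values are illustrative assumptions, and the formula is the standard incremental propensity score shift from Kennedy (2019).

```python
def odds(prob: float) -> float:
    """Odds corresponding to a probability strictly less than one."""
    return prob / (1.0 - prob)


def shift_propensity(pi: float, delta: float) -> float:
    """Propensity score after multiplying the odds of treatment by delta."""
    return delta * pi / (delta * pi + 1.0 - pi)


if __name__ == "__main__":
    pi, delta = 0.2, 3.0  # illustrative values only
    shifted = shift_propensity(pi, delta)
    print(f"original propensity {pi}, shifted propensity {shifted:.4f}")
    print(f"odds ratio: {odds(shifted) / odds(pi):.4f}")
```

Note that a propensity score of exactly zero or one is left unchanged by the shift, which is the intuition for why these interventions do not require positivity.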
The estimated curve in Figure 3 appears to be almost flat for live birth, and to have a slightly negative gradient with respect to (odds ratio) for pregnancy loss. However, at level we fail to reject the null of no incremental intervention effects in both cases (as both confidence bands contain a horizontal line). This is mainly due to the noncompliance of aspirin takers, which makes the bands too wide for large regimes. Thus, our analysis yielded a result similar to the previous findings of Schisterman et al. (2014), indicating that use of low-dose aspirin was not significantly associated with live birth or pregnancy loss. Nonetheless, the estimated incremental intervention effects provide more detailed information with greater nuance, without requiring parametric or positivity assumptions.
Remark 5.
In this analysis we have looked into the intervention that does tell us about the effect of overall increase or decrease in treatment at each , but not the optimal timing of treatment (e.g., when aspirin should be prescribed since conception). As pointed out by Kennedy (2019), one could address such timing issues by considering depending on time and covariate history, which will bring added complexity. We leave this to our future work.
8 Discussion
Incremental interventions are a novel class of stochastic dynamic interventions where positivity assumptions can be completely avoided. However, they had not been extended to repeated outcomes, and without further assumptions do not give identifiability under dropout, both of which are very common in practice. In this paper we solved this problem by showing how incremental intervention effects are identified and can be estimated when dropout occurs (conditionally) at random. Even in the case of many dropouts, our proposed method efficiently uses all the data without sacrificing robustness. We gave an identifying expression for incremental intervention effects under monotone dropout, without requiring any positivity assumptions. We established general efficiency theory and constructed the efficient influence function, and presented nonparametric estimators which converge at fast rates and yield uniform inferential guarantees, even when all the nuisance functions are estimated with flexible machine learning tools at slower rates. Furthermore, we analyzed the variance ratio of incremental intervention effects to conventional deterministic dynamic intervention effects in a novel infinite time horizon setting in which the number of timepoints can possibly grow with sample size, and showed that incremental intervention effects can yield near-exponential gains in statistical precision. Finally, we showed via the simulation study that the proposed methods can effectively mitigate the bias caused by subject dropout, and applied the methods to a study of the effect of low-dose aspirin on pregnancy outcomes.
There are a number of avenues for future work. The first is application to other substantive problems in medicine and the social sciences. For example, in a forthcoming paper we analyze the effect of aspirin on pregnancy outcomes with more extensive data. It will also be important to consider other types of non-monotone missingness where the standard time-varying MAR assumption A2-M may not be appropriate (Sun and Tchetgen, 2014; Tchetgen et al., 2016). We expect that our approach can be extended to other important problems in causal inference; for example, one could develop incremental intervention effects for continuous treatments and instruments (Kennedy et al., 2017, 2019), or for mediation in the same spirit as Díaz and Hejazi (2019), but generalized to the longitudinal case with dropout. Developing incremental-based sensitivity analyses for the longitudinal MAR assumption would also be an interesting extension.
Acknowledgement
Edward Kennedy and Ashley Naimi gratefully acknowledge financial support from the NSF (Grant # DMS1810979) and NIH (Grant # R01HD093602) for this research, respectively. This work was also supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland, contract numbers HHSN267200603423, HHSN267200603424, and HHSN267200603426. We are also grateful for useful comments from two anonymous referees. This work was completed while Kwangho Kim was a PhD student at Carnegie Mellon University.
Appendix A Algorithm
An algorithm detailing how to compute the proposed estimator (6) at is given in Algorithm 1 as below.
Appendix B Auxiliary figures for the simulation study
We provide some auxiliary figures to help readers better understand the simulation setup and result presented in Section 7.1 using a random example. Figures 4 and 5 illustrate how the dropout process may induce a large upward bias in estimation of incremental effects as shown in Figure 6. Figure 6 also shows our methods can successfully adjust for dropout. All the results in this example are measured at with the dropout rate of 52.5%.
Appendix C Alternative approaches for the EAGeR data analysis
Here, we discuss why standard approaches fail for our analysis of the EAGeR dataset in Section 7.2 of the main text. For comparative purposes, we alter our target effect and then apply some other nonparametric approaches available in the literature. Then we compare the result with the one we obtained in Section 7.2.
C.1 Why standard models fail: positivity violation
All the standard models dealing with time-varying treatments, except on very rare occasions, require treatment positivity. However, as will be elaborated below, positivity is likely violated in the EAGeR dataset. Many individuals turned out not to follow the given protocol of taking aspirin, and this non-compliance only worsens over time. To illustrate this, we present the average propensity score over time in Figure 7-(a). As shown in Figure 7-(a), the average propensity score quickly drops to zero as grows. In other words, Figure 7-(a) implies that it would be hard to imagine having all of the study participants take aspirin at each time.
Even if positivity is only nearly violated, it can pose a serious problem in attempting to estimate our target causal effect. One of the most widely-used approaches to handle time-varying treatments is marginal structural models (MSMs) (Robins et al., 2000). In practice, MSMs are often estimated via inverse probability weighting (IPW). The following quantity appears in the IPW (also in the doubly robust) moment condition
[TABLE]
for any choice of (with matching dimensions) where . However, Figure 7-(b) indicates that on average the cumulative product of propensity scores sharply drops to zero even with moderate . This would cause standard estimation techniques such as IPW to fail, as easily blows up.
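The mechanism behind this failure is easy to see numerically. The sketch below uses a hypothetical constant propensity score of 0.5 (not a value estimated from the EAGeR data) to show how quickly the cumulative product shrinks and how large the corresponding inverse weight becomes; the function name is an illustrative assumption.

```python
def cumulative_products(propensities):
    """Running product of propensity scores over successive timepoints."""
    products = []
    running = 1.0
    for pi in propensities:
        running *= pi
        products.append(running)
    return products


if __name__ == "__main__":
    # Hypothetical propensity scores for a subject treated at every timepoint.
    propensities = [0.5] * 10
    for t, prod in enumerate(cumulative_products(propensities), start=1):
        print(f"t={t:2d}  cumulative product={prod:.6f}  inverse weight={1.0 / prod:.0f}")
```

Even with a propensity score as large as 0.5 at every timepoint, the inverse weight exceeds one thousand after ten timepoints; with propensity scores that themselves decay toward zero, the weights grow far faster.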
More specifically, when we parametrically model the effect curve by so that the coefficients for exposure can vary with time, an inverse-weighted MSM estimator that is the solution to
[TABLE]
indeed fails since no coefficient estimates can be found in the above equation even for moderate values of , e.g., . Thus, it appears that positivity violation in our dataset precludes the standard MSM-based approach. We remark that these limitations are not at all unique to the analysis of our EAGeR dataset, but also common to many observational studies based on the MSM or other approaches (e.g., Luedtke et al., 2017).
C.2 Alternative approach
Due to the positivity violation, the estimation results, if any, through standard approaches will remain dubious at best. Therefore, we alter our target contrast from the standard ATE to the mean outcome we would have observed in a population if “observed” versus none (not all versus none) were treated, which is defined by
[TABLE]
where denotes an observed history of aspirin consumption. This new estimand would tell us how the mean outcome would have changed if no one in the population had taken aspirin throughout the study. In this way, we can avoid estimating the problematic counterfactual $\mathbb{E}\big[Y^{\bar{A}_{T}=\bar{\mathbf{1}},\bar{R}_{T}=\bar{\mathbf{1}}}\big]$. However, by construction this solution has a fundamental limitation, because we have sacrificed the causal effect of original interest.
In order to estimate our new causal parameter (7), here we use the g-computation (plug-in) estimator (Robins, 1986) and the sequential doubly robust (SDR) estimator proposed by Luedtke et al. (2017), which allows right-censored data structures. (We also tried a weighting estimator but omit the result here, since it gives almost the same result as g-computation, only with wider confidence bands.)
C.3 Estimation and inference
Estimation. First for the g-computation estimator, we estimate the following g-formula
[TABLE]
via plug-in estimators of the pseudo-outcome regression function at each time step. Next, for the SDR estimator, we tailor Algorithm 2 of Luedtke et al. (2017) for our right-censored data structures (everything remains the same except that we add the condition on each pseudo-outcome regression function). For both methods, we use the same nonparametric ensemble we used in Section 7.2.
Inference. Confidence intervals are estimated by bootstrapping at the 95% level for both estimators. Note that for the SDR estimator, we are guaranteed to consistently estimate standard errors (pointwise) by bootstrapping due to the following asymptotic property,
[TABLE]
for all , where is the influence function of . However, this is no longer guaranteed for the g-computation estimator.
C.4 Result
For the sake of completeness, we estimate each for all and present the cumulative effects over time . The results for the g-computation and the SDR estimators are presented in Figures 8, 9, respectively.
The result based on the g-computation estimator in Figure 8 shows that the counterfactual mean outcomes for never-takers (individuals who have never taken aspirin throughout the study) are worse than the observed. Specifically, for the never-takers the probability of having live birth decreases and the probability of having fetal loss increases. The result seems to be statistically significant at .
On the other hand, the result based on the SDR estimator in Figure 9 indicates that although the mean effects for the never-takers still appear to be worse off than the observed, they look no longer statistically significant. Hence in this case we cannot draw any firm conclusion about the effect of aspirin on pregnancy outcome.
It might be tempting to take the results from Figure 8, as they seem to deliver a clearer message. However, we do not know if our variance estimates there are correct in the first place. The accuracy of our estimate is further afflicted by the moderate sample size (=1024) due to the slow convergence rates of the plug-in regression. These issues can be mitigated in the SDR estimator. Thus, we should rather rely on the results presented in Figure 9, which basically tell us that the effect of low-dose aspirin is insignificant and remains dubious, based on the causal effect defined in (7).
Finally, it should be noted that due to the positivity violation we end up limiting ourselves to a narrower notion of causal effects (i.e., observed versus none), which is different from the ATE-type estimands that are typically of utmost interest for policy makers. The causal effect in (7) might not be practically meaningful for aspirin prescription for pregnant women, since we are in general much more interested in the always-taker group than the never-taker group.
Appendix D More details on influence functions and efficiency bound
Here, we introduce the influence function, which is a foundational object of statistical theory that allows us to characterize a wide range of estimators with favorable theoretical properties. There are two notions of the influence function: one for estimators and the other for parameters. To distinguish these two cases we will call the latter, which corresponds to parameters, influence curves, as in, for example, Boos and Stefanski (2013) and Kennedy (2016) (although the terms ‘influence curve’ and ‘influence function’ are used interchangeably in many cases). Before we go on, we note that the primary sources of this section are Kennedy (2014, 2016, 2020), from which all the terms, definitions and results are directly borrowed.
First, we give a definition of influence curves. It was first introduced by Hampel (1974) and studied to provide a general solution to find approximation-by-averages representation for a functional statistic. We only consider nonparametric models here.
Suppose that we are given a target functional . For a nonparametric model , let , denote a smooth parametric submodel for with . A typical example of this parametric submodel can be given by for some mean-zero, uniformly bounded function . Then the influence curve for parameter is defined by any mean-zero, finite-variance function that satisfies the following pathwise differentiability,
[TABLE]
The above pathwise differentiability implies that our target parameter is smooth enough to admit a von Mises expansion: for two distributions
[TABLE]
where is a second-order remainder. Therefore, the influence curve corresponds to the functional derivative in a von Mises expansion of .
One can obtain the classical Cramér-Rao lower bound for each parametric submodel ; the Cramér-Rao lower bound for is where $\psi^{\prime}(\mathbb{P}_{\epsilon})=\frac{\partial}{\partial\epsilon}\psi(\mathbb{P}_{\epsilon})\big|_{\epsilon=0}$ and $s_{\epsilon}=s_{\epsilon}(z)=\frac{\partial}{\partial\epsilon}\log d\mathbb{P}_{\epsilon}\big|_{\epsilon=0}$. The asymptotic variance of any nonparametric estimator is no smaller than the supremum of the Cramér-Rao lower bounds over all parametric submodels, and it is known that under the above pathwise differentiability condition the greatest such lower bound is given by
[TABLE]
Hence, is the nonparametric analog of the Cramér-Rao lower bound, and we call the influence curve that attains the above bound the efficient influence curve. The efficient influence curve gives the efficiency bound for estimating . In parametric models, more than one influence curve may exist, whereas in a nonparametric model the influence curve is unique. The efficient influence curve is unique in either case.
Once the efficient influence curve is known, no estimator can be more efficient than such that
[TABLE]
as serves to be our nonparametric efficiency bound. In (10), we call the (efficient) influence function for the estimator (in fact, influence curves themselves are the putative influence functions). For each nonparametric estimator, the efficient influence function, if it exists, is almost surely unique, so in this sense the influence function contains all information about an estimator’s asymptotic behavior. In other words, if we know the influence function for an estimator, we know its asymptotic distribution and can easily construct confidence intervals and hypothesis tests.
Characterizing the influence curves is crucial not only to give the efficiency bound for estimating , thus providing a benchmark against which estimators can be compared, but probably more importantly, to construct estimators with very favorable properties, such as double robustness or general second-order bias. One can find an (asymptotically linear) estimator that satisfies (10) by solving appropriate estimating equations using the influence curves. Section F.2 of the appendix contains an example of developing an efficient, model-free estimator based on the efficient influence curve of the target parameter.
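As a minimal illustration of how an influence function yields confidence intervals, the sketch below treats the simplest possible case: the sample mean, whose influence function is the observation minus the mean. The variance of the estimator is approximated by the empirical second moment of the estimated influence function divided by the sample size. The function name and the data are illustrative assumptions, and this is not the estimator proposed in the paper.

```python
import math
import statistics


def wald_interval_from_influence(values, z=1.96):
    """Point estimate and Wald interval for a mean, built from its influence function."""
    n = len(values)
    psi_hat = statistics.fmean(values)
    # Estimated influence function values: phi(Z_i) = Z_i - psi_hat.
    influence = [v - psi_hat for v in values]
    # Empirical variance of the influence function (its mean is zero by construction).
    var_hat = sum(phi * phi for phi in influence) / n
    half_width = z * math.sqrt(var_hat / n)
    return psi_hat, psi_hat - half_width, psi_hat + half_width


if __name__ == "__main__":
    estimate, lower, upper = wald_interval_from_influence([1.0, 2.0, 3.0, 4.0, 5.0])
    print(f"estimate={estimate:.3f}, interval=({lower:.3f}, {upper:.3f})")
```

The same recipe applies to any asymptotically linear estimator once its influence function values can be estimated for each observation.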
Finally we remark that for complicated functionals, pretending discrete space on can facilitate our procedure to characterize influence curves. For example, assuming that our unit space is discrete, the influence curve for the functional can be defined by
[TABLE]
where we let be the Dirac measure at . This definition is equivalent to the Gateaux derivative of at in direction of point mass (see, for example, Chapter 5 in Boos and Stefanski (2013)).
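The point-mass derivative described above can be checked numerically on a small discrete distribution. The sketch below uses the variance functional as a stand-in example (an assumption for illustration, not a functional studied in the paper): it perturbs the distribution toward a point mass, takes a finite-difference derivative, and compares the result with the known influence curve of the variance, namely the squared deviation from the mean minus the variance. All names and numbers are hypothetical.

```python
def variance_functional(support, probs):
    """Variance of a discrete distribution with the given support and probabilities."""
    mean = sum(z * p for z, p in zip(support, probs))
    return sum((z - mean) ** 2 * p for z, p in zip(support, probs))


def gateaux_derivative(functional, support, probs, index, eps=1e-6):
    """Finite-difference derivative of functional((1 - eps) * P + eps * point mass)."""
    mixed = [(1.0 - eps) * p + (eps if k == index else 0.0)
             for k, p in enumerate(probs)]
    return (functional(support, mixed) - functional(support, probs)) / eps


def analytic_influence_curve(support, probs, index):
    """Known influence curve of the variance: (z - mean)^2 - variance."""
    mean = sum(z * p for z, p in zip(support, probs))
    return (support[index] - mean) ** 2 - variance_functional(support, probs)


if __name__ == "__main__":
    support, probs = [0.0, 1.0, 2.0], [0.2, 0.5, 0.3]
    for k, z in enumerate(support):
        numeric = gateaux_derivative(variance_functional, support, probs, k)
        exact = analytic_influence_curve(support, probs, k)
        print(f"z={z}: numeric={numeric:.5f}, analytic={exact:.5f}")
```

The two columns agree up to the finite-difference error, and the analytic values average to zero under the original distribution, as an influence curve must.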
For more details on nonparametric efficiency theory and influence functions, we refer to Kennedy (2014, 2016, 2020); van der Laan and Robins (2003); Tsiatis (2006).
Appendix E Additional Technical Results
E.1 Sequential regression formulation
The efficient influence function derived in the previous subsection involves pseudo-regression functions . To avoid complicated conditional density estimation, as suggested by Kennedy (2019), one may formulate a series of sequential regressions for , as described in the subsequent remark.
Remark 6.
From the definition of , it immediately follows that
[TABLE]
Hence, we can find an equivalent form of the functions in Theorem 4.1 as the following recursive regression:
[TABLE]
for , where we use shorthand notation and .
The above sequential regression form is practically useful since it allows us to bypass all the conditional density estimations and instead use regression methods that are more readily available in statistical software.
E.2 EIF for
In the next corollary we provide the efficient influence function for the incremental effect for a single timepoint study () whose identifying expression is given in Corollary 3.1.
Corollary E.1.
When , the efficient influence function for in Corollary 3.1 is given by
[TABLE]
where
[TABLE]
[TABLE]
[TABLE]
and
[TABLE]
which is the uncentered efficient influence function for .
The efficient influence function for the point exposure case has a simpler and more intuitive form. As stated in Corollary E.1, it is a weighted average of the two efficient influence functions , plus a contribution term due to unknown propensity scores.
Appendix F Proofs
F.1 Lemma for the identifying expression in Theorem 3.1
Without assumptions (A2-M) and (A3), our target parameter would not be identified. The following lemma extends Theorem 1 in Kennedy (2019) to our setting.
Lemma F.1.
Under (A2-M) and (A3), and for all , we have the following identities:
- a.
- b.
- c.
Proof.
- a.
By abuse of notation, for , here we let , represent , respectively, and represent . First note that
[TABLE]
where the first equality follows by definition, the second by definition of conditional probability, the third by assumption (A2-M), the fourth again by definition of conditional probability, the fifth by assumption (A2-M), and the sixth by repeating the same step times. The last expression is obtained by simply rearranging terms using the definition of conditional probability.
Now we let
[TABLE]
so we can write .
Then, similarly we have
[TABLE]
Hence, finally we obtain
[TABLE]
where the second equality comes from the above results. The proof naturally leads to .
- b.
By definition , and from the part a) it immediately follows
[TABLE]
Hence, we have
[TABLE]
which yields the desired result.
- c.
By definition and thereby it suffices to show that .
By the same logic we used for the first proof, we have
[TABLE]
and also
[TABLE]
Hence, by Assumption (A2-M) we have that
[TABLE]
∎
Following the exact same logic used in the proof of Kennedy (2019, Theorem 1), under Assumptions A1 and A2-E, for all we have the recursion formula
[TABLE]
Applying the above times leads to
[TABLE]
Finally, Assumption A1 and Lemma F.1 give
[TABLE]
F.2 Proof of Theorem 4.1
F.2.1 Identifying expression for the efficient influence function
In the next lemma, we provide an identifying expression for the efficient influence function for our incremental effect under a nonparametric model, which allows the data-generating process to be infinite-dimensional.
Lemma F.2.
Define
[TABLE]
for , , where we write and . For and , we set and . Moreover, let denote the efficient influence function for .
Then, the efficient influence function for is given by
[TABLE]
where we define , , and , and is a dominating measure for the distribution of .
The proof of Lemma F.2 involves deriving the efficient influence function for more general stochastic interventions that depend on both the observational propensity scores and the right-censoring process. We begin by presenting the following three additional lemmas.
Lemma F.3** (Kennedy (2019)).**
For , the efficient influence function for
[TABLE]
which is defined in (2) is given by , where equals
[TABLE]
where .
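For intuition about the incremental intervention distribution referenced in Lemma F.3, the following is a minimal Python sketch. It assumes a binary treatment and the propensity-score shift of Kennedy (2019), q = δπ / (δπ + 1 − π), which multiplies the odds of treatment by δ. The function name `incremental_propensity` and the example values are hypothetical, and the sketch ignores the dropout indicator.

```python
import numpy as np

def incremental_propensity(pi, delta):
    """Shifted propensity score q = delta*pi / (delta*pi + 1 - pi).

    Assumption: binary treatment, pi strictly between 0 and 1, delta > 0.
    """
    pi = np.asarray(pi, dtype=float)
    return delta * pi / (delta * pi + 1.0 - pi)

def odds(p):
    """Odds p / (1 - p) of a probability p."""
    return p / (1.0 - p)

# Hypothetical observational propensity scores and increment parameter.
pi = np.array([0.05, 0.3, 0.5, 0.9])
delta = 2.5

q = incremental_propensity(pi, delta)
# The shift multiplies the odds of treatment by delta at every value of pi.
odds_ratio = odds(q) / odds(pi)
print(q)
print(odds_ratio)
```

Setting `delta = 1` returns the observational propensity score unchanged, and the shifted score stays inside (0, 1) even when `pi` is close to the boundary, which is why no positivity condition on the treatment is needed for the intervention to be well defined.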
Lemma F.4**.**
Suppose does not depend on . Recall that for ,
[TABLE]
for , where we write and . Note that from the definition of it immediately follows .
Now the efficient influence function for is
[TABLE]
where we define , , and .
Lemma F.5**.**
Suppose depends on and let denote the efficient influence function for defined in Lemma F.3 for all . Then the efficient influence function for is given as
[TABLE]
where is the efficient influence function from Lemma F.4 and is a dominating measure for the distribution of .
The proofs of Lemmas F.3, F.4, and F.5 are essentially repeated applications of the chain rule, after specifying efficient influence functions for the terms that appear repeatedly. We provide a brief sketch of the proofs of Lemmas F.4 and F.5 below, which can easily be extended to a full proof. The sketch could also be useful for developing results for more general stochastic interventions.
Proof of Lemma F.4 and Lemma F.5
Let denote a map to the efficient influence function for a functional . First, without proof, we specify the efficient influence functions for the mean and the conditional mean, which serve as the two basic ingredients of our proof. For the mean of a random variable , we have
[TABLE]
and for the conditional mean with a pair of random variables when is discrete, we have
[TABLE]
These results can be obtained by applying (8) or (11).
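As a numerical illustration of these two building blocks, the sketch below evaluates the standard influence functions at the empirical distribution: X − E[X] for the mean, and 1(X = x) / P(X = x) · {Y − E[Y | X = x]} for the conditional mean with discrete X. These are the textbook forms, matching the indicator-over-density pattern used in parts A) to D) below; the simulated data and the helper names `if_mean` and `if_cond_mean` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.integers(0, 3, size=n)          # discrete covariate with 3 levels
y = 1.0 + 0.5 * x + rng.normal(size=n)  # outcome depending on x

def if_mean(v):
    """Influence function of the mean functional E[V], at the empirical law."""
    return v - v.mean()

def if_cond_mean(y, x, x0):
    """Influence function of E[Y | X = x0] for discrete X, at the empirical law."""
    ind = (x == x0).astype(float)
    p_x0 = ind.mean()            # empirical P(X = x0)
    mu_x0 = y[x == x0].mean()    # empirical E[Y | X = x0]
    return ind / p_x0 * (y - mu_x0)

phi_mean = if_mean(y)
phi_cond = if_cond_mean(y, x, x0=1)

# Both influence functions are centered: their sample means vanish
# (up to floating-point error) when evaluated at the empirical law.
print(phi_mean.mean(), phi_cond.mean())
```

Centering at the plug-in values is what makes both sample means vanish here; under the true distribution the same functions have mean zero in expectation.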
Proof.
It suffices to prove the result for , since the extension to arbitrary follows by induction. For , it is enough to compute the following four terms.
- A)
\begin{aligned} &\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mathcal{IF}\Big{(}\mu(h_{2},a_{2},R_{3}=1)\Big{)}\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)d\mathbb{P}(y_{s-1},x_{s}|h_{s-1},a_{s-1},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\frac{\mathbbm{1}\{(H_{2},A_{2},R_{3})=(h_{2},a_{2},1)\}}{d\mathbb{P}(h_{2},a_{2},R_{3}=1)}\Big{\{}Y-\mu(h_{2},a_{2},R_{3}=1)\Big{\}}\\ &\qquad\qquad\ \times\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)d\mathbb{P}(y_{s-1},x_{s}|h_{s-1},a_{s-1},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mathbbm{1}\big{\{}(H_{2},A_{2},R_{3})=(h_{2},a_{2},1)\big{\}}\big{\{}Y-\mu(h_{2},a_{2},R_{3}=1)\big{\}}\\ &\qquad\qquad\ \times\prod_{s=1}^{2}\frac{dQ_{s}(a_{s}\mid h_{s},R_{s}=1)}{d\mathbb{P}(a_{s}\mid h_{s},R_{s}=1)}\frac{1}{d\mathbb{P}(R_{s+1}=1\mid h_{s},a_{s},R_{s}=1)}\\ &=\{Y-\mu(H_{2},A_{2},R_{3}=1)\}\mathbbm{1}(R_{3}=1)\prod_{s=1}^{2}\frac{dQ_{s}(A_{s}\mid H_{s},R_{s}=1)}{d\mathbb{P}(A_{s}\mid H_{s},R_{s}=1)}\frac{1}{d\mathbb{P}(R_{s+1}=1\mid H_{s},A_{s},R_{s}=1)}\end{aligned}
- B)
\begin{aligned} &\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)\mathcal{IF}\Big{(}d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\Big{)}d\mathbb{P}(h_{1})\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)\frac{\mathbbm{1}\big{\{}(H_{1},A_{1},R_{2})=(h_{1},a_{1},1)\big{\}}}{d\mathbb{P}(h_{1},a_{1},R_{2}=1)}\\ &\qquad\qquad\ \times\Big{\{}\mathbbm{1}(Y_{1}=y_{1},X_{2}=x_{2})-d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\Big{\}}d\mathbb{P}(h_{1})\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)\\ &\qquad\qquad\ \times\frac{\mathbbm{1}\big{\{}(H_{1},A_{1},R_{2})=(h_{1},a_{1},1)\big{\}}\big{\{}\mathbbm{1}(Y_{1}=y_{1},X_{2}=x_{2})-d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\big{\}}}{d\mathbb{P}(R_{2}=1|h_{1},a_{1})d\mathbb{P}(a_{1}|h_{1})d\mathbb{P}(h_{1})}\\ &\qquad\qquad\ \times d\mathbb{P}(h_{1})\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid h_{2},R_{2}=1)\mathbbm{1}\big{\{}(H_{1},A_{1},R_{2})=(h_{1},a_{1},1)\big{\}}\\ &\qquad\qquad\ \times\big{\{}\mathbbm{1}(Y_{1}=y_{1},X_{2}=x_{2})-d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\big{\}}\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\\ &=\Bigg{\{}\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{2}}\mu(H_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid H_{2},R_{2}=1)\\ &\qquad-\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{1}\times\mathcal{A}_{1}}\mu(h_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid h_{2},R_{2}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\Bigg{\}}\\ &\qquad\times\mathbbm{1}(R_{2}=1)\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\\ &=\Bigg{\{}\int_{\mathcal{A}_{2}}\mu(H_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid 
H_{2},R_{2}=1)-m_{1}(h_{1},a_{1},R_{2}=1)\Bigg{\}}\\ &\qquad\times\mathbbm{1}(R_{2}=1)\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\\ \end{aligned}
- C)
\begin{aligned} &\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\mathcal{IF}\Big{(}d\mathbb{P}(h_{1})\Big{)}\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\big{\{}\mathbbm{1}(X_{1}=x_{1})-d\mathbb{P}(x_{1})\big{\}}\prod_{s=1}^{2}dQ_{s}(a_{s}\mid h_{s},R_{s}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{1}}\mu(h_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid h_{2},R_{2}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)dQ_{1}(a_{1}|h_{1})-m_{0}\\ &=\int_{\mathcal{A}_{1}}m_{1}(h_{1},a_{1},R_{2}=1)dQ_{1}(a_{1}|h_{1})-m_{0}\\ \end{aligned}
- D)
Let denote the efficient influence function for as given in Lemma F.3. Then we have
\begin{aligned} &\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(h_{1})d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\mathcal{IF}\Big{(}dQ_{1}(a_{1}|h_{1})dQ_{2}(a_{2}\mid h_{2},R_{2}=1)\Big{)}\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(h_{1})d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\frac{\mathbbm{1}\big{\{}(H_{2},R_{2})=(h_{2},1)\big{\}}}{d\mathbb{P}(h_{2},R_{2}=1)}\phi_{2}dQ_{1}(a_{1}|h_{1})\\ &\quad+\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(h_{1})d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\frac{\mathbbm{1}\big{\{}(H_{1}=h_{1})\big{\}}}{d\mathbb{P}(h_{1})}\phi_{1}dQ_{2}(a_{2}\mid h_{2},R_{2}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)\frac{\mathbbm{1}\big{\{}(H_{2},R_{2})=(h_{2},1)\big{\}}d\mathbb{P}(h_{1})d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)dQ_{1}(a_{1}|h_{1})}{d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)d\mathbb{P}(R_{2}=1|h_{1},a_{1})d\mathbb{P}(a_{1}|h_{1})d\mathbb{P}(h_{1})}\phi_{2}\\ &\quad+\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}}\mu(h_{2},a_{2},R_{3}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\mathbbm{1}\big{\{}(H_{1}=h_{1})\big{\}}\phi_{1}dQ_{2}(a_{2}\mid h_{2},R_{2}=1)\\ &=\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{2}}\mu(H_{2},a_{2},R_{3}=1)\mathbbm{1}(R_{2}=1)\phi_{2}\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\\ &\quad+\int_{\mathcal{H}_{2}\times\mathcal{A}_{2}\setminus\mathcal{H}_{1}}\mu(h_{2},a_{2},R_{3}=1)dQ_{2}(a_{2}\mid h_{2},R_{2}=1)d\mathbb{P}(y_{1},x_{2}|h_{1},a_{1},R_{2}=1)\phi_{1}\\ &=\left\{\frac{dQ_{1}(A_{1}\mid H_{1})}{d\mathbb{P}(A_{1}\mid H_{1})}\frac{1}{d\mathbb{P}(R_{2}=1\mid H_{1},A_{1})}\right\}\int_{\mathcal{A}_{2}}\mu(H_{2},a_{2},R_{3}=1)\phi_{2}d\nu(a_{2})\mathbbm{1}(R_{2}=1)\\ &\quad+\int_{\mathcal{A}_{1}}m_{1}(h_{1},a_{1},R_{2}=1)\phi_{1}d\nu(a_{1})\end{aligned}
Note that we have set , and that we have and by construction. Hence, putting parts A), B), and C) together proves Lemma F.4, and part D) proves Lemma F.5.
Note that to formally verify that the given expressions in Lemmas F.4 and F.5 are the efficient influence functions, we would need to check that the pathwise differentiability formula (8) holds. This essentially follows if the remainder terms are second-order, which will be verified in Lemmas F.7 and F.8 later. ∎
Finally, we are ready to give a proof of Theorem 4.1. In fact, the proof amounts to rearranging terms in the given efficient influence function.
F.2.2 Proof of Theorem 4.1
Proof.
First, we define the following shorthand notation for the proof: for
[TABLE]
[TABLE]
[TABLE]
With these notations we can rewrite the result of Lemma F.4 as below.
[TABLE]
Now, by Lemmas F.4 and F.5, we can represent the efficient influence function for as
[TABLE]
On the other hand, we have
[TABLE]
[TABLE]
[TABLE]
[TABLE]
which leads to
[TABLE]
After some rearrangement, we finally obtain an equivalent form of the efficient influence function for by
[TABLE]
Note that we use the convention that and . ∎
F.3 Proof of Theorem 6.1
Let denote the standard IPW estimator of a classical deterministic intervention effect under assumption, i.e.
[TABLE]
Hence is equivalent to in the main text. Now by definition we have
[TABLE]
where and are simply the first and second term in the first line of the expansion respectively.
By the same procedure used to derive the g-formula (Robins, 1986), it is easy to see
[TABLE]
where . The above result follows by iterated expectation conditioning on , then another iterated expectation conditioning on , followed by the fact that \mathbb{E}\left[\frac{\mathbbm{1}\left({A}_{t}={a^{\prime}}_{t}\right)}{\pi_{t}(a^{\prime}_{t}|H_{t})}\big{|}H_{t}\right]=1 for all . We repeat this process times, starting from and proceeding all the way through .
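The identity that the inverse-probability weight has conditional mean one given the history can be checked by simulation. The sketch below is only illustrative: it assumes a single timepoint, a binary treatment, and a hypothetical discrete history `h` with four levels and made-up propensity scores.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical discrete history H with four levels and propensities P(A=1 | H).
h = rng.integers(0, 4, size=n)
pi1 = np.array([0.2, 0.4, 0.6, 0.8])[h]
a = rng.binomial(1, pi1)

# Target treatment value a' and its conditional probability pi(a' | H).
a_target = 1
pi_target = pi1 if a_target == 1 else 1.0 - pi1

# Inverse-probability weight 1(A = a') / pi(a' | H).
w = (a == a_target) / pi_target

# Conditional mean of the weight within each level of H; each should be near 1.
cond_means = np.array([w[h == k].mean() for k in range(4)])
print(cond_means)
```

The weights are most variable where the propensity for the target value is smallest (here the first level of `h`), which is the same mechanism that drives the variance of the classical IPW estimator as the number of timepoints grows.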
Likewise, for we have
[TABLE]
For the first term , observe that
[TABLE]
where we apply the law of total expectation in the first equality and the law of total probability in the second.
After repeating the same process times, for , we obtain terms in the end, each of which corresponds to a distinct treatment sequence . Hence, we eventually have
[TABLE]
Recall that we assume for all as stated in Theorem 6.1. Hence we can write as .
We want to find an upper bound of the variance ratio for the always-treated unit (i.e., ). This can be done by computing the quantity
[TABLE]
since by Jensen’s inequality.
Note that we have
[TABLE]
, and under the given boundedness assumption we see that the ratio of the second term to the first quickly (at least exponentially) becomes negligible as increases. Hence we can write
[TABLE]
for some constant such that \frac{1}{1-\mathbb{V}_{c.ipw.2}(\overline{\bm{1}})/\mathbb{V}_{c.ipw.1}(\overline{\bm{1}})}=\frac{1}{1-p^{T}{\left(\mathbb{E}\left[Y^{\overline{\bm{1}}}\right]\right)^{2}}\big{/}{\mathbb{E}\left[\left(Y^{\overline{\bm{1}}}\right)^{2}\right]}}\leq{c}. Note that in our setting in which we have an infinitely large value of , can be almost any constant greater than one.
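To see how quickly the quantity bounded by the constant above approaches one, the sketch below evaluates 1 / (1 − p^T · (E[Y])² / E[Y²]) for increasing T, reading Y as the counterfactual outcome under the always-treated sequence. The function name `variance_inflation_constant` and the numerical values of p and the two moments are hypothetical choices for illustration only.

```python
def variance_inflation_constant(p, T, mean_y, second_moment_y):
    """Evaluate 1 / (1 - p**T * mean_y**2 / second_moment_y).

    Assumptions: 0 < p < 1 and second_moment_y >= mean_y**2 (Jensen),
    so the ratio inside is strictly below one.
    """
    ratio = p**T * mean_y**2 / second_moment_y
    return 1.0 / (1.0 - ratio)

# Hypothetical values: constant propensity p = 0.5, E[Y] = 1, E[Y^2] = 2.
horizons = (1, 5, 10, 20)
constants = [variance_inflation_constant(0.5, T, 1.0, 2.0) for T in horizons]

for T, c in zip(horizons, constants):
    print(T, c)
```

Because p^T shrinks geometrically, the value drops toward one within a handful of timepoints, consistent with the remark that the constant can be taken to be almost any number greater than one when T is large.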
Putting the above ingredients together, for sufficiently large it follows that
[TABLE]
where we have
[TABLE]
where the first equality follows by the fact that derived in the proof of the first part, the second equality by the fact that , the first inequality by definition of and the given boundedness assumption, and the last equality by the binomial theorem. Therefore we obtain the upper bound as
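The binomial-theorem step can be illustrated by brute-force enumeration over all 2^T treatment sequences. The sketch below rests on assumptions that are not taken from the displays above: treatments are independent Bernoulli(p) across timepoints, and a sequence with k treated timepoints receives the incremental weight δ^k / (δp + 1 − p)^T, following the propensity-shift construction of Kennedy (2019). Under those assumptions the second moment of the weight collapses to ((δ²p + 1 − p) / (δp + 1 − p)²)^T. All function names are hypothetical.

```python
import itertools
import math

def weight_moment_by_enumeration(delta, p, T, power):
    """Sum over all 2**T binary sequences of P(sequence) * weight**power.

    Assumed weight for a sequence with k treated timepoints:
    delta**k / (delta*p + 1 - p)**T, with constant propensity p.
    """
    total = 0.0
    for seq in itertools.product((0, 1), repeat=T):
        k = sum(seq)
        prob = p**k * (1 - p) ** (T - k)
        weight = delta**k / (delta * p + 1 - p) ** T
        total += prob * weight**power
    return total

def second_moment_closed_form(delta, p, T):
    """Closed form obtained from the binomial theorem."""
    return ((delta**2 * p + 1 - p) / (delta * p + 1 - p) ** 2) ** T

delta, p, T = 2.0, 0.3, 6
enumerated = weight_moment_by_enumeration(delta, p, T, power=2)
closed = second_moment_closed_form(delta, p, T)
first_moment = weight_moment_by_enumeration(delta, p, T, power=1)
print(enumerated, closed, first_moment)
```

Grouping the 2^T sequences by their number of treated timepoints k is what turns the sum into a binomial expansion; the same grouping with `power=1` shows that the assumed weights average to one.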
[TABLE]
Next for the lower bound, first we note that
[TABLE]
where the first equality follows by definition, the second equality by exactly the same process used to find the expression for , the first inequality by the boundedness assumption, and the third equality by the binomial theorem.
However, we already know that
[TABLE]
Hence putting these together we conclude
[TABLE]
At this point, we have obtained upper and lower bounds for , which yields the result of part having .
The proof for the case of (never-treated unit) follows almost the same steps as the case of , except for the rearrangement of terms due to replacing by and so on. In fact, due to the generality of our proof structure, the exact same logic used for also applies to (and for ). We present the result below without proof.
[TABLE]
where we define and .
F.4 Proof of Corollary 6.1
We now provide Lemma F.6, which is key to proving Corollary 6.1.
Lemma F.6**.**
Assume that for all for . Then we have the following variance decomposition:
[TABLE]
where for the weight is defined by
[TABLE]
Proof.
From the last display for , we have that
[TABLE]
where we let weight denote the product term .
Next, we observe that
[TABLE]
where we have decomposed into terms by defining by
[TABLE]
Then for fixed it is straightforward to see that
[TABLE]
Now putting this together, we obtain
[TABLE]
However, from the second term in the last display one could notice that
[TABLE]
where the last equality follows by the fact that
[TABLE]
Hence finally we conclude that
[TABLE]
∎
Note that in Lemma F.6 the weight decays exponentially and monotonically to zero for .
Now we show that there always exists such that for all . Let . From Lemma F.6 it follows that
[TABLE]
where , A(\delta,p)=\sum_{\overline{a}_{T},\overline{a^{\prime}}_{T}\in\overline{\mathcal{A}}_{T}}\sqrt{w(\overline{a}_{T};\delta,p)w(\overline{a^{\prime}}_{T};\delta,p)}\frac{\mathbb{E}\left[Y^{\overline{a}_{T}}\right]}{b_{u}}\frac{\mathbb{E}\big{[}Y^{\overline{a^{\prime}}_{T}}\big{]}}{b_{u}}, and B=\frac{\left(\mathbb{E}\big{[}Y^{{\overline{\bm{1}}}}\big{]}\right)^{2}}{b^{2}_{u}}. We note that , , and as .
For , . Hence, based on the above observation, it follows that for sufficiently large the last display is strictly less than zero. Consequently we conclude for all , which is the result of part . Likewise, we have the same conclusion for such that .
The value of is determined by , and the distribution of the counterfactual outcome . One rough upper bound of such is
[TABLE]
which could be obtained by the last display above and is always finite due to the fact by given assumption in the theorem. should not be very large for moderately large value of unless is unreasonably small since the difference also grows exponentially.
F.5 Proof of Theorem 5.1
First we need to define the following notation:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
where we let , let denote the empirical process on the full sample as usual, and let and let be a mean-zero Gaussian process with covariance as defined in Theorem 5.1 in the main text.
The proof consists of two parts; in the first part we will show in and in the second we will show .
Part 1. A proof of the first statement follows immediately from the proof of Theorem 3 in Kennedy (2019), who showed that the function class is Lipschitz and thus has a finite bracketing integral for any fixed set of nuisance functions. Then Theorem 2.5.6 in van der Vaart and Wellner (1996) gives the result. In our case, the function class is still Lipschitz, since for we have
[TABLE]
[TABLE]
[TABLE]
where we use assumptions 1) and 2) in the Theorem, and the identification assumption (A3) that there exists a constant such that and thus a.e. []. Therefore, every is a finite sum of products of Lipschitz functions with bounded , and we thus conclude is Lipschitz. Hence our function class still has a finite bracketing integral for fixed and , which completes the first part of our proof.
Part 2. Let be the sample size in any group , and denote the empirical process over group k units by . From the result of Part 1 and the proof of Theorem 3 in Kennedy (2019) we have
[TABLE]
Now we analyze two pieces and in the last display. follows by the exact same steps done by Kennedy (2019). However, analysis on requires extra work.
To analyze , we use the same notation used in Kennedy (2019). First let denote the mean outcome under intervention for a population corresponding to observed data distribution . Next, let denote its centered efficient influence function when does not depend on , as given in Lemma F.4 and let denote the contribution to the efficient influence function due to estimating when it depends on , as given in Lemma F.5. Now by definition,
[TABLE]
and after some rearrangement we obtain
[TABLE]
Although one can relate to in above equation, it can be anything associated with new and .
Hence, by analyzing the second-order remainder terms of the von Mises expansion for the efficient influence functions given in Lemmas F.4 and F.5, we can evaluate the convergence rate of . The following two lemmas analyze those second-order remainder terms in the presence of the dropout process.
Lemma F.7**.**
Let be the mean outcome under intervention for a population corresponding to the observed data distribution , and let denote its efficient influence function when does not depend on for given , as given in Lemma F.4. For another data distribution , let denote the corresponding nuisance functions. Then we have the first-order von Mises expansion
[TABLE]
where we define
[TABLE]
[TABLE]
[TABLE]
Proof.
From Lemma F.4, we have
[TABLE]
where the first equality follows by the definition and linearity of expectation, the second by iterated expectation and the equivalence between and (since the event implies for all by construction), the third by the law of total probability on conditional expectation (for random variables , when is discrete it follows ), and the fourth by the result of Lemma F.1 (i.e., ). To obtain the last equality, we first apply iterated expectation conditioning on , then another iterated expectation conditioning on , followed by the same steps as in the second, third, and fourth equalities, and repeat this process for .
From the last expression, now we have
[TABLE]
Note that we use the convention from earlier lemmas that all the quantities with negative times (e.g., ) are set to one. After applying the above process times to the second-to-last term in the last display, we obtain that
[TABLE]
By Lemma 5 in Kennedy (2019) it follows
[TABLE]
Putting all these together, we have
[TABLE]
, which yields the formula we have in Lemma F.7. ∎
Lemma F.8**.**
Let denote the contribution to the efficient influence function due to dependence between and as given in Lemma F.5. Then for two different intervention distributions and whose corresponding densities are and respectively with respect to some dominating measure for , we have the first-order von Mises expansion
[TABLE]
where we define all the notation in the same way as in Lemma F.7.
Proof.
From Lemma 6 in Kennedy (2019) and by Lemma F.1, we have
[TABLE]
Next, for the expected contribution to the influence function due to estimating when it depends on , we have that
[TABLE]
where the first equality follows by definition, the second by iterated expectation conditioning on and averaging over , the third by iterated expectation conditioning on and the law of total probability, and the fifth by repeating the process times.
Now, we further expand our last expression as
[TABLE]
where the first equality follows by adding and subtracting the second term, and the second by the same steps used in Lemma F.7.
With the last term in the last expression above, it follows
[TABLE]
Putting these all together, finally we have
[TABLE]
∎
Finally, the next lemma completes the proof of Theorem 5.1.
Lemma F.9**.**
The remainders of the von Mises expansions from Lemmas F.7 and F.8 both diminish at a rate of uniformly in , if
[TABLE]
for .
Proof.
The remainder term of the von Mises expansion from Lemma F.7 equals
[TABLE]
where we obtain the first inequality simply by adding and subtracting .
For the remainder term from Lemma F.8, first note that by Lemma F.1 and Lemma 6 of Kennedy (2019),
[TABLE]
[TABLE]
Hence, it immediately follows that the remainder term in Lemma F.8 can be bounded by
[TABLE]
Therefore, if we have
[TABLE]
then both of the remainders from Lemmas F.7 and F.8 diminish at a rate of uniformly in . ∎
References
- [1] Kumar et al. (2013) Santosh Kumar, Wendy J Nilsen, Amy Abernethy, Audie Atienza, Kevin Patrick, Misha Pavel, William T Riley, Albert Shar, Bonnie Spring, Donna Spruijt-Metz, et al. Mobile health technology evaluation: the mHealth evidence workshop. American Journal of Preventive Medicine, 45(2):228–236, 2013.
- [2] Eysenbach et al. (2011) Gunther Eysenbach, CONSORT-EHEALTH Group, et al. CONSORT-EHEALTH: improving and standardizing evaluation reports of web-based and mobile health interventions. Journal of Medical Internet Research, 13(4), 2011.
- [3] Klasnja et al. (2015) Predrag Klasnja, Eric B Hekler, Saul Shiffman, Audrey Boruvka, Daniel Almirall, Ambuj Tewari, and Susan A Murphy. Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology, 34(S):1220, 2015.
- [4] Robins (1986) James Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393–1512, 1986.
- [5] Robins et al. (2000) James M Robins, Miguel Angel Hernan, and Babette Brumback. Marginal structural models and causal inference in epidemiology, 2000.
- [6] Hernán et al. (2000) Miguel Ángel Hernán, Babette Brumback, and James M Robins. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology, pages 561–570, 2000.
- [7] Murphy et al. (2001) Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.
- [8] Robins (2004) James M Robins. Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics, pages 189–326. Springer, 2004.
