Nonparametric causal effects based on incremental propensity score   interventions

Edward H. Kennedy

arXiv:1704.00211·stat.ME·June 20, 2018

Nonparametric causal effects based on incremental propensity score interventions

Edward H. Kennedy

PDF

Open Access

TL;DR

This paper introduces incremental propensity score interventions in causal inference, which shift treatment probabilities instead of fixing treatments, avoiding positivity violations and parametric assumptions, and enabling efficient, flexible longitudinal effect estimation.

Contribution

It proposes a novel incremental intervention framework that circumvents positivity issues, requires no parametric models, and simplifies longitudinal effect visualization and estimation.

Findings

01

Incremental interventions avoid positivity violations.

02

Efficient nonparametric estimators with fast convergence are developed.

03

Application to sociological effects of incarceration demonstrates practical utility.

Abstract

Most work in causal inference considers deterministic interventions that set each unit's treatment to some fixed value. However, under positivity violations these interventions can lead to non-identification, inefficiency, and effects with little practical relevance. Further, corresponding effects in longitudinal studies are highly sensitive to the curse of dimensionality, resulting in widespread use of unrealistic parametric models. We propose a novel solution to these problems: incremental interventions that shift propensity score values rather than set treatments to fixed values. Incremental interventions have several crucial advantages. First, they avoid positivity assumptions entirely. Second, they require no parametric assumptions and yet still admit a simple characterization of longitudinal effects, independent of the number of timepoints. For example, they allow longitudinal…

Tables1

Table 1. Table 1: Coverage of proposed uniform 95% confidence band across 500 simulations.

Sample	Coverage (%) for setting:
size $n$	Cor P	Mis P	Cor NP	Mis NP
500	92.4	77.0	93.0	88.0
1000	95.2	67.6	95.6	92.4
5000	94.8	12.4	94.2	89.4

Equations209

Z = (X_{1}, A_{1}, X_{2}, A_{2}, ..., X_{T}, A_{T}, Y)

Z = (X_{1}, A_{1}, X_{2}, A_{2}, ..., X_{T}, A_{T}, Y)

q_{t} (h_{t}; δ, π_{t}) = \frac{δ π _{t} ( h _{t} )}{δ π _{t} ( h _{t} ) + 1 - π _{t} ( h _{t} )} for 0 < δ < \infty.

q_{t} (h_{t}; δ, π_{t}) = \frac{δ π _{t} ( h _{t} )}{δ π _{t} ( h _{t} ) + 1 - π _{t} ( h _{t} )} for 0 < δ < \infty.

δ = \frac{q _{t} ( h _{t} ) / { 1 - q _{t} ( h _{t} )}}{π _{t} ( h _{t} ) / { 1 - π _{t} ( h _{t} )}} = \frac{odds _{q} ( A _{t} = 1 ∣ H _{t} = h _{t} )}{odds _{π} ( A _{t} = 1 ∣ H _{t} = h _{t} )}

δ = \frac{q _{t} ( h _{t} ) / { 1 - q _{t} ( h _{t} )}}{π _{t} ( h _{t} ) / { 1 - π _{t} ( h _{t} )}} = \frac{odds _{q} ( A _{t} = 1 ∣ H _{t} = h _{t} )}{odds _{π} ( A _{t} = 1 ∣ H _{t} = h _{t} )}

ψ (δ) = \overline{a}_{T} \in A^{T} \sum X \int μ (h_{T}, a_{T}) t = 1 \prod T \frac{a _{t} δ π _{t} ( h _{t} ) + ( 1 - a _{t} ) { 1 - π _{t} ( h _{t} )}}{δ π _{t} ( h _{t} ) + 1 - π _{t} ( h _{t} )} d P (x_{t} ∣ h_{t - 1}, a_{t - 1})

ψ (δ) = \overline{a}_{T} \in A^{T} \sum X \int μ (h_{T}, a_{T}) t = 1 \prod T \frac{a _{t} δ π _{t} ( h _{t} ) + ( 1 - a _{t} ) { 1 - π _{t} ( h _{t} )}}{δ π _{t} ( h _{t} ) + 1 - π _{t} ( h _{t} )} d P (x_{t} ∣ h_{t - 1}, a_{t - 1})

ψ (δ) = E [\frac{δ π ( X ) μ ( X , 1 ) + { 1 - π ( X )} μ ( X , 0 )}{δ π ( X ) + { 1 - π ( X )}}]

ψ (δ) = E [\frac{δ π ( X ) μ ( X , 1 ) + { 1 - π ( X )} μ ( X , 0 )}{δ π ( X ) + { 1 - π ( X )}}]

t = 1 \sum T

t = 1 \sum T

\times {s = 1 \prod t \frac{( δ A _{s} + 1 - A _{s} )}{δ π _{s} ( H _{s} ) + 1 - π _{s} ( H _{s} )}} + t = 1 \prod T \frac{( δ A _{t} + 1 - A _{t} ) Y}{δ π _{t} ( H _{t} ) + 1 - π _{t} ( H _{t} )} - ψ (δ)

m_{t} (h_{t}, a_{t})

m_{t} (h_{t}, a_{t})

m_{t-1}(\mathbf{H}_{t-1},A_{t-1})=\mathbb{E}\left[\frac{\delta\pi_{t}(\mathbf{H}_{t})m_{t}(\mathbf{H}_{t},1)+\{1-\pi_{t}(\mathbf{H}_{t})\}m_{t}(\mathbf{H}_{t},0)}{\delta\pi_{t}(\mathbf{H}_{t})+1-\pi_{t}(\mathbf{H}_{t})}\Bigm{|}\mathbf{H}_{t-1},A_{t-1}\right]

m_{t-1}(\mathbf{H}_{t-1},A_{t-1})=\mathbb{E}\left[\frac{\delta\pi_{t}(\mathbf{H}_{t})m_{t}(\mathbf{H}_{t},1)+\{1-\pi_{t}(\mathbf{H}_{t})\}m_{t}(\mathbf{H}_{t},0)}{\delta\pi_{t}(\mathbf{H}_{t})+1-\pi_{t}(\mathbf{H}_{t})}\Bigm{|}\mathbf{H}_{t-1},A_{t-1}\right]

\frac{δ π ( X ) ϕ _{1} ( Z ) + { 1 - π ( X )} ϕ _{0} ( Z )}{δ π ( X ) + { 1 - π ( X )}} + \frac{δ γ ( X ) { A - π ( X )}}{{ δ π ( X ) + 1 - π ( X ) } ^{2}} - ψ (δ)

\frac{δ π ( X ) ϕ _{1} ( Z ) + { 1 - π ( X )} ϕ _{0} ( Z )}{δ π ( X ) + { 1 - π ( X )}} + \frac{δ γ ( X ) { A - π ( X )}}{{ δ π ( X ) + 1 - π ( X ) } ^{2}} - ψ (δ)

ϕ_{a} (Z) = \frac{\mathbbold 1 ( A = a )}{π ( a ∣ X )} {Y - μ (X, a)} + μ (X, a)

ϕ_{a} (Z) = \frac{\mathbbold 1 ( A = a )}{π ( a ∣ X )} {Y - μ (X, a)} + μ (X, a)

η = (π, m) = (π_{1}, ..., π_{T}, m_{1}, ..., m_{T}) .

η = (π, m) = (π_{1}, ..., π_{T}, m_{1}, ..., m_{T}) .

\hat{ψ}^{*} (δ) = P_{n} {φ (Z; \hat{η}, δ)}

\hat{ψ}^{*} (δ) = P_{n} {φ (Z; \hat{η}, δ)}

\hat{ψ}_{i pw}^{*} (δ) = P_{n} {t = 1 \prod T \frac{( δ A _{t} + 1 - A _{t} ) Y}{δ π ^ _{t} ( H _{t} ) + 1 - π ^ _{t} ( H _{t} )}} .

\hat{ψ}_{i pw}^{*} (δ) = P_{n} {t = 1 \prod T \frac{( δ A _{t} + 1 - A _{t} ) Y}{δ π ^ _{t} ( H _{t} ) + 1 - π ^ _{t} ( H _{t} )}} .

\hat{ψ} (δ) = \frac{1}{K} k = 1 \sum K P_{n}^{k} {φ (Z; \hat{η}_{- k}, δ)} = P_{n} {φ (Z; \hat{η}_{- S}, δ)}

\hat{ψ} (δ) = \frac{1}{K} k = 1 \sum K P_{n}^{k} {φ (Z; \hat{η}_{- k}, δ)} = P_{n} {φ (Z; \hat{η}_{- S}, δ)}

\frac{ψ ^ ( δ ) - ψ ( δ )}{σ ^ ( δ ) / n} ⇝ G (δ)

\frac{ψ ^ ( δ ) - ψ ( δ )}{σ ^ ( δ ) / n} ⇝ G (δ)

\hat{ψ} (δ) \pm 1.96 \overset{σ}{^} (δ) / n

\hat{ψ} (δ) \pm 1.96 \overset{σ}{^} (δ) / n

P (δ \in D sup \frac{ψ ^ ( δ ) - ψ ( δ )}{σ ^ ( δ ) / n} \leq c_{α}) = 1 - α + o (1),

P (δ \in D sup \frac{ψ ^ ( δ ) - ψ ( δ )}{σ ^ ( δ ) / n} \leq c_{α}) = 1 - α + o (1),

\sqrt{n}\mathbb{P}_{n}\Big{[}\xi\{\varphi(\mathbf{Z};\hat{\boldsymbol{\eta}}_{\text{-}S},\delta)-\hat{\psi}(\delta)\}/\hat{\sigma}(\delta)\Big{]}

\sqrt{n}\mathbb{P}_{n}\Big{[}\xi\{\varphi(\mathbf{Z};\hat{\boldsymbol{\eta}}_{\text{-}S},\delta)-\hat{\psi}(\delta)\}/\hat{\sigma}(\delta)\Big{]}

\mathbb{P}\left(\sup_{\delta\in\mathcal{D}}\left|\sqrt{n}\ \mathbb{P}_{n}\!\left[\xi\left\{\frac{\varphi(\mathbf{Z};\hat{\boldsymbol{\eta}}_{\text{-}S},\delta)-\hat{\psi}(\delta)}{\hat{\sigma}(\delta)}\right\}\right]\right|\geq\hat{c}_{\alpha}\Bigm{|}\mathbf{Z}_{1},...,\mathbf{Z}_{n}\right)=\alpha

\mathbb{P}\left(\sup_{\delta\in\mathcal{D}}\left|\sqrt{n}\ \mathbb{P}_{n}\!\left[\xi\left\{\frac{\varphi(\mathbf{Z};\hat{\boldsymbol{\eta}}_{\text{-}S},\delta)-\hat{\psi}(\delta)}{\hat{\sigma}(\delta)}\right\}\right]\right|\geq\hat{c}_{\alpha}\Bigm{|}\mathbf{Z}_{1},...,\mathbf{Z}_{n}\right)=\alpha

P {\hat{ψ} (δ) - \frac{c ^ _{α} σ ^ ( δ )}{n} \leq ψ (δ) \leq \hat{ψ} (δ) + \frac{c ^ _{α} σ ^ ( δ )}{n}, for all δ \in D} = 1 - α + o (1) .

P {\hat{ψ} (δ) - \frac{c ^ _{α} σ ^ ( δ )}{n} \leq ψ (δ) \leq \hat{ψ} (δ) + \frac{c ^ _{α} σ ^ ( δ )}{n}, for all δ \in D} = 1 - α + o (1) .

H_{0} : ψ (δ) = E (Y) for all δ \in D,

H_{0} : ψ (δ) = E (Y) for all δ \in D,

\hat{p}=\sup\Big{\{}\alpha:\inf_{\delta\in\mathcal{D}}\{\hat{\psi}(\delta)+\hat{c}_{\alpha}\hat{\sigma}(\delta)/\sqrt{n}\}\geq\sup_{\delta\in\mathcal{D}}\{\hat{\psi}(\delta)-\hat{c}_{\alpha}\hat{\sigma}(\delta)/\sqrt{n}\}\Big{\}}.

\hat{p}=\sup\Big{\{}\alpha:\inf_{\delta\in\mathcal{D}}\{\hat{\psi}(\delta)+\hat{c}_{\alpha}\hat{\sigma}(\delta)/\sqrt{n}\}\geq\sup_{\delta\in\mathcal{D}}\{\hat{\psi}(\delta)-\hat{c}_{\alpha}\hat{\sigma}(\delta)/\sqrt{n}\}\Big{\}}.

(X_{1}, X_{2}, X_{3}, X_{4}) \sim N (0, I), P (A = 1 ∣ X) = expit (- X_{1} + 0.5 X_{2} - 0.25 X_{3} - 0.1 X_{4}) (Y ∣ X, A) \sim N {μ (X, A), 1}

(X_{1}, X_{2}, X_{3}, X_{4}) \sim N (0, I), P (A = 1 ∣ X) = expit (- X_{1} + 0.5 X_{2} - 0.25 X_{3} - 0.1 X_{4}) (Y ∣ X, A) \sim N {μ (X, A), 1}

\hat{ψ}_{p i} (δ) = P_{n} [\frac{δ π ^ ( X ) μ ^ ( X , 1 ) + { 1 - π ^ ( X )} μ ^ ( X , 0 )}{δ π ^ ( X ) + 1 - π ^ ( X )}],

\hat{ψ}_{p i} (δ) = P_{n} [\frac{δ π ^ ( X ) μ ^ ( X , 1 ) + { 1 - π ^ ( X )} μ ^ ( X , 0 )}{δ π ^ ( X ) + 1 - π ^ ( X )}],

\widehat{\text{bias}}=\frac{1}{I}\sum_{i=1}^{I}\Big{|}\frac{1}{J}\sum_{j=1}^{J}\hat{\psi}_{j}(\delta_{i})-\psi(\delta_{i})\Big{|}\ ,\ \ \widehat{\text{RMSE}}=\frac{\sqrt{n}}{I}\sum_{i=1}^{I}\left[\frac{1}{J}\sum_{j=1}^{J}\Big{\{}\hat{\psi}_{j}(\delta_{i})-\psi(\delta_{i})\Big{\}}^{2}\right]^{1/2}

\widehat{\text{bias}}=\frac{1}{I}\sum_{i=1}^{I}\Big{|}\frac{1}{J}\sum_{j=1}^{J}\hat{\psi}_{j}(\delta_{i})-\psi(\delta_{i})\Big{|}\ ,\ \ \widehat{\text{RMSE}}=\frac{\sqrt{n}}{I}\sum_{i=1}^{I}\left[\frac{1}{J}\sum_{j=1}^{J}\Big{\{}\hat{\psi}_{j}(\delta_{i})-\psi(\delta_{i})\Big{\}}^{2}\right]^{1/2}

d P (a_{t} ∣ h_{t}) = 0 ⟹ d Q_{t} (a_{t} ∣ h_{t}) = 0

d P (a_{t} ∣ h_{t}) = 0 ⟹ d Q_{t} (a_{t} ∣ h_{t}) = 0

ψ^{*} (Q) = \int_{A} \int_{X} E (Y ∣ \overline{X}_{T} = \overline{x}_{T}, \overline{A}_{T} = \overline{a}_{T}) t = 1 \prod T d Q_{t} (a_{t} ∣ h_{t}) d P (x_{t} ∣ h_{t - 1}, a_{t - 1}),

ψ^{*} (Q) = \int_{A} \int_{X} E (Y ∣ \overline{X}_{T} = \overline{x}_{T}, \overline{A}_{T} = \overline{a}_{T}) t = 1 \prod T d Q_{t} (a_{t} ∣ h_{t}) d P (x_{t} ∣ h_{t - 1}, a_{t - 1}),

E (Y^{(\overline{a}_{t - 1}, \underline{Q}_{t})}

E (Y^{(\overline{a}_{t - 1}, \underline{Q}_{t})}

= \int E (Y^{(\overline{a}_{t - 1}, q_{t}, \underline{Q}_{t + 1})} ∣ H_{t} = h_{t}, Q_{t} = q_{t}) d Q_{t} (q_{t} ∣ h_{t}) d P (x_{t} ∣ h_{t - 1}, a_{t - 1})

= \int E (Y^{(\overline{a}_{t - 1}, q_{t}, \underline{Q}_{t + 1})} ∣ H_{t} = h_{t}, A_{t} = q_{t}) d Q_{t} (q_{t} ∣ h_{t}) d P (x_{t} ∣ h_{t - 1}, a_{t - 1})

= \int E (Y^{(\overline{a}_{t}, \underline{Q}_{t + 1})} ∣ H_{t} = h_{t}, A_{t} = a_{t}) d Q_{t} (a_{t} ∣ h_{t}) d P (x_{t} ∣ h_{t - 1}, a_{t - 1})

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Causal Inference Techniques · Statistical Methods and Bayesian Inference · Statistical Methods and Inference

Full text

Nonparametric causal effects based on incremental propensity score interventions

Edward H. Kennedy

Department of Statistics, Carnegie Mellon University Edward Kennedy is Assistant Professor in the Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213 (e-mail: [email protected]). The author thanks Traci Kennedy, Miguel Hernan, Kwangho Kim, and the Causal Inference Reading Group at Carnegie Mellon for helpful discussions and comments, and Valerio Bacak for guidance on the National Longitudinal Survey of Youth data analysis.

Abstract

Most work in causal inference considers deterministic interventions that set each unit’s treatment to some fixed value. However, under positivity violations these interventions can lead to non-identification, inefficiency, and effects with little practical relevance. Further, corresponding effects in longitudinal studies are highly sensitive to the curse of dimensionality, resulting in widespread use of unrealistic parametric models. We propose a novel solution to these problems: incremental interventions that shift propensity score values rather than set treatments to fixed values. Incremental interventions have several crucial advantages. First, they avoid positivity assumptions entirely. Second, they require no parametric assumptions and yet still admit a simple characterization of longitudinal effects, independent of the number of timepoints. For example, they allow longitudinal effects to be visualized with a single curve instead of lists of coefficients. After characterizing incremental interventions and giving identifying conditions for corresponding effects, we also develop general efficiency theory, propose efficient nonparametric estimators that can attain fast convergence rates even when incorporating flexible machine learning, and propose a bootstrap-based confidence band and simultaneous test of no treatment effect. Finally we explore finite-sample performance via simulation, and apply the methods to study time-varying sociological effects of incarceration on entry into marriage.

Keywords: observational study, positivity, stochastic intervention, time-varying confounding, treatment effect.

1 Introduction

Most work in causal inference considers deterministic interventions that set each unit’s treatment to some fixed value. For example, the usual average treatment effect indicates how mean outcomes would change if all units were uniformly assigned treatment versus control. Similarly, standard marginal structural models [28] describe outcomes had all units followed given exposure trajectories over time (e.g., treated at every time, treated after time $t$ , etc.). However, these simple effects are not identified when some units have zero chance to receive given treatment options. This is a violation of the so-called positivity assumption, which has been known in the causal inference literature since at least [30]. Even if positivity is only nearly violated (i.e., chances of some treatment options are merely small), the finite-sample behavior of many common estimators can be severely degraded [14, 18].

Similarly, even if positivity holds, in longitudinal studies these standard effects are afflicted by a curse of dimensionality in the number of study timepoints: exponentially many samples are needed to learn about all treatment trajectories. For example, in a simple trial with a binary randomized treatment and ten timepoints, we would need nearly 12,000 patients to guarantee $<$ 1% chance of having any unrepresented exposure trajectories. The usual way to deal with this problem is to assume it away with a parametric model for how outcomes change across trajectories. However, such models are typically severely wrong if overly simple, and can be hard to interpret otherwise (and are often still misspecified). Further, in the real world, treatment is typically not applied uniformly, so static deterministic interventions may not be of practical policy interest. For example, most medical treatments would never be applied indiscriminately, but instead would be recommended or not based on characteristics of the patient and physician prescribing preference.

Thus there has been substantial recent interest in dynamic and stochastic interventions, which can depend on unit characteristics and be random rather than deterministic. Examples have been studied for point exposures by [34, 22, 10], and in longitudinal studies by [20, 26, 36, 27, 33, 4, 43]. Particularly relevant to this paper is work by [9, 18, 8, 12, 44], who consider interventions that depend on the observational treatment process. However, to the best of our knowledge, none of the existing intervention effects both avoids positivity conditions entirely and is completely nonparametric (even in studies with many timepoints).

In this paper we propose novel incremental intervention effects, based on shifting propensity scores rather than setting treatment values. We show that such effects can be identified and estimated without any positivity or parametric assumptions, and argue that they can be more realistic than other interventions. One trade-off is that they yield effects that are more descriptive than prescriptive. We develop nonparametric influence-function-based estimators that can incorporate flexible machine learning tools, while still providing valid parametric-rate inference. Our methods for uniform inference (using the multiplier bootstrap) also yield a new general test of no treatment effect. We conclude with a simulation study, and apply the methods in a longitudinal study of incarceration effects.

2 Notation & Setup

We consider the case where we observe a sample $(\mathbf{Z}_{1},...,\mathbf{Z}_{n})$ of iid observations from distribution $\mathbb{P}$ , with

[TABLE]

for covariates $\mathbf{X}_{t}$ and treatment $A_{t}$ at time $t$ , and outcome $Y$ . For simplicity, at present we consider binary treatments (so the support of $A_{t}$ is $\mathcal{A}=\{0,1\}$ ) and completely observed $\mathbf{Z}$ (so there is no missingness or dropout), but extensions will appear in future work. We use overbars to denote the past history of a variable, so that $\overline{\mathbf{X}}_{t}=(\mathbf{X}_{1},...,\mathbf{X}_{t})$ and $\overline{A}_{t}=(A_{1},...,A_{t})$ , for example, and we let $\mathbf{H}_{t}=(\overline{\mathbf{X}}_{t},\overline{A}_{t-1})$ denote the past history just prior to treatment at time $t$ , with support $\mathcal{H}_{t}$ .

*Remark 1**.*

The above data setup also covers the case where outcomes are time-varying, i.e., where rather than $\mathbf{Z}$ the observations are given by $\mathbf{Z}^{*}=(\mathbf{X}^{*}_{1},A_{1},Y_{1},...,\mathbf{X}^{*}_{T^{*}},A_{T^{*}},Y_{T^{*}})$ . This follows since we can let $\mathbf{X}_{t}=(\mathbf{X}^{*}_{t},Y_{t-1})$ and $Y=Y_{T^{*}}$ in our original formulation. If interest centers on treatment effects on an earlier outcome $Y_{t}$ (for $t<T^{*}$ ), rather than $Y_{T^{*}}$ , then we can let $Y=Y_{t}$ and truncate the sequence, defining $T=t$ instead of $T=T^{*}$ .

In this paper we use potential outcomes [31], and so let $Y^{\overline{a}_{T}}$ denote the outcome that would have been observed had the treatment sequence $\overline{A}_{T}=\overline{a}_{T}$ been received. The quantity $Y^{\overline{a}_{T}}$ is an example of a counterfactual based on a deterministic static intervention, in which a fixed treatment is applied with probability one and regardless of covariate information (e.g., $\overline{A}_{T}=\overline{a}_{T}$ is applied uniformly across units, regardless of past histories $\mathbf{H}_{t}$ ). Deterministic static interventions are the kinds of interventions most commonly considered in practice; examples include the average effect of a point exposure $\mathbb{E}(Y^{1}-Y^{0})$ , and standard marginal structural model and structural nested model parameters $\mathbb{E}(Y^{\overline{a}_{T}})$ and $\mathbb{E}(Y^{\overline{a}_{t},0}-Y^{\overline{a}_{t-1},0}\mid\mathbf{H}_{t},A_{t})$ , respectively [28, 24].

Alternatively, in deterministic dynamic interventions [23, 20] treatment at time $t$ is assigned according to a fixed rule $d_{t}:\mathcal{H}_{t}\mapsto\mathcal{A}$ that depends on past history. Characterizing and estimating the optimal such rule is a major goal in the optimal dynamic treatment regime literature [19, 25]. The potential outcome under a sequence of hypothetical rules $\textbf{d}=\overline{d}_{T}=(d_{1},...,d_{T})$ can be expressed as $Y^{\textbf{d}}$ , where the dependence of the rules on the histories $\mathbf{H}_{t}$ is suppressed for notational simplicity, and d is lower-case since the rule is non-random (given the histories). A simple example is the rule $d_{t}=\mathbbold{1}(V_{t}\geq c_{t})$ that assigns treatment if a variable $V_{t}\subset\mathbf{X}_{t}$ passes some threshold $c_{t}\in\mathbb{R}$ , with corresponding mean outcome $\mathbb{E}(Y^{(d_{1},...,d_{T})})$ .

In this paper we propose a new form of stochastic dynamic intervention, which is an intervention where treatment at each time is randomly assigned based on a conditional distribution $q_{t}(a_{t}\mid\mathbf{h}_{t})$ . Stochastic interventions can thus be viewed as random choices among deterministic rules. These interventions have not been studied as extensively as other types, with important exceptions listed in the Introduction (see for example [9, 12, 44] for review). We express the potential outcome under a stochastic intervention as $Y^{\textbf{Q}}$ , where $\textbf{Q}=(Q_{1},...,Q_{T})$ represents draws from the conditional distributions $q_{t}$ , and is upper-case since the intervention is stochastic. A simple stochastic intervention related to the previous rule $d_{t}=\mathbbold{1}(V_{t}\geq c_{t})$ would be $Q_{t}=\mathbbold{1}(V_{t}\geq C_{t})$ where $C_{t}\sim N(0,1)$ is now a random threshold.

3 Incremental Intervention Theory

In this section we first describe a new class of stochastic dynamic intervention, which we call incremental propensity score interventions, and give some motivation and examples. We go on to show that these interventions are nonparametrically identified without requiring any positivity restrictions on the propensity scores (e.g., the propensity scores do not need to be bounded away from zero and one). Then we describe the efficiency theory for estimating mean outcomes under these interventions, based on a new result for general stochastic interventions that depend on the observational treatment distribution.

3.1 Proposed Interventions

In this paper we propose incremental propensity score interventions that replace the observational treatment process (i.e., propensity score) $\pi_{t}(\mathbf{h}_{t})=\mathbb{P}(A_{t}=1\mid\mathbf{H}_{t}=\mathbf{h}_{t})$ with a shifted version, based on multiplying the odds of receiving treatment. Specifically, our proposed intervention replaces the observational propensity score $\pi_{t}$ with the distribution defined by

[TABLE]

The increment parameter $\delta\in(0,\infty)$ is user-specified, and dictates the extent to which the propensity scores are fluctuated from their actual observational values. In practice we recommend considering a range of $\delta$ values, depending on the scientific question at hand. Although the intervention distribution $q_{t}$ depends on both the increment $\delta$ and the observational propensity $\pi_{t}$ , we often drop this dependence and write $q_{t}(\mathbf{h}_{t};\delta,\pi_{t})=q_{t}(\mathbf{h}_{t})$ or just $q_{t}$ to ease notation. As mentioned earlier, [9, 18, 8, 12, 44] have considered other different interventions that depend on the observational treatment process.

Our choice of $q_{t}$ in (1) is motivated by both interpretability and mathematical convenience. In particular it yields

[TABLE]

whenever $0<\pi_{t}<1$ , so the increment $\delta$ is just an odds ratio, indicating how the intervention changes the odds of receiving treatment. As with usual odds ratios, if $\delta>1$ then the intervention increases the odds of receiving treatment, and if $\delta<1$ then the intervention decreases these odds (if $\delta=1$ then $q_{t}=\pi_{t}$ so the treatment process is left unchanged). For example, an intervention with $\delta=1.5$ would increase the odds of receiving treatment by 50% for each patient with $0<\pi_{t}<1$ : a patient with an actual 25% chance of receiving treatment (1/3 odds) would instead have a 33% chance (1/2 odds) under the intervention.

In addition to other kinds of shifts (e.g., risk ratios) in future work we will consider interventions for which $\delta=\delta_{t}(\mathbf{h}_{t})$ can depend on time and covariate history. However, even with $\delta$ fixed, incremental interventions are still dynamic, since the conditional distribution $q_{t}$ depends on the covariate history. In other words, these interventions are personalized to patient characteristics through the propensity score. For example, under an intervention with $\delta=1.5$ , a patient with a 50% chance of receiving treatment observationally would instead have a 60% chance under the intervention, while a patient whose chances were 5% would only see an increase to 7.3% (i.e., multiplying the odds by a fixed factor yields different shifts in the probabilities). Contrast this with a usual static intervention, which flatly assigns all patients a particular sequence $\overline{a}_{T}$ (or a random choice among such sequences) regardless of propensity score. Figure 1 illustrates incremental interventions with data on $n=20$ simulated observations in a hypothetical study with $T=2$ timepoints.

Figure 1 also helps illustrate why incremental interventions require weak identifying assumptions (to be discussed shortly), and can be more likely to occur in practice, compared to other kinds of interventions. Although usual static interventions (e.g., setting $A=1$ ) might require forcing treatment on someone with only a 1% chance of receiving it in the real world, the proposed incremental intervention only requires that the propensity score be slightly shifted (e.g., from 1% to 1.5% when $\delta=1.5$ ). In settings where treatment changes occur more gradually (e.g., when physicians slightly reduce treatment intensity, or judges become slightly more lenient), incremental interventions might be similar to treatment changes that could occur naturally in practice. Even if not especially likely to occur, incremental interventions still might be more realistic than standard static interventions since they are “closer” to the observational treatment distribution. Of course, incremental interventions can still be useful analysis tools even if not necessarily mimicking realistic treatment changes, as discussed in more detail in the next subsection.

One trade-off between incremental and more standard interventions is that, by virtue of their dependence on the observational treatment process, incremental interventions will often play a more descriptive rather than prescriptive role. Incremental interventions allow one to describe how outcomes would vary with gradual changes in treatment intensity; but they are typically less useful for making specific recommendations about optimal treatment.

Nonetheless incremental interventions do generalize common static and dynamic interventions (both deterministic and stochastic), since they can recover these interventions with particular choices of $\delta_{t}(\mathbf{h}_{t})$ . For example, if positivity holds then taking the values $\delta=\infty$ and $\delta=0$ recovers the usual static interventions, yielding potential outcomes $Y^{\mathbf{1}}$ and $Y^{\mathbf{0}}$ under exposures $\overline{A}_{T}=(1,...,1)$ and $\overline{A}_{T}=(0,...,0)$ , respectively. Thus incremental interventions can also be used for a sensitivity analysis of the positivity assumption. If positivity is violated, then $q_{t}\rightarrow\mathbbold{1}(\pi_{t}>0)$ for $\delta\rightarrow\infty$ , and $q_{t}\rightarrow\mathbbold{1}(\pi_{t}=1)$ for $\delta\rightarrow 0$ ; these are the “realistic individualized treatment rules” proposed by [36, 18], which are dynamic but deterministic. Finally, incremental interventions can recover general stochastic dynamic interventions (where $q^{*}_{t}$ replaces the propensity score $\pi_{t}$ ) by taking $\delta_{t}=\{q^{*}_{t}/(1-q^{*}_{t})\}/\{\pi_{t}/(1-\pi_{t})\}$ for some arbitrary $q^{*}_{t}$ , whenever defined.

3.2 Identification

In the previous section we described incremental propensity score interventions, which are based on shifting the propensity scores $\pi_{t}$ by multiplying the odds of receiving treatment by $\delta$ . We will now give assumptions that allow for identification of the entire marginal distribution of the resulting potential outcomes $Y^{\mathbf{Q}(\delta)}$ , although for simplicity we focus on estimating just the mean of this distribution.

Importantly, identification of incremental intervention effects requires no conditions on the propensity scores $\pi_{t}$ , since propensity scores that equal zero or one are not shifted. This is different from more common interventions that require propensity scores to be bounded or otherwise restricted in some way. Specifically we only require the following consistency and exchangeability assumptions.

*Assumption 1** (Consistency).*

$Y=Y^{\overline{a}_{T}}$ if $\overline{A}_{T}=\overline{a}_{T}$ .

*Assumption 2** (Exchangeability).*

$A_{t}\perp\!\!\!\perp Y^{\overline{a}_{T}}\mid\mathbf{H}_{t}$ .

Consistency means observed outcomes equal corresponding potential outcomes under the observed treatment sequence; it would be violated for example in network settings with interference, where outcomes can be affected by other units’ treatment assignment. Exchangeability means treatment assignment is essentially randomized within covariate strata; it can hold by design in a trial, but in observational studies it requires sufficiently many relevant adjustment covariates to be collected. Importantly, no conditions are needed on the propensity score, since fluctuations based on $q_{t}$ in (1) will leave the propensity score unchanged if it is zero or one. To the best of our knowledge, the only other work that has discussed removing positivity conditions entirely is [36, 18]; however, they utilize different (deterministic) interventions and consider parametric effect models. General interventions could be modified to similarly avoid positivity, by redefining them to not affect subjects with extreme propensity scores. Two benefits of incremental interventions are (i) avoiding positivity occurs naturally and smoothly via the definition of $q_{t}$ , rather than an inserted indicator; and (ii) as discussed shortly, effects under a wide range of treatment intensities can be summarized with a single curve rather than many regime-specific parameters.

The next theorem shows that the mean counterfactual outcome $\psi(\delta)=\mathbb{E}(Y^{\mathbf{Q}(\delta)})$ under the incremental intervention is identified and can be expressed uniquely in terms of the observed data distribution $\mathbb{P}$ .

Theorem 1.

Under Assumptions 1–2, and if $\delta\in\mathcal{D}=[\delta_{\ell},\delta_{u}]$ for $0<\delta_{\ell}\leq\delta_{u}<\infty$ , the incremental effect $\psi(\delta)=\mathbb{E}(Y^{\mathbf{Q}(\delta)})$ equals

[TABLE]

where $\mathcal{X}=\mathcal{X}_{1}\times\dots\times\mathcal{X}_{T}$ and $\mu(\mathbf{h}_{T},a_{T})=\mathbb{E}(Y\mid\mathbf{H}_{T}=\mathbf{h}_{T},A_{T}=a_{T})$ .

Proofs of all theorems are given in the Appendix. Theorem 1 follows from Robins’ g-formula [23], replacing the general treatment process under intervention with the proposed incremental intervention $q_{t}$ indexed by $\delta$ . The next corollary shows how the expression for $\psi(\delta)$ simplifies in point exposure studies.

Corollary 1.

When $T=1$ the identifying expression for $\psi(\delta)$ simplifies to

[TABLE]

with $\mu(\mathbf{x},a)=\mathbb{E}(Y\mid\mathbf{X}=\mathbf{x},A=a)$ .

This corollary shows that, when $T=1$ , the incremental effect $\psi(\delta)$ is a weighted average of the regression functions $\mu(\mathbf{x},1)$ and $\mu(\mathbf{x},0)$ , where the weight on $\mu(\mathbf{x},1)$ is given by the fluctuated intervention propensity score $q(\mathbf{x})=\delta\pi(\mathbf{x})/\{\delta\pi(\mathbf{x})+1-\pi(\mathbf{x})\}$ (and the weight on $\mu(\mathbf{x},0)$ is $1-q(\mathbf{x})$ ). This weight tends to zero as $\delta\rightarrow 0$ (whenever $\pi(\mathbf{x})<1$ ) and tends to one for $\delta\rightarrow\infty$ (whenever $\pi(\mathbf{x})>0$ ), showing again that $\delta$ controls how far away the intervention is from the observational treatment process. Incremental interventions can range from assigning no one to everyone treatment, but also include an infinite middle ground. Note that we can also write $\psi(\delta)=\mathbb{E}\{\mu(\mathbf{X},A^{*})\}$ where $A^{*}$ is a simulated version of treatment under the incremental intervention, with $(A^{*}\mid\mathbf{X}=\mathbf{x})\sim\text{Bernoulli}\{q(\mathbf{x})\}$ .

Beyond the fact that identifying incremental effects does not require positivity conditions, targeting $\psi(\delta)$ has another crucial advantage: it is always a one-dimensional curve, regardless of the number of timepoints $T$ , and even though it characterizes infinitely many interventions nonparametrically. In contrast, for more traditional causal effects, there is a distinct tension between the number of hypothetical interventions studied and the complexity of the effect. For example one could consider the mean outcome $\mathbb{E}(Y^{\overline{a}_{T}})$ under all $2^{T}$ deterministic interventions $\overline{a}_{T}\in\{0,1\}^{T}$ , but this requires exponentially many parameters without further assumptions. One could impose smoothness across the $2^{T}$ interventions to reduce the parameter space, but this will yield bias if the smoothness assumptions are incorrect. Conversely, describing the mean outcome under a small number of interventions such as $\overline{a}_{T}=\mathbf{0}$ and $\overline{a}_{T}=\mathbf{1}$ (i.e., never treated and always treated) requires only a few parameters, but gives a very limited picture of how changing treatment affects outcomes. In contrast, incremental interventions allow exploration of infinitely many interventions (one for each $\delta\in\mathcal{D}$ ), without any parametric assumptions, regardless of how large $T$ is, and still only yield a single curve $\psi:\mathcal{D}\mapsto\mathbb{R}$ that can be easily visualized with a plot.

3.3 Efficiency Theory

So far we have introduced incremental propensity score interventions, and showed that resulting effects can be identified without requiring positivity assumptions. Now we will develop general efficiency theory for the incremental effect $\psi(\delta)=\mathbb{E}(Y^{\mathbf{Q}(\delta)})$ .

We refer elsewhere [3, 40, 37, 35, 15] for more detailed information about nonparametric efficiency theory, and so give only a brief review here. A fundamental goal is characterizing so-called influence functions, and in particular finding the efficient influence function. These tasks are essential for a number of reasons. Perhaps most importantly, influence functions can be used to construct estimators with very favorable properties, such as double robustness or general second-order bias (called Neyman orthogonality by [5]). Estimators with these properties can attain fast parametric convergence rates, even in nonparametric settings where nuisance functions are estimated at slower rates via flexible machine learning. The efficient influence function (the only influence function in fully nonparametric models) is particularly important since its variance equals the efficiency bound, thus providing an important benchmark and allowing for the construction of optimal estimators. Influence functions are also critical for understanding the asymptotics of corresponding estimators, since by definition any regular asymptotically linear estimator can be expressed as the empirical average of an influence function plus a negligible $o_{p}(1/\sqrt{n})$ error term.

Mathematically, influence functions are essentially derivatives. More specifically, viewed as elements of the Hilbert space of mean-zero finite-variance functions, influence functions are those elements whose covariance with parametric submodel scores equals a pathwise derivative of the target parameter. Influence functions also correspond to the derivative in a Von Mises expansion of the target parameter (a distributional analog of a Taylor expansion), and in nonparametric models with discrete support they are a Gateaux derivative of the parameter in the direction of a point mass contamination.

The result of the next theorem is an expression for the efficient influence function for the incremental effect $\psi(\delta)$ under a nonparametric model, which allows the data-generating process $\mathbb{P}$ to be infinite-dimensional. This efficient influence function can be used to characterize the efficiency bound for estimating $\psi(\delta)$ , and we will see how this bound changes in randomized trial settings where the propensity scores are known. Then in the next section the efficient influence function will be used to construct estimators, including optimally efficient estimators with the second-order bias property discussed earlier.

Theorem 2.

The efficient influence function for $\psi(\delta)$ under a nonparametric model (with unknown propensity scores) is given by

[TABLE]

where for $t=0,...,T-1$ we define

[TABLE]

with $\mathcal{R}_{t}=(\mathcal{H}_{T}\times\mathcal{A}_{T})\setminus\mathcal{H}_{t}$ , and for $t=T$ we let $m_{T}(\mathbf{h}_{T},a_{T})=\mu(\mathbf{h}_{T},a_{T})$ .

We give a proof of Theorem 2 in Section 8.2 of the Appendix, by way of deriving the efficient influence function for general stochastic interventions with treatment distributions that depend on the observational propensity scores. To the best of our knowledge this result has not yet appeared in the literature, and will be useful for general stochastic interventions beyond those with the incremental form proposed here, regardless of whether they depend on the observational treatment process or not. Our result recovers previously proposed influence functions for other stochastic intervention effects in the $T=1$ setting as special cases [9, 12], and could be used to generalize this work to the multiple timepoint setting. Further, our result can also be used to construct the efficient influence function and corresponding estimator for other stochastic intervention effects, for which there are currently only likelihood-based and weighting estimators available [18, 44].

The structure of the efficient influence function in Theorem 2 is somewhat similar to that of more standard effect parameters, in the sense that it consists of an inverse-probability-weighted term (the rightmost product term in the second line) as well as an augmentation term. However the particular form of the weighted and augmentation terms are quite different from those that appear in more common causal and missing data problems. We discuss the weighted term in more detail in Section 4.1, when we introduce an inverse-probability-weighted estimator for $\psi(\delta)$ . The augmentation term involves the functions $m_{t}$ , which can be viewed as marginalized versions of the full regression function $\mu(\mathbf{h}_{t},a_{t})$ that conditions on all of the past (with smaller values of $t$ coinciding with more marginalization).

Note that for notational simplicity we drop the dependence of $m_{t}$ on $\delta$ and $(\pi_{t+1},...,\pi_{T})$ , as well as on the conditional densities of the covariates $(\mathbf{X}_{t+1},...,\mathbf{X}_{T})$ . Importantly, the pseudo-regression functions $m_{t}$ also have a recursive sequential regression formulation, as displayed in the subsequent remark.

*Remark 2**.*

The functions $m_{t}$ can be equivalently expressed recursively as

[TABLE]

for $t=1,...,T$ and $m_{T}(\mathbf{h}_{T},a_{T})=\mu(\mathbf{h}_{T},a_{T})$ as before.

Viewing the $m_{t}$ functions in the above sequential regression form is very practically useful for the purposes of estimation. Specifically it shows how to bypass conditional density estimation, and instead construct estimates $\hat{m}_{t}$ using regression methods that are more commonly found in statistical software.

It is also important to note that the pseudo-regressions $m_{t}$ depend on the observational treatment process; this is not the case for analogous influence function terms for more common parameters like $\mathbb{E}(Y^{\overline{a}_{T}})$ . This is due to the fact that the functional $\psi(\delta)$ itself depends on the observational treatment process, which means for example that double robustness is not possible (though second-order bias still is) and that the efficiency bound is different when the propensity scores are known versus unknown. The issue of double robustness is discussed in more detail in Section 4.3. In Lemmas 2 and 4 in the Appendix we give the efficient influence function when the propensity scores are known, as well as a specific expression for the contribution that comes from the scores being unknown, both for general (possibly non-incremental) stochastic interventions.

In the next corollary we give the efficient influence function for the incremental effect in a single timepoint study, which has a simpler and more intuitive form.

Corollary 2.

When $T=1$ the efficient influence function for $\psi(\delta)$ simplifies to

[TABLE]

where $\gamma(\mathbf{x})=\mu(\mathbf{x},1)-\mu(\mathbf{x},0)$ and

[TABLE]

is the uncentered efficient influence function for the parameter $\mathbb{E}\{\phi_{a}(\mathbf{Z})\}=\mathbb{E}\{\mu(\mathbf{X},a)\}$ .

The efficient influence function in the $T=1$ case is therefore a simple weighted average of the influence functions for $\mathbb{E}(Y^{1})$ and $\mathbb{E}(Y^{0})$ , plus a contribution that comes from the fact that the propensity score is unknown and must be estimated. If the propensity scores were known, the efficient influence function would just be the first weighted average term in Corollary 2. As will be discussed in more detail in the next section, estimating the influence function in the $T=1$ case is straightforward as it only depends on the regression function $\mu$ and propensity score $\pi$ (rather than the sequential psuedo-regression functions $m_{t}$ that appear in the longitudinal setting).

4 Estimation & Inference

In this section we develop estimators for the proposed incremental effect $\psi(\delta)$ . We focus our analysis on flexible sample-splitting estimators that allow arbitrarily complex nuisance estimation, e.g., via high-dimensional regression and machine learning methods; however we also discuss simpler estimators that rely on empirical process conditions to justify full-sample nuisance estimation. In particular we show that there exists an inverse-probability-weighted estimator of the incremental effect that is especially easy to compute. We go on to describe the asymptotic behavior of our proposed estimators, both from a pointwise perspective and uniformly across a continuum of increment parameter $\delta$ values. Finally we propose a computationally efficient multiplier-bootstrap approach for constructing uniform confidence bands across $\delta$ , and use it to develop a novel test of no treatment effect.

4.1 Simple Estimators

We first describe various simple estimators of the incremental effect, which provide some intuition for the main estimator we propose in the next section. The simple inverse-probability-weighted estimator discussed here might be preferred if the propensity scores can be modeled well (e.g., in a randomized trial) and computation comes at a high cost.

Let $\varphi(\mathbf{Z};\boldsymbol{\eta},\delta)$ denote the (uncentered) efficient influence function from Theorem 2, which is a function of the observations $\mathbf{Z}$ and the nuisance functions

[TABLE]

By uncentered we mean that $\varphi(\mathbf{Z};\boldsymbol{\eta},\delta)$ equals the quantity displayed in Theorem 2 plus the parameter $\psi(\delta)$ , so that $\mathbb{E}\{\varphi(\mathbf{Z};\boldsymbol{\eta},\delta)\}=\psi(\delta)$ by construction.

If one is willing to rely on appropriate empirical process conditions (e.g., Donsker-type or low entropy conditions, as discussed by [41], [39], and others) then a natural estimator would be given by the solution to the efficient influence function estimating equation, i.e., the Z-estimator

[TABLE]

where $\hat{\boldsymbol{\eta}}$ are some initial estimators of the nuisance functions, and $\mathbb{P}_{n}$ denotes the empirical measure so that sample averages can be written as $\frac{1}{n}\sum_{i}f(\mathbf{Z}_{i})=\mathbb{P}_{n}\{f(\mathbf{Z})\}=\int f(\mathbf{z})\ d\mathbb{P}_{n}(\mathbf{z})$ . An algorithm describing how to compute the estimator $\hat{\psi}^{*}(\delta)$ is given in Section 8.3 of the Appendix. As a special case, if the propensity scores $\pi_{t}$ can be correctly modeled parametrically (e.g., when they are known as in a randomized trial) then one could use the simple inverse-probability-weighted estimator given by

[TABLE]

This estimator can be computed very quickly, as it only requires fitting a single pooled regression to estimate $\pi_{t}$ and then taking a weighted average. However it has disadvantages, as will be discussed shortly. Also note that it is a special case of $\hat{\psi}^{*}(\delta)$ that sets $\hat{m}_{t}=0$ .

It is instructive to compare the inverse-probability-weighted estimator above to that for a usual deterministic static intervention effect like $\mathbb{E}(Y^{\overline{a}_{T}})$ . For example, the inverse-probability-weighted estimator of the quantity $\mathbb{E}(Y^{\mathbf{1}})$ weights each always-treated unit by the (inverse) product of propensity scores $\prod_{t}\hat{\pi}_{t}$ , and otherwise assigns zero weight. In contrast, when $\delta>1$ the estimator $\hat{\psi}_{ipw}^{*}(\delta)$ weights each treated time by the (inverse of the) propensity score plus some fractional contribution of its complement, i.e., $\hat{\pi}_{t}+(1-\hat{\pi}_{t})/\delta$ , where the size of the contribution decreases with $\delta$ ; untreated times are weighted by this same amount, except the entire weight is further downweighted by a factor of $\delta$ . Therefore when $\delta$ is very large, the two inverse-probability-weighted estimators coincide. However, for cases when $\delta$ is not very large, this also indicates why the estimator $\hat{\psi}_{ipw}^{*}(\delta)$ is immune to extreme weights: even if $\hat{\pi}_{t}$ is very small, there will still be a contribution to the weight that moves it away from zero.

4.2 Proposed Estimator

Although the estimators presented in the previous section are relatively simple, they have some disadvantages. First, the inverse-probability-weighted estimator $\hat{\psi}_{ipw}^{*}(\delta)$ will in general not be $\sqrt{n}$ -consistent unless all the propensity scores are estimated with correctly specified parametric models; this is typically an unreasonable assumption outside of randomized trials where propensity scores are known. In point exposure studies with a single timepoint, (saturated) parametric models might be used if the adjustment covariates are low-dimensional. However, in studies with more than just a few timepoints, the histories $\mathbf{H}_{t}$ can easily be high-dimensional even if the covariates $\mathbf{X}_{t}$ are low-dimensional, making parametric modeling assumptions less tenable even in the low-dimensional $\mathbf{X}_{t}$ case.

In contrast, the more general Z-estimator $\hat{\psi}^{*}(\delta)$ can converge at fast parametric $\sqrt{n}$ rates (and attain the efficiency bound from Section 3.3), even when the propensity scores $\pi_{t}$ and pseudo-outcome regressions $m_{t}$ are modeled flexibly and estimated at rates slower than $\sqrt{n}$ , as long as these nuisance functions are estimated consistently at rates faster than $n^{1/4}$ . Lowering the bar from $\sqrt{n}$ to $n^{1/4}$ for the nuisance estimator convergence rate allows much more flexible nonparametric methods to be employed; for example these rates are attainable under smoothness, sparsity, or other nonparametric structural constraints. However, as mentioned earlier, these Z-estimator properties require some empirical process conditions that restrict the flexibility and complexity of the nuisance estimators. This is essentially because $\hat{\psi}^{*}(\delta)$ uses the sample twice, once for estimating the nuisance functions $\boldsymbol{\eta}$ and again for evaluating the influence function $\varphi$ . Without restricting the entropy of the nuisance estimators, using the full sample in this way can result in overfitting and intractable asymptotics. Unfortunately, the required empirical process conditions may not be satisfied by many modern regression methods, such as random forests, boosting, deep learning, or complicated ensembles.

In order to accommodate the added complexity of these modern machine learning tools, we use sample splitting [45, 5]. This avoids the problematic “double” use of the sample and, as will be seen in the next section, yields asymptotically normal and efficient estimators without any restrictions on the complexity of the nuisance estimators (however, $n^{1/4}$ -type rate conditions are still required).

Therefore we randomly split the observations $(\mathbf{Z}_{1},...,\mathbf{Z}_{n})$ into $K$ disjoint groups, using a random variable $S$ drawn independently of the data, where $S_{i}\in\{1,...,K\}$ denotes the group membership for unit $i$ . Then our proposed estimator is given by

[TABLE]

where we let $\mathbb{P}_{n}^{k}$ denote empirical averages only over the set of units $\{i:S_{i}=k\}$ in group $k$ (i.e., $\mathbb{P}_{n}\{f(\mathbf{Z})\}=\sum_{i}f(\mathbf{Z}_{i})\mathbbold{1}(S_{i}=k)/\sum_{i}\mathbbold{1}(S_{i}=k)$ ), and we let $\hat{\boldsymbol{\eta}}_{\text{-}k}$ denote the nuisance estimator constructed excluding group $k$ , i.e., only using those units $\{i:S_{i}\neq k\}$ in groups $\mathcal{K}\setminus k$ . It is hoped that $\hat{\boldsymbol{\eta}}_{\text{-}k}$ is a rate-optimal estimator of the nuisance functions, for example constructed using kernels, splines, penalized regression, boosting, random forests, etc., or some ensemble-based combination.

An algorithm detailing exactly how to compute the estimator $\hat{\psi}(\delta)$ is given as follows. For reference, the algorithm for the non-sample splitting estimator $\hat{\psi}^{*}(\delta)$ is also given in Section 8.3 of the Appendix and contains the main ideas.

Algorithm 1.

For each $\delta$ and $k$ , letting $\mathbf{D}_{0}=\{\mathbf{Z}_{i}:S_{i}\neq k\}$ and $\mathbf{D}_{1}=\{\mathbf{Z}_{i}:S_{i}=k\}$ denote corresponding training and test data, respectively, and $\mathbf{D}=\mathbf{D}_{0}\cup\mathbf{D}_{1}$ :

Regress $A_{t}$ on $\mathbf{H}_{t}$ in $\mathbf{D}_{0}$ , obtain predicted values $\hat{\pi}_{t}(\mathbf{H}_{t})$ for each subject/time in $\mathbf{D}$ . 2. 2.

Construct time-dependent weights $W_{t}=\frac{\delta A_{t}+1-A_{t}}{\delta\hat{\pi}_{t}(\mathbf{H}_{t})+1-\hat{\pi}_{t}(\mathbf{H}_{t})}$ in $\mathbf{D}_{1}$ for each subject/time. 3. 3.

Calculate cumulative product weight $\widetilde{W}_{t}=\prod_{s=1}^{t}W_{s}$ in $\mathbf{D}_{1}$ for each subject/time. 4. 4.

For each time $t=T,T-1,...,1$ (starting with $R_{T+1}=Y$ ):

(a)

Regress $R_{t+1}$ on $(\mathbf{H}_{t},A_{t})$ in $\mathbf{D}_{0}$ , obtain predictions $\hat{m}_{t}(\mathbf{H}_{t},1)$ , $\hat{m}_{t}(\mathbf{H}_{t},0)$ in $\mathbf{D}$ . 2. (b)

Construct pseudo-outcome $R_{t}=\frac{\delta\hat{\pi}_{t}(\mathbf{H}_{t})\hat{m}_{t}(\mathbf{H}_{t},1)+\{1-\hat{\pi}_{t}(\mathbf{H}_{t})\}\hat{m}_{t}(\mathbf{H}_{t},0)}{\delta\hat{\pi}_{t}(\mathbf{H}_{t})+1-\hat{\pi}_{t}(\mathbf{H}_{t})}$ in $\mathbf{D}$ . 5. 5.

Compute time-dependent weights $V_{t}=\frac{A_{t}\{1-\hat{\pi}_{t}(\mathbf{H}_{t})\}-(1-A_{t})\delta\hat{\pi}_{t}(\mathbf{H}_{t})}{\delta/(1-\delta)}$ in $\mathbf{D}_{1}$ . 6. 6.

Compute $\varphi=\widetilde{W}_{T}Y+\sum_{t}\widetilde{W}_{t}V_{t}R_{t}$ in $\mathbf{D}_{1}$ and define $\hat{\psi}_{k}(\delta)$ to be its average in $\mathbf{D}_{1}$ .

Finally, set $\hat{\psi}(\delta)$ to be the average of the $K$ estimators $\hat{\psi}_{k}(\delta)$ , $k=1,...,K$ .

Importantly, computing $\hat{\psi}(\delta)$ only requires estimating regression functions (e.g., using random forests) and not conditional densities, due to the recursive regression formulation of the functions $m_{t}$ in Remark 2. Although the process can be somewhat computationally expensive depending on the number of timepoints $T$ , sample size $n$ , and grid density for $\delta$ , it is easily parallelizable due to the sample splitting. For a single timepoint all estimators are easy and fast to compute. In Section 8.6 of the Appendix, we provide a user-friendly R function for general use in cross-sectional or longitudinal studies; the function can also be found in the npcausal R package available at GitHub (github.com/ehkennedy/npcausal).

4.3 Weak Convergence

In this section we detail the main large-sample property of our proposed estimator, that $\hat{\psi}(\delta)$ is $\sqrt{n}$ -consistent and asymptotically normal under weak conditions (mostly only requiring that the nuisance functions are estimated at faster than $n^{1/4}$ rates). This result holds both pointwise for a given $\delta$ , and uniformly in the sense that, after scaling and when viewed as a random function on $\mathcal{D}=[\delta_{\ell},\delta_{u}]$ , the estimator converges in distribution to a Gaussian process. The latter fact is crucial for developing uniform confidence bands, as well as the test of no treatment effect we present in the next section. Importantly, the estimator attains fast $\sqrt{n}$ rates even under nonparametric assumptions and even though the target parameter is a curve; this is often not possible [17, 16].

In what follows we denote the squared $L_{2}(\mathbb{P})$ norm by $\|f\|^{2}={\int f(\mathbf{z})^{2}\ d\mathbb{P}(\mathbf{z})}$ . When necessary, we depart slightly from previous sections and index the pseudo-regression functions $m_{t,\delta}$ (and their estimators $\hat{m}_{t,\delta}$ ) by both time $t$ and the increment parameter $\delta$ . The next result lays the foundation for our proposed inferential and testing procedures.

Theorem 3.

Let $\hat{\sigma}^{2}(\delta)=\mathbb{P}_{n}[\{\varphi(\mathbf{Z};\hat{\boldsymbol{\eta}}_{\text{-}S},\delta)-\hat{\psi}(\delta)\}^{2}]$ denote the estimator of the variance function $\sigma^{2}(\delta)=\mathbb{E}[\{\varphi(\mathbf{Z};\boldsymbol{\eta},\delta)-\psi(\delta)\}^{2}]$ . Assume:

The set $\mathcal{D}=[\delta_{\ell},\delta_{u}]$ is bounded with $0<\delta_{\ell}\leq\delta_{u}<\infty$ . 2. 2.

$\mathbb{P}\{|m_{t}(\mathbf{H}_{t},A_{t})|\leq C\}=\mathbb{P}\{|\hat{m}_{t}(\mathbf{H}_{t},A_{t})|\leq C\}=1$ * for some $C<\infty$ and all $t$ .* 3. 3.

$\sup_{\delta\in\mathcal{D}}|\frac{\hat{\sigma}(\delta)}{\sigma(\delta)}-1|=o_{\mathbb{P}}(1)$ , and $\|\sup_{\delta\in\mathcal{D}}|\varphi(\mathbf{z};\hat{\boldsymbol{\eta}},\delta)-\varphi(\mathbf{z};\boldsymbol{\eta},\delta)|\ \!\|=o_{\mathbb{P}}(1)$ . 4. 4.

$\Big{(}\sup_{\delta\in\mathcal{D}}\|\hat{m}_{t,\delta}-m_{t,\delta}\|+\|\hat{\pi}_{t}-\pi_{t}\|\Big{)}\|\hat{\pi}_{s}-\pi_{s}\|=o_{\mathbb{P}}(1/\sqrt{n})$ * for $s\leq t\leq T$ .*

Then

[TABLE]

in $\ell^{\infty}(\mathcal{D})$ , where $\mathbb{G}(\cdot)$ is a mean-zero Gaussian process with covariance $\mathbb{E}\{\mathbb{G}(\delta_{1})\mathbb{G}(\delta_{2})\}=\mathbb{E}\{\widetilde{\varphi}(\mathbf{Z};\boldsymbol{\eta},\delta_{1})\widetilde{\varphi}(\mathbf{Z};\boldsymbol{\eta},\delta_{2})\}$ and $\widetilde{\varphi}(\mathbf{z};\boldsymbol{\eta},\delta)=\{\varphi(\mathbf{z};\boldsymbol{\eta},\delta)-\psi(\delta)\}/\sigma(\delta)$ .

The proof of Theorem 3 is given in Section 8.4 of the Appendix. The logic of the proof is roughly similar to that used by [2], but we avoid their restrictions on nuisance function entropy by sample-splitting and arguing conditionally on the training data. This allows for the use of arbitrarily complex estimators $\hat{\boldsymbol{\eta}}$ , such as random forests, boosting, etc. We also do not need explicit smoothness assumptions on $\psi(\delta)$ or $\varphi(\mathbf{Z};\boldsymbol{\eta},\delta)$ since they are necessarily Lipschitz in $\delta$ by construction, based on our choice of the incremental intervention distribution $q_{t}$ .

Assumptions 1–2 of Theorem 3 are mild boundedness conditions on the set $\mathcal{D}$ of $\delta$ values and the functions $m_{t}$ and their estimators, respectively. Assumption 2 could be relaxed at the expense of a less simple proof, for example with bounds on $L_{p}$ norms. Assumption 3 is a basic and mild consistency assumption, with no requirement on rates of convergence. The main substantive assumption is Assumption 4, which says the nuisance estimators must be consistent and converge at a fast enough rate (essentially $n^{1/4}$ in $L_{2}$ norm).

Importantly, the rate condition in Assumption 4 can be attained under nonparametric smoothness, sparsity, or other structural constraints. We are agnostic about how such rates might be attained since the particular required assumptions are problem-dependent; in practice we suggest using ensemble learners that can adapt to diverse kinds of structure. The particular form of the rate requirement indicates that double robustness is not possible, since we need products of the form $\|\hat{\pi}_{t}-\pi_{t}\|\|\hat{\pi}_{s}-\pi_{s}\|$ to be small, thus requiring consistent estimation of the propensity scores (albeit only at slower than parametric rates). If the propensity scores are known as in a randomized trial, then Assumption 4 will necessarily hold; in this case, the result of the theorem follows with $\varphi(\mathbf{z};\boldsymbol{\eta},\delta)$ evaluated at $\overline{m}_{t}$ the limit of the estimator $\hat{m}_{t}$ , which may or may not equal the true pseudo-regression $m_{t}$ . If the propensity scores are estimated with correct parametric models, then Assumption 4 would only require a (uniformly) consistent estimator of $m_{t}$ , without any rate conditions.

Based on the result in Theorem 3, pointwise 95% confidence intervals for $\psi(\delta)$ can be constructed as

[TABLE]

where $\hat{\sigma}^{2}(\delta)=\mathbb{P}_{n}[\{\varphi(\mathbf{Z};\hat{\boldsymbol{\eta}}_{\text{-}S},\delta)-\hat{\psi}(\delta)\}^{2}]$ is the variance estimator given in the statement of the theorem. Uniform inference and testing is discussed in the next section.

4.4 Uniform Inference & Testing No Effect

In this section we present a multiplier bootstrap approach to obtaining uniform confidence bands for the incremental effect curve $\{\psi(\delta):\delta\in\mathcal{D}\}$ , along with a corresponding novel test of no treatment effect. This test can be useful in general causal inference problems, even when positivity assumptions are justified and even if incremental effects are not of particular interest.

To construct a $(1-\alpha)$ uniform confidence band of the form $\hat{\psi}(\delta)\pm c_{\alpha}\hat{\sigma}(\delta)/\sqrt{n}$ , as usual we need to find a critical value $c_{\alpha}$ that satisfies

[TABLE]

since the expression on the left is the probability that the band covers the true incremental effect curve $\psi(\delta)$ for all $\delta\in\mathcal{D}$ .

Based on the result of Theorem 3, this critical value can be obtained by approximating the distribution of the supremum of the Gaussian process $\{\mathbb{G}(\delta):\delta\in\mathcal{D}\}$ with covariance function as given in the statement of the theorem. We use the multiplier bootstrap [11, 41, 2] to approximate this distribution. A primary advantage of the multiplier bootstrap is its computational efficiency, since it does not require refitting the nuisance estimators, which can be expensive when there are many covariates and/or timepoints.

The idea behind the multiplier bootstrap is to approximate the distribution of the aforementioned supremum with the supremum of the multiplier process

[TABLE]

over draws of the multipliers $(\xi_{1},...,\xi_{n})$ (conditional on the sample data $\mathbf{Z}_{1},...,\mathbf{Z}_{n}$ ), which are iid random variables with mean zero and unit variance that are independent of the sample. Typically one uses either Gaussian or Rademacher multipliers (i.e., $\mathbb{P}(\xi=1)=\mathbb{P}(\xi=-1)=0.5$ ); we use Rademacher multipliers because they gave better performance in simulations. The next theorem states that this approximation works under the same assumptions from Theorem 3.

Theorem 4.

Let $\hat{c}_{\alpha}$ denote the $1-\alpha$ quantile (conditional on the data) of the supremum of the multiplier bootstrap process, i.e.,

[TABLE]

where $(\xi_{1},...,\xi_{n})$ are iid Rademacher random variables independent of the sample. Then, under the same conditions from Theorem 3,

[TABLE]

The proof of Theorem 4 is given in Section 8.5 of the Appendix, and follows by linking the multiplier bootstrap process to the same Gaussian process $\mathbb{G}$ to which the scaled estimator $\hat{\psi}(\delta)$ converges. As mentioned above, the multiplier bootstrap only requires simulating the multipliers $\xi$ and not re-estimating the nuisance functions, so it is straightforward and fast to implement. We include an implementation in the R function given in Section 8.6 of the Appendix, as well as in the npcausal R package available at GitHub (github.com/ehkennedy/npcausal).

Given the above uniform confidence band, we can test the null hypothesis of no incremental intervention effect

[TABLE]

by simply checking whether a $(1-\alpha)$ band contains a straight line over $\mathcal{D}$ . In other words we can compute a p-value as

[TABLE]

Note that the condition in the above set corresponds to failing to reject $H_{0}$ at level $\alpha$ , since there is space for a straight line between the smallest upper confidence limit and largest lower confidence limit. We will necessarily fail to reject at level $\alpha=0$ since this amounts to an infinitely wide confidence band, and the p-value is the largest $\alpha$ at which we fail to reject (i.e., the p-value is small if we reject even for wide bands, and large if we need to move to narrower bands or never reject).

Interestingly, the hypothesis we test above lies in a middle ground between Fisher’s null of no individual effect and Neyman’s null of no average effect. $H_{0}$ is a granular hypothesis perhaps closer to Fisher’s null than Neyman’s, but it can still be tested nonparametrically and in a longitudinal superpopulation framework. This is in contrast to common tests of Fisher’s null that operate under additive effect hypotheses and are limited to point exposures [29]. Thus tests of the null $H_{0}$ can be useful in general settings, independent of any interest in pursuing incremental intervention effects or avoiding positivity assumptions.

5 Illustrations

5.1 Simulation Study

Here we explore finite-sample properties via simulation, based on the simulation setup used by [14]. In particular we consider their model

[TABLE]

where the regression function is given by $\mu(\mathbf{x},a)=200+a\{10+13.7(2x_{1}+x_{2}+x_{3}+x_{4})\}$ . This simulation setup is known to yield variable propensity scores that can degrade the performance of weighting-based estimators.

We considered three estimators in our simulation: a plug-in estimator given by

[TABLE]

along with the inverse-probability-weighted (IPW) estimator and proposed efficient estimator described in Sections 4.1–4.2. We further considered four versions of each these estimators, depending on how the nuisance functions were estimated: correct parametric models, misspecified parametric models based on transformed covariates $\mathbf{X}^{*}$ (using the same covariate transformations as [14]), and nonparametric estimation (using original or transformed covariates). For nonparametric estimation we used the cross-validation-based Super Learner ensemble [38] to combine generalized additive models, multivariate adaptive regression splines, support vector machines, and random forests, along with parametric models (with and without interactions, and with terms selected stepwise via AIC). Regardless of estimator (plug-in, IPW, or proposed), for nonparametric nuisance estimation we used sample splitting as described in Section 4.2 with $K=2$ splits.

Estimator performance was assessed via integrated bias and root-mean-squared error

[TABLE]

across $J=500$ simulations and $I=100$ values of $\delta$ equally spaced (on the log scale) between $\exp(-2.3)\approx 0.1$ and $\exp(2.3)\approx 10$ . Results are given in Figure 2.

In each setting, the proposed estimator performed as well or better than the plug-in and IPW versions. When the nuisance functions were estimated with correct parametric models, all methods gave small bias and RMSE, with the plug-in and proposed estimators slightly outperforming the IPW estimator in terms of RMSE. Under parametric misspecification, bias and RMSE were amplified for all estimators and the plug-in fared worst. A more interesting (but expected) story appeared with nonparametric nuisance estimation. There, the plug-in and IPW estimators show large bias and RMSE, since they are not expected to converge at $\sqrt{n}$ rates; in contrast, the proposed efficient estimator essentially matches its behavior when constructed based on correct parametric models (with only a slight loss in RMSE). This is indicative of the fact that the proposed estimator only requires $n^{1/4}$ rates on nuisance estimation to achieve full efficiency and in general has second-order bias. This behavior appears to hold in our simulations even for nonparametric estimation using $\mathbf{X}^{*}$ , i.e., when the true model is not used directly.

We also assessed the uniform coverage of our proposed multiplier bootstrap confidence bands (as usual, we say a band covers if it contains the true curve entirely for all $\delta\in\mathcal{D}$ ). Results are given in Table 1. As expected, coverage is very poor when nuisance functions are estimated with misspecified parametric models. Coverage was near the nominal level (95%) in large samples as long as nuisance functions were estimated with correct parametric models or nonparametrically using the non-transformed covariates $\mathbf{X}$ (coverage was slightly diminished for nonparametric nuisance estimation based on the misspecified $\mathbf{X}^{*}$ ).

5.2 Application

Here we illustrate the use of incremental intervention effects with a reanalysis of the National Longitudinal Survey of Youth 1997 data used by [13], [1], and others to study the effects of incarceration on marriage. Incarceration is a colossal industry in the United States, with over 2.3 million people currently confined in a correctional facility and at least twice that number held on probation or parole [42]. There is a large literature on unintended effects of this mass incarceration, with numerous studies pointing to negative impacts on various aspects of employment, health, social ties, psychology, and more [21, 7]. Effects of incarceration on marriage are important since marriage is expected to yield, for example, better family and social support, better outcomes for children, and less recidivism, among other benefits [13, 7]. [1] were the first to study this question while specifically accounting for time-varying confounders, such as employment and earnings, and we refer there for more motivation and background.

The National Longitudinal Survey of Youth 1997 data consists of yearly measures across 14 timepoints, from 1997 to 2010, for participants who were 12–16 years old at the initial survey. The data include demographic information (e.g., age, race, gender, parent’s education), various delinquency indicators (e.g., age at first sex, measures of drug use and gang membership, delinquency scores), as well as numerous time-varying measures (e.g., employment, earnings, marriage and incarceration history). Following [1], we use the final 10 timepoints from 2001–2010, restrict the analysis to the 4781 individuals with a non-zero delinquency score at baseline, and use as outcome $Y$ the indicator of marriage at the end of the study (i.e., in 2010).

[1] used a standard marginal structural model approach to study effects of static incarceration trajectories, which has some limitations. First, it requires a parametric model to describe how incarceration trajectories affect marriage rates. In particular [1] used $\mathbb{E}(Y^{\overline{a}_{T}})=\text{expit}(\beta_{0}+\beta_{1}\sum_{t}a_{t})$ , which only allows marriage prevalence to depend on total time spent incarcerated. This kind of assumption is very common in practice but is quite restrictive, especially since a saturated structural model in this case would have $2^{10}=1024$ parameters, instead of only two. Hence the data only inform $2/1024=0.2\%$ of the possible parameter values. In fact if the model is slightly elaborated, e.g., to $\mathbb{E}(Y^{\overline{a}_{T}})=\text{expit}(\beta_{0}+\sum_{t}\beta_{1t}a_{t})$ so that $\beta_{1}$ can vary with time, then a standard weighting estimator fails and no coefficient estimates can be found. Another limitation is that [1] used parametric inverse probability weighting to estimate $(\beta_{0},\beta_{1})$ (partly for pedagogic purposes), but this is both inefficient and likely biased due to propensity score model misspecification. Perhaps most importantly, a standard marginal structural model setup requires imagining sending all or none of the study participants to prison at each time. However, positivity is likely violated here since some individuals may be necessarily incarcerated at some times (e.g., due to multiple-year sentences) or have essentially zero chance of incarceration (based on demographic or other characteristics). These limitations are not at all unique to the analysis of [1], but instead are common to many observational marginal structural model analyses; we build on their analysis by instead estimating incremental incarceration effects, which require neither any parametric models nor any positivity assumptions.

Specifically we estimated the incremental effect curve $\psi(\delta)$ , which in this setting represents the marriage prevalence at the end of the study if the odds of incarceration were multiplied by factor $\delta$ . We used Random Forests (via the ranger package in R) to estimate all nuisance functions $\pi_{t}$ and $m_{t}$ as described in Algorithm 1 (with $K=10$ -fold sample splitting), and computed pointwise and uniform confidence bands as in Sections 4.3 and 4.4 (with 10,000 bootstrap replications). Results are shown in Figure 3.

We find strong evidence (assuming no unmeasured confounding and consistency) that incarceration negatively impacts marriage rates. First, we reject the null hypothesis of no incremental effect of incarceration on marriage ( $p=0.049$ ) over the range $\delta\in[0.2,5]$ . More specifically we estimate that, if incarceration odds were increased proportionally for all individuals, marriage prevalence would drop from $\mathbb{P}_{n}(Y)=29.4\%$ observationally to 28.1% if the odds doubled (OR=0.94, 95% CI: 0.87–1.00), and to 23.6% if the odds were multiplied four-fold (OR=0.74, 95% CI: 0.59–0.91). Conversely, we estimate that marriage prevalence would only increase to 29.7% if the odds of incarceration were halved (OR=1.01, 95% CI: 0.95–1.08); the prevalence and odds ratio are the same if the odds were quartered. These results suggest that marriage rates might be more affected by increased rather than decreased incarceration (i.e., the curve in Figure 3 is nonlinear, with larger slope for $\delta>1$ ). This analysis provides considerably more nuance than a simple marginal structural model fit, and requires none of the parametric and positivity assumptions.

6 Discussion

In this paper we have proposed incremental intervention effects, which are based on shifting propensity scores rather than setting treatment values. We showed that these effects can be identified and estimated without any positivity or parametric assumptions, established general efficiency theory, and constructed influence-function-based estimators that yield fast rates of convergence even when based on flexible nonparametric regression tools. We also developed an approach for uniform inference and a new test of no treatment effect, and applied the methods in a longitudinal study of incarceration effects on marriage.

There are a few caveats to our developments that are worth mentioning. First, we expect incremental intervention effects to play a more descriptive than prescriptive role compared to other approaches. Specifically, they give an interpretable picture of what would happen if exposure were increased or decreased in a natural way, but will likely be less useful for informing specific treatment decisions. For example in our analysis from Section 5.2 the goal was to better understand the overall societal effects of mass incarceration; in cases where the goal is to learn how to best assign treatment, methods for optimal treatment regime estimation will likely be more relevant. However, note that it is certainly possible to estimate the optimal incremental regime $q_{t}(\mathbf{h}_{t};\delta^{*},\pi_{t})$ for $\delta^{*}=\operatorname*{arg\,max}_{\delta}\mathbb{E}(Y^{\mathbf{Q}(\delta)})$ ; so in theory incremental effects could be used to construct specific treatment decision rules.

Another caveat is that, in favor of computational efficiency, we have bypassed concerns about model compatibility when estimating the pseudo-regression functions $m_{t}$ . It can be difficult to formulate models for all the $m_{t}$ functions that are compatible with each other, since $m_{t}$ has a complicated dependence on $m_{t+1}$ (as well as the propensity scores $\pi_{t+1}$ and covariate densities). To make these estimators fully compatible, we would need to model the conditional densities of the (high-dimensional) covariates and construct $\hat{m}_{t}$ based on the non-recursive expression in Theorem 2. However, we feel that if flexible enough estimators for $m_{t}$ are used, then model incompatibility will likely not be a major concern in practice, particularly relative to the computational benefits. This issue also arises in estimating standard longitudinal causal effects [32, 20].

In future work we plan to pursue various extensions of incremental intervention effects. For example, it will be important to consider (i) interventions with increment parameters $\delta=\delta(\mathbf{h}_{t})$ that depend on time and past covariate history, (ii) estimation of how mean outcomes under different interventions vary with covariates (effect modification), (iii) extensions for settings with multivalued treatments and/or censored outcomes, and (iv) increment parameters based on risk ratios or other shifts, rather than odds ratios.

[1] Valerio Bacak and Edward H Kennedy

“Marginal structural models: An application to incarceration and marriage during young adulthood”

In Journal of Marriage and Family 77.1

Wiley Online Library, 2015, pp. 112–125

[2] Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov and Ying Wei

“Uniformly valid post-regularization confidence regions for many functional parameters in Z-estimation framework”

In arXiv preprint arXiv:1512.07619, 2015

[3] Peter J Bickel, Chris AJ Klaassen, Ya’acov Ritov and Jon A Wellner

“Efficient and Adaptive Estimation for Semiparametric Models”

Johns Hopkins University Press, 1993

[4] Lauren E Cain et al.

“When to start treatment? A systematic approach to the comparison of dynamic regimes using observational data”

In The International Journal of Biostatistics 6.2, 2010, pp. 1–24

[5] Victor Chernozhukov et al.

“Double machine learning for treatment and causal parameters”

In arXiv preprint arXiv:1608.00060, 2016, pp. 1–37

[6] Victor Chernozhukov, Denis Chetverikov and Kengo Kato

“Gaussian approximation of suprema of empirical processes”

In The Annals of Statistics 42.4

Institute of Mathematical Statistics, 2014, pp. 1564–1597

[7] Todd R Clear

“Imprisoning communities: How mass incarceration makes disadvantaged neighborhoods worse”

Oxford University Press, 2009

[8] Iván Díaz and Mark J van der Laan

“Assessing the causal effect of policies: an example using stochastic interventions”

In The International Journal of Biostatistics 9.2, 2013, pp. 161–174

[9] Iván Díaz and Mark J van der Laan

“Population intervention causal effects based on stochastic interventions”

In Biometrics 68.2

Wiley Online Library, 2012, pp. 541–549

[10] Miroslav Dudík, Dumitru Erhan, John Langford and Lihong Li

“Doubly robust policy evaluation and optimization”

In Statistical Science 29.4

Institute of Mathematical Statistics, 2014, pp. 485–511

[11] Evarist Giné and Joel Zinn

“Some limit theorems for empirical processes”

In The Annals of Probability 12.4

JSTOR, 1984, pp. 929–989

[12] S Haneuse and A Rotnitzky

“Estimation of the effect of interventions that modify the received treatment”

In Statistics in Medicine 32.30

Wiley Online Library, 2013, pp. 5260–5277

[13] Beth M Huebner

“The effect of incarceration on marriage and work over the life course”

In Justice Quarterly 22.3

Taylor & Francis, 2005, pp. 281–303

[14] Joseph DY Kang and Joseph L Schafer

“Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data”

In Statistical Science 22.4

Institute of Mathematical Statistics, 2007, pp. 523–539

[15] Edward H Kennedy

“Semiparametric theory and empirical processes in causal inference”

In Statistical Causal Inferences and Their Applications in Public Health Research

Springer, 2016, pp. 141–167

[16] Edward H Kennedy, Zongming Ma, Matthew D McHugh and Dylan S Small

“Nonparametric methods for doubly robust estimation of continuous treatment effects”

In Journal of the Royal Statistical Society: Series B 79.4, 2017, pp. 1229–1245

[17] Edward H Kennedy, Scott A Lorch and Dylan S Small

“Robust causal inference with continuous instruments using the local instrumental variable curve”

In arXiv preprint arXiv:1607.02566, 2016

[18] Kelly L Moore, Romain Neugebauer, Mark J van der Laan and Ira B Tager

“Causal inference in epidemiological studies with strong confounding”

In Statistics in Medicine 31.13

Wiley Online Library, 2012, pp. 1380–1404

[19] Susan A Murphy

“Optimal dynamic treatment regimes”

In Journal of the Royal Statistical Society: Series B 65.2

Wiley Online Library, 2003, pp. 331–355

[20] Susan A Murphy, Mark J van der Laan and James M Robins

“Marginal mean models for dynamic regimes”

In Journal of the American Statistical Association 96.456

Taylor & Francis, 2001, pp. 1410–1423

[21] Mary Pattillo, Bruce Western and David Weiman

“Imprisoning America: The social effects of mass incarceration”

Russell Sage Foundation, 2004

[22] Judea Pearl

“Causality: Models, Reasoning, & Inference”

Cambridge Univ. Press, 2009

[23] James M Robins

“A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect”

In Mathematical Modelling 7.9-12

Elsevier, 1986, pp. 1393–1512

[24] James M Robins

“Marginal structural models versus structural nested models as tools for causal inference”

In Statistical Models in Epidemiology, the Environment, and Clinical Trials

Springer, 2000, pp. 95–133

[25] James M Robins

“Optimal structural nested models for optimal sequential decisions”

In Proceedings of the Second Seattle Symposium in Biostatistics, 2004, pp. 189–326

Springer New York

[26] James M Robins, Miguel A Hernán and Uwe Siebert

“Effects of multiple interventions”

In Comparative Quantification of Health Risks

Citeseer, 2004, pp. 2191–2230

[27] James M Robins, Liliana Orellana and Andrea Rotnitzky

“Estimation and extrapolation of optimal treatment and testing strategies”

In Statistics in Medicine 27.23

Wiley Online Library, 2008, pp. 4678–4721

[28] James M Robins, Miguel Angel Hernán and Babette Brumback

“Marginal structural models and causal inference in epidemiology”

In Epidemiology 11.5

Lippincott Williams & Wilkins, 2000, pp. 550–560

[29] Paul R Rosenbaum

“Covariance adjustment in randomized experiments and observational studies (with discussion)”

In Statistical Science 17.3

Institute of Mathematical Statistics, 2002, pp. 286–327

[30] Paul R Rosenbaum and Donald B Rubin

“The central role of the propensity score in observational studies for causal effects”

In Biometrika 70.1

JSTOR, 1983, pp. 41–55

[31] Donald B Rubin

“Estimating causal effects of treatments in randomized and nonrandomized studies.”

In Journal of Educational Psychology 66.5

American Psychological Association, 1974, pp. 688–701

[32] Daniel O Scharfstein, Andrea Rotnitzky and James M Robins

“Adjusting for nonignorable drop-out using semiparametric nonresponse models”

In Journal of the American Statistical Association 94.448

Taylor & Francis, 1999, pp. 1096–1120

[33] Sarah L Taubman, James M Robins, Murray A Mittleman and Miguel A Hernán

“Intervening on risk factors for coronary heart disease: an application of the parametric g-formula”

In International Journal of Epidemiology 38.6

IEA, 2009, pp. 1599–1611

[34] Jin Tian

“Identifying dynamic sequential plans”

In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, 2008

[35] Anastasios A Tsiatis

“Semiparametric Theory and Missing Data”

Springer, 2006

[36] Mark J van der Laan and Maya L Petersen

“Causal effect models for realistic individualized treatment and intention to treat rules”

In The International Journal of Biostatistics 3.1, 2007, pp. 1–52

[37] Mark J van der Laan and James M Robins

“Unified Methods for Censored Longitudinal Data and Causality”

Springer, 2003

[38] Mark J van der Laan, Eric C Polley and Alan E Hubbard

“Super learner”

In Statistical Applications in Genetics and Molecular Biology 6.1, 2007, pp. 1–21

[39] Aad W van der Vaart

“Asymptotic Statistics”

Cambridge University Press, 2000

[40] Aad W van der Vaart

“Semiparametric statistics”

In In: Lectures on Probability Theory and Statistics

Springer, 2002, pp. 331–457

[41] Aad W van der Vaart and Jon A Wellner

“Weak Convergence and Empirical Processes”

Springer, 1996

[42] Peter Wagner and Bernadette Rabuy

“Mass incarceration: The whole pie 2016”

In Retrieved from the Prison Policy Initiative Website: www.prisonpolicy.org/reports/pie2016.html, 2016

[43] Jessica G Young et al.

“Comparative effectiveness of dynamic treatment regimes: an application of the parametric g-formula”

In Statistics in Biosciences 3.1

Springer, 2011, pp. 119–143

[44] Jessica G Young, Miguel A Hernán and James M Robins

“Identification, estimation and approximation of risk under interventions that depend on the natural value of treatment using observational data”

In Epidemiologic Methods 3.1, 2014, pp. 1–19

[45] Wenjing Zheng and Mark J van der Laan

“Asymptotic theory for cross-validated targeted maximum likelihood estimation”

In UC Berkeley Division of Biostatistics Working Paper Series Paper 273, 2010, pp. 1–58

8 Appendix

8.1 Proof of Theorem 1

First we give a useful identification result for general stochastic intervention effects.

Lemma 1.

Let $\mathbf{Q}=(Q_{1},...,Q_{T})$ denote a general stochastic intervention in which treatment at time $t$ is randomly assigned according to distribution function $Q_{t}(a_{t}\mid\mathbf{h}_{t})$ . Under Assumptions 1–2, and if (weak) positivity holds in the sense that

[TABLE]

then the mean outcome $\mathbb{E}(Y^{\mathbf{Q}})$ under the intervention is identified by

[TABLE]

where $\mathcal{A}=\mathcal{A}_{1}\times\dots\mathcal{A}_{T}$ and $\mathcal{X}=\mathcal{X}_{1}\dots\mathcal{X}_{T}$ .

Proof.

This essentially follows by the g-formula of [23]. Let underbars denote the future of a sequence so that for example $\underline{\mathbf{Q}}_{t}=(Q_{t},...,Q_{T})$ . Then we have the recursion

[TABLE]

for $t=1,...,T$ , where the first equality follows by iterated expectation, the second by definition, the third since $Q_{t}\perp\!\!\!\perp Y^{\mathbf{Q}}\mid\mathbf{H}_{t}$ (by definition) along with exchangeability (Assumption 2), and the fourth by simply rewriting the index $q_{t}$ as $a_{t}$ . The weak positivity condition is required so that the above outer expectation is well-defined (that the inner expectation may not be is fine since, by positivity, in such cases the multiplier $dQ_{t}$ will be zero).

Therefore applying the above $T$ times yields

[TABLE]

where the last equality follows by consistency (Assumption 1). ∎

Now Theorem 1 follows from Lemma 1, letting

[TABLE]

and noting that

[TABLE]

so that the weak positivity condition is automatically satisfied by our choice of $dQ_{t}$ .

8.2 Proof of Theorem 2

First we derive the efficient influence function for a general stochastic intervention effect when the intervention distribution $Q$ does not depend on the observed data distribution $\mathbb{P}$ .

Lemma 2.

Suppose $\mathbf{Q}$ is a known stochastic intervention not depending on $\mathbb{P}$ . Define

[TABLE]

for $t=0,...,T-1$ and $\mathcal{R}_{t}=(\mathcal{H}_{T}\times\mathcal{A}_{T})\setminus\mathcal{H}_{t}$ , and let $m_{T}(\mathbf{h}_{T},a_{T})=\mu(\mathbf{h}_{T},a_{T})$ and $m_{T+1}(\mathbf{h}_{T+1},a_{T+1})=Y$ . Then the efficient influence function for $\psi^{*}(\mathbf{Q})=m_{0}$ is

[TABLE]

where we define $dQ_{T+1}=1$ and $dQ_{0}(a_{0}\mid\mathbf{h}_{0})/d\mathbb{P}(a_{0}\mid\mathbf{h}_{0})=1$ .

Lemma 3.

Suppose $\mathbf{Q}$ depends on $\mathbb{P}$ , and let $\{\mathbbold{1}(\mathbf{H}_{t}=\mathbf{h}_{t})/d\mathbb{P}(\mathbf{h}_{t})\}\phi_{t}(\mathbf{H}_{t},A_{t};a_{t})$ denote the efficient influence function for $dQ_{t}(a_{t}\mid\mathbf{h}_{t})$ . Then the efficient influence for $\mathbb{E}(Y^{\mathbf{Q}})$ allowing $\mathbf{Q}$ to depend on $\mathbb{P}$ is given by

[TABLE]

where $\varphi^{*}(\mathbf{Q})$ denotes the efficient influence function from Lemma 2 under an intervention $\mathbf{Q}$ not depending on $\mathbb{P}$ , and $\nu$ is a dominating measure for the distribution of $A_{t}$ .

The proofs of Lemmas 2 and 3 are based on chain rule arguments stemming from the fact that the efficient influence function is a pathwise derivative. In particular (in a nonparametric model) the efficient influence function for parameter $\psi=\psi(\mathbb{P})$ is the function $\varphi(\mathbb{P})$ satisfying

[TABLE]

where $\{\mathbb{P}_{\epsilon}:\epsilon\in\mathbb{R}\}$ is a smooth parametric submodel with $\mathbb{P}_{\epsilon=0}=\mathbb{P}$ . We omit the proofs since they are lengthy and not particularly illuminating; however we plan to include them in a forthcoming paper on general stochastic interventions.

Lemma 4.

The efficient influence function for

[TABLE]

is given by $\{\mathbbold{1}(\mathbf{H}_{t}=\mathbf{h}_{t})/d\mathbb{P}(\mathbf{h}_{t})\}\phi_{t}(\mathbf{H}_{t},A_{t};a_{t})$ where $\phi_{t}(\mathbf{H}_{t},A_{t};a_{t})$ equals

[TABLE]

Proof.

This result also follows from the chain rule, together with the fact that the efficient influence function for $\pi_{t}$ is given by

[TABLE]

∎

8.3 Z-Estimator Algorithm

Algorithm 2 (Z-estimator algorithm).

For each $\delta$ :

Regress $A_{t}$ on $\mathbf{H}_{t}$ , obtain predicted values $\hat{\pi}_{t}(\mathbf{H}_{t})$ for each subject/time. 2. 2.

Construct time-dependent weights $W_{t}=\frac{\delta A_{t}+1-A_{t}}{\delta\hat{\pi}_{t}(\mathbf{H}_{t})+1-\hat{\pi}_{t}(\mathbf{H}_{t})}$ for each subject/time. 3. 3.

Calculate cumulative product weight $\widetilde{W}_{t}=\prod_{s=1}^{t}W_{s}$ for each subject/time. 4. 4.

For each time $t=T,T-1,...,1$ (starting with $R_{T+1}=Y$ ):

(a)

Regress $R_{t+1}$ on $(\mathbf{H}_{t},A_{t})$ , obtain predicted values $\hat{m}_{t}(\mathbf{H}_{t},1)$ and $\hat{m}_{t}(\mathbf{H}_{t},0)$ . 2. (b)

Construct pseudo-outcome $R_{t}=\frac{\delta\hat{\pi}_{t}(\mathbf{H}_{t})\hat{m}_{t}(\mathbf{H}_{t},1)+\{1-\hat{\pi}_{t}(\mathbf{H}_{t})\}\hat{m}_{t}(\mathbf{H}_{t},0)}{\delta\hat{\pi}_{t}(\mathbf{H}_{t})+1-\hat{\pi}_{t}(\mathbf{H}_{t})}$ . 5. 5.

For each subject compute $\varphi=\widetilde{W}_{T}Y+\sum_{t}\widetilde{W}_{t}V_{t}R_{t}$ where $V_{t}=\frac{A_{t}\{1-\hat{\pi}_{t}(\mathbf{H}_{t})\}-(1-A_{t})\delta\hat{\pi}_{t}(\mathbf{H}_{t})}{\delta/(1-\delta)}$ . 6. 6.

Set $\hat{\psi}^{*}(\delta)$ to be the average of the $\varphi$ values across subjects.

8.4 Proof of Theorem 3

Let $\|f\|_{\mathcal{D}}=\sup_{\delta\in\mathcal{D}}|f(\delta)|$ denote the supremum norm over $\mathcal{D}$ , and define the processes

[TABLE]

where $\mathbb{G}_{n}=\sqrt{n}(\mathbb{P}_{n}-\mathbb{P})$ is the empirical process on the full sample process as usual. Also let $\mathbb{G}(\delta)$ denote the mean-zero Gaussian process with covariance $\mathbb{E}\{\widetilde{\varphi}(\mathbf{Z};\boldsymbol{\eta},\delta_{1})\widetilde{\varphi}(\mathbf{Z};\boldsymbol{\eta},\delta_{2})\}$ as in the main text.

In this proof we will show that

[TABLE]

which yields the desired result. The first statement will be true if the influence function $\widetilde{\varphi}$ is a smooth enough function of $\delta$ , and the second if the nuisance estimators $\hat{\boldsymbol{\eta}}$ are consistent and converging at a sufficiently fast rate.

The first statement follows since the function class $\mathcal{F}_{\boldsymbol{\overline{\eta}}}=\{\varphi(\cdot;\boldsymbol{\overline{\eta}},\delta):\delta\in\mathcal{D}\}$ is Lipschitz and thus has a finite bracketing integral for any fixed $\boldsymbol{\overline{\eta}}$ . Recall the $L_{2}(\mathbb{P})$ bracketing integral of class $\mathcal{F}$ with envelope $F$ is given by

[TABLE]

where $N_{[\ \!]}(\epsilon,\mathcal{F},L_{2}(\mathbb{P}))$ is the $L_{2}(\mathbb{P})$ bracketing number, i.e., the minimum number of $\epsilon$ brackets in $L_{2}(\mathbb{P})$ needed to cover the class $\mathcal{F}$ with envelope function $F$ . That $\mathcal{F}_{\boldsymbol{\overline{\eta}}}$ is Lipschitz (and thus the bracketing integral is finite) follows from the fact that $\varphi$ is a sum of products of Lipschitz functions and $\mathcal{D}$ is bounded. We show this by showing that the corresponding derivatives are all bounded, specifically

[TABLE]

where we used the fact that, for all $0\leq\pi_{t}(\mathbf{h}_{t})\leq 1$ , we have

[TABLE]

Therefore $\Psi_{n}(\cdot)\rightsquigarrow\mathbb{G}(\cdot)$ since a function class with finite bracketing integral is necessarily Donsker (e.g., Theorem 2.5.6 in [41]).

Now we consider the second statement, that $\|\widehat{\Psi}_{n}-\Psi_{n}\|_{\mathcal{D}}=o_{\mathbb{P}}(1)$ . First note that

[TABLE]

where the last inequality follows since $\|\hat{\sigma}/\sigma-1\|_{\mathcal{D}}=o_{\mathbb{P}}(1)$ by Assumption 3 of Theorem 3, and $\|\Psi_{n}\|_{\mathcal{D}}=O_{\mathbb{P}}(1)$ follows from, e.g., Theorem 2.14.2 in [41], since the function class $\mathcal{F}_{\boldsymbol{\eta}}$ has finite bracketing integral as shown above.

Now let $N=n/K$ be the sample size in any group $k=1,...,K$ , and denote the empirical process over group $k$ units by $\mathbb{G}_{n}^{k}=\sqrt{N}(\mathbb{P}_{n}^{k}-\mathbb{P})$ . Then we have

[TABLE]

where the first two equalities follow by definition, and the third by rearranging and noting that $\psi(\delta)=\mathbb{P}\{\varphi(\mathbf{Z};\boldsymbol{\eta},\delta)\}$ and $\sum_{k}\mathbb{P}_{n}^{k}\{\varphi(\mathbf{Z};\boldsymbol{\eta},\delta)\}=\sum_{k}\mathbb{P}_{n}\{\varphi(\mathbf{Z};\boldsymbol{\eta},\delta)\}$ . Now we will analyze the two pieces $B_{n,1}$ and $B_{n,2}$ in turn; showing that their supremum norms are both $o_{\mathbb{P}}(1)$ completes the proof.

For $B_{n,1}$ , we have by the triangle inequality and since $K$ is fixed (independent of total sample size $n$ ), that

[TABLE]

where $\mathcal{F}_{n}^{k}=\mathcal{F}_{\hat{\boldsymbol{\eta}}_{\text{-}k}}-\mathcal{F}_{\boldsymbol{\eta}}$ for the function class $\mathcal{F}_{\boldsymbol{\eta}}=\{\varphi(\cdot;\boldsymbol{\eta},\delta):\delta\in\mathcal{D}\}$ from before. Viewing the nuisance functions $\hat{\boldsymbol{\eta}}_{\text{-}k}$ as fixed given the training data $\mathbf{D}_{0}^{k}=\{\mathbf{Z}_{i}:S_{i}\neq k\}$ , we can apply Theorem 2.14.2 in [41] to obtain

[TABLE]

for envelope $F_{n}^{k}$ . If we take $F_{n}^{k}(\mathbf{z})=\sup_{\delta\in\mathcal{D}}|\varphi(\mathbf{z};\hat{\boldsymbol{\eta}}_{\text{-}k},\delta)-\varphi(\mathbf{z};\boldsymbol{\eta},\delta)|$ then the first term $\|F_{n}^{k}\|$ in the product above is $o_{\mathbb{P}}(1)$ . Although the bracketing integral is finite for any fixed $\boldsymbol{\eta}$ , here the function class depends on $n$ through $\hat{\boldsymbol{\eta}}_{\text{-}k}$ so we need a more careful analysis.

Specifically, since $\mathcal{F}_{n}^{k}$ is Lipschitz, by Theorem 2.7.2 of [41] we have

[TABLE]

Therefore, letting $C_{n}^{k}=\|F_{n}^{k}\|$ ,

[TABLE]

which tends to zero as $C_{n}^{k}\rightarrow 0$ . Hence $\sup_{f\in\mathcal{F}_{n}^{k}}\left|\mathbb{G}_{n}(f)\right|=o_{\mathbb{P}}(1)$ for each $k$ , and since there are only finitely many splits $K$ , we have

[TABLE]

To analyze $B_{n,2}(\delta)$ we require some new notation, and at first we typically suppress any dependence on $\delta$ for simplicity. Let $\psi(\mathbb{P};Q)$ denote the mean outcome under intervention $Q$ for a population corresponding to observed data distribution $\mathbb{P}$ , and let $\varphi^{*}(\mathbf{z};\boldsymbol{\eta})$ denote its (centered) efficient influence function when $Q$ does not depend on $\mathbb{P}$ , as given in Lemma 2, which depends on nuisance functions $\boldsymbol{\eta}=(\mathbf{m},\boldsymbol{\pi})=(m_{0},m_{1},...,m_{T},\pi_{1},\pi_{2},...,\pi_{T})$ . Similarly let $\zeta(\mathbf{z};\boldsymbol{\eta})$ denote the contribution to the efficient influence function $\varphi^{*}(\mathbf{z};\boldsymbol{\eta})$ due to estimating $Q$ when it depends on $\mathbb{P}$ , as given in Lemma 3. Then by definition

[TABLE]

Hence, for any $\boldsymbol{\overline{\eta}}$ we can write $(1/\sqrt{n})B_{n,2}(\delta)$ as

[TABLE]

where the first equality follows by definition and the second by rearranging.

In the following lemmas we analyze these two components of the remainder term $B_{n,2}(\delta)$ . Our results keep the intervention distribution $Q$ completely general, and so can be applied to study other stochastic interventions, beyond those we focus on in this paper of the incremental propensity score variety.

Lemma 5.

Let $\psi(\mathbb{P};Q)$ denote the mean outcome under intervention $Q$ for a population corresponding to observed data distribution $\mathbb{P}$ , and let $\varphi^{*}(\mathbf{z};\boldsymbol{\eta})$ denote its efficient influence function when $Q$ does not depend on $\mathbb{P}$ , as given in Lemma 2, which depends on nuisance functions $\boldsymbol{\eta}=(\mathbf{m},\boldsymbol{\pi})=(m_{0},m_{1},...,m_{T},\pi_{1},\pi_{2},...,\pi_{T})$ . Then for two distributions $\mathbb{P}$ and $\overline{\mathbb{P}}$ (the latter with corresponding nuisance functions $\boldsymbol{\overline{\eta}}$ ) we have the expansion

[TABLE]

where we define

[TABLE]

Proof.

First note that

[TABLE]

where the first equality follows by definition, the second by iterated expectation (conditioning on $(\mathbf{H}_{t},A_{t})$ and averaging over $\mathbf{X}_{t+1}$ ), the third by definition of $m_{t}^{*}$ , and the fourth by repeated iterated expectation. Now we have

[TABLE]

where the first equality follows by adding and subtracting the second term in the sum (and separating the $t=0$ term), and the second follows by repeating this process $t$ times (where we use the convention that quantities at negative times like $dQ_{-1}$ are set to one). The last terms in the last line above are a telescoping sum since

[TABLE]

Therefore the result follows after rearranging and noting $\psi_{Q}^{*}(\mathbb{P})=m_{0}$ and $\psi_{Q}^{*}(\overline{\mathbb{P}})=\overline{m}_{0}$ . ∎

Lemma 6.

Using the same notation as in Lemma 5, let $\zeta(\mathbf{Z};\boldsymbol{\eta})$ denote the contribution to the efficient influence function $\varphi^{*}(\mathbf{Z};\boldsymbol{\eta})$ as given in Lemma 3. Then for two intervention distributions $Q$ and $\overline{Q}$ (assumed to have densities $dQ_{t}$ and $d\overline{Q}_{t}$ , respectively, for $t=1,...,T$ , with respect to some dominating measure) we have the expansion

[TABLE]

Proof.

First note that

[TABLE]

where the first equality follows by definition, the second by adding and subtracting the last term, the third by definition of $m_{t}$ , and the fourth by repeating this process $T$ times.

Now we have that the expected contribution to the influence function due to estimating $Q$ when it depends on $\mathbb{P}$ is

[TABLE]

where the first equality follows by iterated expectation, the second by adding and subtracting the second term in the sum, and the third by the same logic as in Lemma 5.

Now considering the last term in the above display plus $\psi(\mathbb{P};\overline{Q})-\psi(\mathbb{P};Q)$ we have

[TABLE]

which yields the result. ∎

Now we need to translate the remainder terms from Lemmas 5 and 6 to the incremental propensity score intervention setting. The remainder from Lemma 5 equals

[TABLE]

where the last inequality follows since

[TABLE]

For the remainder from Lemma 6 first note that

[TABLE]

where we used the form of the efficient influence function derived in Lemma 4. Combining the two previous expressions gives

[TABLE]

Thus the remainder from Lemma 6 is

[TABLE]

The condition given in Theorem 3, that for $s\leq t\leq T$ we have

[TABLE]

therefore ensures that the above remainders from Lemmas 5 and 6 are negligible up to order $n^{-1/2}$ uniformly in $\delta$ . Therefore $\|B_{n,2}\|_{\mathcal{D}}=o_{\mathbb{P}}(1)$ , which concludes the proof.

8.5 Proof of Theorem 4

As in the proof of Theorem 3, let $\|f\|_{\mathcal{D}}=\sup_{\delta\in\mathcal{D}}|f(\delta)|$ denote the supremum norm with respect to $\delta$ , and define the processes

[TABLE]

Note that the star superscripts denote multiplier bootstrap processes. As before, let $\mathbb{G}(\delta)$ denote the mean-zero Gaussian process with covariance $\mathbb{E}\{\widetilde{\varphi}(\mathbf{Z};\boldsymbol{\eta},\delta_{1})\widetilde{\varphi}(\mathbf{Z};\boldsymbol{\eta},\delta_{2})\}$ .

Since

[TABLE]

the result of Theorem 4 requires that we show

[TABLE]

which yields the desired result since $\mathbb{P}(\|\widehat{\Psi}^{*}_{n}\|_{\mathcal{D}}\leq\hat{c}_{\alpha})=1-\alpha$ by definition of $\hat{c}_{\alpha}$ .

We showed in the proof of Theorem 3 that $\|\widehat{\Psi}_{n}-\Psi_{n}\|_{\mathcal{D}}=o_{\mathbb{P}}(1)$ which implies that $|\|\widehat{\Psi}_{n}\|_{\mathcal{D}}-\|\Psi_{n}\|_{\mathcal{D}}|=o_{\mathbb{P}}(1)$ , and by Corollary 2.2 in [6] we have

[TABLE]

Hence by Lemma 2.3 in [6] it follows that

[TABLE]

Similarly, by Corollary 2.2 of [2] we have

[TABLE]

so that again by Lemma 2.3 in [6]

[TABLE]

This yields the result, since $|\mathbb{P}(\|\widehat{\Psi}_{n}\|_{\mathcal{D}}\leq\hat{c}_{\alpha})-\mathbb{P}(\|\widehat{\Psi}^{*}_{n}\|_{\mathcal{D}}\leq\hat{c}_{\alpha})|$ is bounded above by

[TABLE]

8.6 R Code

this function requires the following inputs:

dat: dataframe (in long not wide form if longitudinal) with columns

‘time’, ‘id’, outcome ‘y’, treatment ‘a’

x.trt: covariate matrix for treatment regression

x.out: covariate matrix for outcome regression

delta.seq: sequence of delta values

nsplits: number of sample splits

NOTE: dat, x.trt, x.out should all have the same number of rows

ipsi <- function(dat, x.trt, x.out, delta.seq, nsplits){

setup storage

ntimes <- length(table(dat $time)); n <- length(unique(dat$ id)) k <- length(delta.seq); ifvals <- matrix(nrow=n,ncol=k); est.eff <- rep(NA,k) wt <- matrix(nrow=nntimes,ncol=k); cumwt <- matrix(nrow=nntimes,ncol=k) rt <- matrix(nrow=nntimes,ncol=k); vt <- matrix(nrow=nntimes,ncol=k)

s <- sample(rep(1:nsplits,ceiling(n/nsplits))[1:n]) slong <- rep(s,rep(ntimes,n))

for (split in 1:nsplits){ print(paste("split",split)); flush.console()

fit treatment model

trtmod <- ranger(a ~ ., dat=cbind(x.trt,a=dat $a)[slong!=split,]) dat$ ps <- predict(trtmod, data=x.trt)$predictions

for (j in 1:k){ print(paste("delta",j)); flush.console() delta <- delta.seq[j]

compute weights

wt[,j] <- (deltadat $a + 1-dat$ a)/(deltadat $ps + 1-dat$ ps) cumwt[,j] <- as.numeric(t(aggregate(wt[,j],by=list(dat $id),cumprod)[,-1])) vt[,j] <- (1-delta)*(dat$ a*(1-dat $ps) - (1-dat$ a)deltadat$ps)/delta

fit outcome models

outmod <- vector("list",ntimes); rtp1 <- dat $y[dat$ time==end] print("fitting regressions"); flush.console() for (i in 1:ntimes){ t <- rev(unique(dat $time))[i] outmod[[i]] <- ranger(rtp1 ~ ., dat=cbind(x.out,rtp1)[dat$ time==t & slong!=split,]) newx1 <- x.out[dat $time==t,]; newx1$ a <- 1 m1 <- predict(outmod[[i]], data=newx1) $predictions newx0 <- x.out[dat$ time==t,]; newx0 $a <- 0 m0 <- predict(outmod[[i]], data=newx0)$ predictions pi.t <- dat $ps[dat$ time==t] rtp1 <- (deltapi.tm1 + (1-pi.t)m0) / (deltapi.t + 1-pi.t) rt[dat$time==t,j] <- rtp1 }

ifvals[s==split,j] <- ((cumwt[,j]*dat $y)[dat$ time==end] + aggregate(cumwt[,j]*vt[,j]*rt[,j],by=list(dat$id),sum)[,-1])[s==split]

} }

compute estimator

for (j in 1:k){ est.eff[j] <- mean(ifvals[,j]) }

compute asymptotic variance

sigma <- sqrt(apply(ifvals,2,var)) eff.ll <- est.eff-1.96sigma/sqrt(n); eff.ul <- est.eff+1.96sigma/sqrt(n)

multiplier bootstrap

eff.mat <- matrix(rep(est.eff,n),nrow=n,byrow=T) sig.mat <- matrix(rep(sigma,n),nrow=n,byrow=T) ifvals2 <- (ifvals-eff.mat)/sig.mat nbs <- 10000; mult <- matrix(2rbinom(nnbs,1,.5)-1,nrow=n,ncol=nbs) maxvals <- sapply(1:nbs, function(col){ max(abs(apply(mult[,col]ifvals2,2,sum)/sqrt(n))) } ) calpha <- quantile(maxvals, 0.95) eff.ll2 <- est.eff-calphasigma/sqrt(n); eff.ul2 <- est.eff+calpha*sigma/sqrt(n)

return(list(est=est.eff, sigma=sigma, ll1=eff.ll,ul1=eff.ul, calpha=calpha, ll2=eff.ll2,ul2=eff.ul2))

}

Bibliography45

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Valerio Bacak and Edward H Kennedy “Marginal structural models: An application to incarceration and marriage during young adulthood” In Journal of Marriage and Family 77.1 Wiley Online Library, 2015, pp. 112–125
2[2] Alexandre Belloni, Victor Chernozhukov, Denis Chetverikov and Ying Wei “Uniformly valid post-regularization confidence regions for many functional parameters in Z-estimation framework” In ar Xiv preprint ar Xiv:1512.07619 , 2015
3[3] Peter J Bickel, Chris AJ Klaassen, Ya’acov Ritov and Jon A Wellner “Efficient and Adaptive Estimation for Semiparametric Models” Johns Hopkins University Press, 1993
4[4] Lauren E Cain et al. “When to start treatment? A systematic approach to the comparison of dynamic regimes using observational data” In The International Journal of Biostatistics 6.2 , 2010, pp. 1–24
5[5] Victor Chernozhukov et al. “Double machine learning for treatment and causal parameters” In ar Xiv preprint ar Xiv:1608.00060 , 2016, pp. 1–37
6[6] Victor Chernozhukov, Denis Chetverikov and Kengo Kato “Gaussian approximation of suprema of empirical processes” In The Annals of Statistics 42.4 Institute of Mathematical Statistics, 2014, pp. 1564–1597
7[7] Todd R Clear “Imprisoning communities: How mass incarceration makes disadvantaged neighborhoods worse” Oxford University Press, 2009
8[8] Iván Díaz and Mark J van der Laan “Assessing the causal effect of policies: an example using stochastic interventions” In The International Journal of Biostatistics 9.2 , 2013, pp. 161–174

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Nonparametric causal effects based on incremental propensity score interventions

Abstract

1 Introduction

2 Notation & Setup

Remark 1*.*

3 Incremental Intervention Theory

3.1 Proposed Interventions

3.2 Identification

Assumption 1* (Consistency).*

Assumption 2* (Exchangeability).*

Theorem 1**.**

Corollary 1**.**

3.3 Efficiency Theory

Theorem 2**.**

Remark 2*.*

Corollary 2**.**

4 Estimation & Inference

4.1 Simple Estimators

4.2 Proposed Estimator

Algorithm 1**.**

4.3 Weak Convergence

Theorem 3**.**

4.4 Uniform Inference & Testing No Effect

Theorem 4**.**

5 Illustrations

5.1 Simulation Study

5.2 Application

6 Discussion

8 Appendix

8.1 Proof of Theorem 1

Lemma 1**.**

Proof.

8.2 Proof of Theorem 2

Lemma 2**.**

Lemma 3**.**

Lemma 4**.**

Proof.

8.3 Z-Estimator Algorithm

Algorithm 2** (Z-estimator algorithm).**

8.4 Proof of Theorem 3

Lemma 5**.**

Proof.

Lemma 6**.**

Proof.

8.5 Proof of Theorem 4

8.6 R Code

this function requires the following inputs:

dat: dataframe (in long not wide form if longitudinal) with columns

‘time’, ‘id’, outcome ‘y’, treatment ‘a’

x.trt: covariate matrix for treatment regression

x.out: covariate matrix for outcome regression

delta.seq: sequence of delta values

nsplits: number of sample splits

NOTE: dat, x.trt, x.out should all have the same number of rows

setup storage

fit treatment model

compute weights

fit outcome models

compute estimator

compute asymptotic variance

multiplier bootstrap

*Remark 1**.*

*Assumption 1** (Consistency).*

*Assumption 2** (Exchangeability).*

Theorem 1.

Corollary 1.

Theorem 2.

*Remark 2**.*

Corollary 2.

Algorithm 1.

Theorem 3.

Theorem 4.

Lemma 1.

Lemma 2.

Lemma 3.

Lemma 4.

Algorithm 2 (Z-estimator algorithm).

Lemma 5.

Lemma 6.