Non-separable Models with High-dimensional Data

Liangjun Su; Takuya Ura; Yichong Zhang

arXiv:1702.04625·stat.ME·March 7, 2019


TL;DR

This paper introduces a three-step estimation method for non-separable models with high-dimensional control variables, enabling the estimation of various treatment effects with theoretical guarantees and practical validation.

Contribution

It develops a novel three-step estimation procedure for high-dimensional non-separable models with continuous treatments, including inference methods and finite sample performance analysis.

Findings

1. Estimators perform well in finite samples.
2. The method effectively handles high-dimensional control variables.
3. Asymptotic properties are established for the estimators.


Equations

Y = Γ (T, X, A),

c_{f}=\int_{\{(x,a):\Gamma(t,x,a)=q_{\tau}(t)\}}\frac{f_{(X,A)}(x,a)}{\|\nabla_{(x,a)}\Gamma(t,\cdot,\cdot)\|}dxda.

\int g(s)d_{t}(s)ds=g(t).

\mathbb{E}(Y(t))=\mathbb{E}\biggl{(}\frac{Yd_{t}(T)}{f_{t}(X)}\biggr{)}\quad\text{and}\quad\mathbb{E}(Y_{u}(t))=\mathbb{E}\biggl{(}\frac{Y_{u}d_{t}(T)}{f_{t}(X)}\biggr{)}.

\mathbb{E}(Y(t))=\mathbb{E}\biggl{[}\biggl{(}\frac{(Y-\nu_{t}(X))d_{t}(T)}{f_{t}(X)}\biggr{)}+\nu_{t}(X)\biggr{]}

\mathbb{E}(Y_{u}(t))=\mathbb{E}\biggl{[}\biggl{(}\frac{(Y_{u}-\phi_{t,u}(X))d_{t}(T)}{f_{t}(X)}\biggr{)}+\phi_{t,u}(X)\biggr{]}.

\hat{\mu}(t)=\frac{1}{n}\sum_{i=1}^{n}\biggl{[}\biggl{(}\frac{(Y_{i}-\widehat{\nu}_{t}(X_{i}))}{\hat{f}_{t}(X_{i})h_{2}}K(\frac{T_{i}-t}{h_{2}})\biggr{)}+\widehat{\nu}_{t}(X_{i})\biggr{]}

\hat{\alpha}(t,u)=\frac{1}{n}\sum_{i=1}^{n}\biggl{[}\biggl{(}\frac{(1\{Y_{i}\leq u\}-\widehat{\phi}_{t,u}(X_{i}))}{\hat{f}_{t}(X_{i})h_{2}}K(\frac{T_{i}-t}{h_{2}})\biggr{)}+\widehat{\phi}_{t,u}(X_{i})\biggr{]},\quad\text{respectively,}

\hat{\gamma}_{t}=\arg\min_{\gamma}\frac{1}{2n}\sum_{i=1}^{n}(Y_{i}-b(X_{i})^{\prime}\gamma)^{2}K(\frac{T_{i}-t}{h_{1}})+\frac{\lambda}{n}||\Xi_{t}\gamma||_{1},

\hat{\theta}_{t,u}=\arg\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}M(1\{Y_{i}\leq u\},X_{i};\theta)K(\frac{T_{i}-t}{h_{1}})+\frac{\lambda}{n}||\Psi_{t,u}\theta||_{1},

\Phi^{-1}(1-\gamma/p)\sim[\log(1/C)+\log(p)+\log(n)+\log(\log(n))]^{1/2}\sim\log(p\vee n).

\tilde{l}_{t,0,j}=\biggl{|}\biggl{|}(Y-\nu_{t}(X))b_{j}(X)K(\frac{T-t}{h_{1}})h_{1}^{-1/2}\biggr{|}\biggr{|}_{\mathbb{P}_{n},2}

l_{t,u,0,j}=\biggl{|}\biggl{|}(Y_{u}-\phi_{t,u}(X))b_{j}(X)K(\frac{T-t}{h_{1}})h_{1}^{-1/2}\biggr{|}\biggr{|}_{\mathbb{P}_{n},2},

\tilde{l}_{t,j}^{k}=\biggl{|}\biggl{|}(Y-\widehat{\nu}_{t}^{k-1}(X))b_{j}(X)K(\frac{T-t}{h_{1}})h_{1}^{-1/2}\biggr{|}\biggr{|}_{\mathbb{P}_{n},2}

l_{t,u,j}^{k}=\biggl{|}\biggl{|}(Y_{u}-\widehat{\phi}_{t,u}^{k-1}(X))b_{j}(X)K(\frac{T-t}{h_{1}})h_{1}^{-1/2}\biggr{|}\biggr{|}_{\mathbb{P}_{n},2}.

\tilde{\gamma}_{t}\in\arg\min_{\gamma}\sum_{i=1}^{n}(Y_{i}-b(X_{i})^{\prime}\gamma)^{2}K(\frac{T_{i}-t}{h_{1}}),\quad s.t.\quad\text{Supp}(\gamma)\in S_{t}^{\mu},

\tilde{\theta}_{t,u}\in\arg\min_{\theta}\sum_{i=1}^{n}M(1\{Y_{i}\leq u\},X_{i};\theta)K(\frac{T_{i}-t}{h_{1}}),\quad s.t.\quad\text{Supp}(\theta)\in S_{t,u}.

\hat{\beta}_{t}=\arg\min_{\beta}\frac{1}{n}\sum_{i=1}^{n}M(1\{T_{i}\leq t\},X_{i};\beta)+\frac{\tilde{\lambda}}{n}||\hat{\Psi}_{t}\beta||_{1}\quad\text{and}\quad\hat{F}_{t}(x)=\Lambda(b(x)^{\prime}\hat{\beta}_{t}),

\tilde{\lambda}=1.1\Phi^{-1}(1-\gamma/\{p\vee nh_{1}\})n^{1/2}

l_{t,j}^{k}=\biggl{|}\biggl{|}\biggl{(}1\{T\leq t\}-\hat{F}_{t}^{k-1}(X)\biggr{)}b_{j}(X)\biggr{|}\biggr{|}_{\mathbb{P}_{n},2}.

\hat{f}_{t}(X)=\frac{\hat{F}_{t+h_{1}}(X)-\hat{F}_{t-h_{1}}(X)}{2h_{1}},

\sup_{(t,u)\in\mathcal{TU}}\Bigl[||r_{t,u}^{\nu}(X)K(\frac{T-t}{h_{1}})^{1/2}||_{\mathbb{P}_{n},2}+||r_{t,u}^{\phi}(X)K(\frac{T-t}{h_{1}})^{1/2}||_{\mathbb{P}_{n},2}\Bigr]=O_{p}((s\log(p\vee n)/n)^{1/2}).

\sup_{(t,u)\in\mathcal{TU}}\Bigl[||r_{t,u}^{\nu}(X)||_{\mathbb{P},\infty}+||r_{t,u}^{\phi}(X)||_{\mathbb{P},\infty}\Bigr]=O((\log(p\vee n)s^{2}\zeta_{n}^{2}/(nh_{1}))^{1/2}).

\int uK(u)du=0,\quad\text{and}\quad\kappa_{2}:=\int u^{2}K(u)du<\infty.

0<\kappa^{\prime}\leq\inf_{\delta\neq 0,||\delta||_{0}\leq s\ell_{n}}\frac{||b(X)^{\prime}\delta||_{\mathbb{P}_{n},2}}{||\delta||_{2}}\leq\sup_{\delta\neq 0,||\delta||_{0}\leq s\ell_{n}}\frac{||b(X)^{\prime}\delta||_{\mathbb{P}_{n},2}}{||\delta||_{2}}\leq\kappa^{\prime\prime}<\infty.

\inf_{\delta\neq 0,||\delta||_{0}\leq s\ell_{n}}\frac{||b(X)^{\prime}\delta||_{\mathbb{P}_{n},2}}{||\delta||_{2}}\quad\text{and}\quad\sup_{\delta\neq 0,||\delta||_{0}\leq s\ell_{n}}\frac{||b(X)^{\prime}\delta||_{\mathbb{P}_{n},2}}{||\delta||_{2}}

\inf_{(t,u)\in\mathcal{TU}}\inf_{\delta\in\Delta_{c,t,u}}\frac{||b(X)^{\prime}\delta K(\frac{T-t}{h_{1}})^{1/2}||_{\mathbb{P}_{n},2}}{||\delta_{S_{t,u}}||_{2}h_{1}}\geq\underline{\kappa}.

\inf_{(t,u)\in\mathcal{TU}}\inf_{\delta\in\Delta_{c,t,u}}\frac{s||b(X)^{\prime}\delta K(\frac{T-t}{h_{1}})^{1/2}||_{\mathbb{P}_{n},2}}{||\delta_{S_{t,u}}||_{1}h_{1}}\geq\underline{\kappa}.

\inf_{(t,u)\in\mathcal{TU}}\inf_{\delta\in\Delta_{c,t,u}}\frac{s||b(X)^{\prime}\delta||_{\mathbb{P}_{n},2}}{||\delta_{S_{t,u}}||_{1}}\geq\inf_{(t,u)\in\mathcal{TU}}\inf_{\delta\in\Delta_{c,t,u}}\frac{||b(X)^{\prime}\delta||_{\mathbb{P}_{n},2}}{||\delta_{S_{t,u}}||_{2}}\geq\underline{\kappa},

\sup_{t\in\mathcal{T}}||(\widehat{\nu}_{t}(X)-\nu_{t}(X))||_{\mathbb{P}_{n},2}=O_{p}(\ell_{n}(\log(p\vee n)s)^{1/2}(nh_{1})^{-1/2}),

Taxonomy

Topics: Statistical Methods and Inference · Advanced Causal Inference Techniques · Statistical Methods and Bayesian Inference

Full text

Non-separable Models with High-dimensional Data

First draft: February, 2017. We are grateful to Alex Belloni, Xavier D’Haultfœuille, Michael Qingliang Fan, Bryan Graham, Yu-Chin Hsu, Yuya Sasaki, and seminar participants at Academia Sinica, Duke, Asian Meeting of the Econometric Society, China Meeting of the Econometric Society, and the 7th Shanghai Workshop of Econometrics. Su acknowledges the funding support provided by the Lee Kong Chian Fund for Excellence.

Liangjun Su

School of Economics, Singapore Management University, 90 Stamford Road, Singapore 178903. E-mail: [email protected].

Takuya Ura

Department of Economics, University of California, Davis. One Shields Avenue, Davis, CA 95616. E-mail: [email protected].

Yichong Zhang

School of Economics, Singapore Management University, 90 Stamford Road, Singapore 178903. E-mail: [email protected].

Abstract

This paper studies non-separable models with a continuous treatment when the dimension of the control variables is high and potentially larger than the effective sample size. We propose a three-step estimation procedure to estimate the average, quantile, and marginal treatment effects. In the first stage we estimate the conditional mean, distribution, and density objects by penalized local least squares, penalized local maximum likelihood estimation, and numerical differentiation, respectively, where control variables are selected via a localized method of $L_{1}$-penalization at each value of the continuous treatment. In the second stage we estimate the average and marginal distribution of the potential outcome via the plug-in principle. In the third stage, we estimate the quantile and marginal treatment effects by inverting the estimated distribution function and using local linear regression, respectively. We study the asymptotic properties of these estimators and propose a weighted-bootstrap method for inference. Using simulated and real datasets, we demonstrate that the proposed estimators perform well in finite samples.

Keywords: Average treatment effect, High dimension, Least absolute shrinkage and selection operator (Lasso), Nonparametric quantile regression, Nonseparable models, Quantile treatment effect, Unconditional average structural derivative

JEL codes: C21, J62

1 Introduction

Non-separable models without additivity appear frequently in econometric analyses, because economic theory motivates a nonlinear role of the unobserved individual heterogeneity (Altonji and Matzkin, 2005) and its multi-dimensionality (Browning and Carro, 2007; Carneiro et al., 2003; Cunha et al., 2010). A large fraction of the previous literature on non-separable models has used control variables to achieve the unconfoundedness condition (Rosenbaum and Rubin, 1983), that is, the conditional independence between a regressor of interest (or a treatment) and the unobserved individual heterogeneity given the control variables. Although including high-dimensional control variables makes unconfoundedness more plausible, it also makes estimation and inference more challenging. It remains an open question how to select control variables among potentially very many candidates and how to conduct proper statistical inference for the parameters of interest in non-separable models with a continuous treatment.

This paper proposes estimation and inference for unconditional parameters,[1] including unconditional means of the potential outcomes, the unconditional cumulative distribution function, the unconditional quantile function, and the unconditional quantile partial derivative, in the presence of both a continuous treatment and high-dimensional covariates.[2] The proposed method estimates the parameters of interest in three stages. The first stage selects controls by the method of least absolute shrinkage and selection operator (Lasso) and predicts reduced-form parameters such as the conditional expectation and distribution of the outcome given the control variables and the treatment level, and the conditional density of the treatment given the control variables. We allow for different control variables to be selected at different values of the continuous treatment. The second stage recovers the average and the marginal distribution of the potential outcome by plugging the reduced-form parameters into doubly robust moment conditions. The last stage recovers the quantile of the potential outcome and its derivative with respect to the treatment by inverting the estimated distribution function and using local linear regression, respectively. Inference is implemented via a weighted bootstrap without recalculating the first-stage variable selections, which saves considerable computation time.

[1] To be more specific, the parameters of interest are unconditional on covariates but conditional on the treatment level.

[2] We focus on unconditional parameters, in which (potentially high-dimensional) covariates are employed to achieve unconfoundedness but the parameters of interest are unconditional on the covariates. Unconditional parameters are simple to display, and this simplicity is crucial especially when the covariates are high-dimensional. As emphasized in Frölich and Melly (2013) and Powell (2010), unconditional parameters have two additional attractive features. First, by definition, they capture all the individuals in the sample at the same time instead of investigating the underlying structure separately for each subgroup defined by the covariates $X$. The treatment effect for the whole population is more policy-relevant. Second, an estimator for unconditional parameters can have better finite/large sample properties.

To motivate our parameters of interest, we relate our estimands (the population objects that our procedure aims to recover) to the structural outcome function. Notably, we extend Hoderlein and Mammen (2007) and Sasaki (2015) to demonstrate that the unconditional derivative of the quantile of the potential outcome with respect to the treatment is equal to the weighted average of the marginal effects over individuals with the same outcomes and treatments.

This paper contributes to two important strands of the econometric literature. The first is the literature on non-separable models with a continuous treatment, in which previous analyses have focused on a fixed and small number of control variables; see, e.g., Chesher (2003), Chernozhukov et al. (2007), Hoderlein and Mammen (2007), Imbens and Newey (2009), Matzkin (1994) and Matzkin (2003). The second is a growing literature on recovering causal effects from high-dimensional data; see, e.g., Belloni et al. (2012), Belloni et al. (2014a), Chernozhukov et al. (2015a), Chernozhukov et al. (2015b), Farrell (2015), Athey and Imbens (2016), Chernozhukov et al. (2017), Belloni et al. (2014b), Wager and Athey (2018), Belloni et al. (2017a), and Belloni et al. (2017b). Our paper complements the previous works by studying both the variable selection and post-selection inference of causal parameters in a non-separable model with a continuous treatment. Recently, Cattaneo et al. (2016), Cattaneo et al. (2018a), and Cattaneo et al. (2018b) have considered the semiparametric estimation of the causal effect in a setting with many included covariates and proposed novel bias-correction methods to conduct valid inference. Compared with these works, we deal with a fully nonparametric model with an ultra-high dimension of potential covariates, and rely on approximate sparsity to reduce dimensionality.

The treatment variable being continuous imposes difficulties in both variable selection and post-selection inference. To address the former, we use penalized local maximum likelihood and least squares estimation (hereafter, MLE and LS, respectively) to select control variables for each value of the continuous treatment. The penalized local LS was previously studied by Kong et al. (2015) and Lee and Mammen (2016).[3] The local MLE complements the LS method by estimating a nonlinear and high-dimensional model with varying coefficients indexed by not only the continuous treatment variable but also a location variable. Our approach directly extends the distribution regression proposed in Chernozhukov et al. (2013) to the high-dimensional varying coefficient setting. Because we rely on kernel smoothing, we require a different penalty loading than the traditional Lasso method. Chu et al. (2011) and Ning and Liu (2017) develop general theories of estimation, inference, and hypothesis testing of penalized (pseudo) MLE. We complement their results by considering the local likelihood with an $L_{1}$ penalty term. Belloni et al. (2018a) construct uniformly valid confidence bands for the Z-estimators of unconditional moment equalities. Our results are not covered by theirs either, as our parameters are defined based on conditional moment equalities. To prove the statistical properties of the penalized local MLE, we establish a local version of the compatibility condition (Bühlmann and van de Geer, 2011), which, to the best of our knowledge, is itself new.

[3] We thank the referee for the reference.

For the post-selection inference, we establish doubly robust moment conditions for the continuous treatment effect model. Our parameters of interest are irregularly identified in the sense of Khan and Tamer (2010), as they are identified by a thin set. Therefore, because we average observations only when their treatment levels are close to the one of interest, the convergence rates of our estimators are nonparametric, in contrast with the $\sqrt{n}$-rate obtained in Belloni et al. (2017a) and Farrell (2015). Although motivated by distinct models, Belloni et al. (2016) also estimate irregularly identified parameters in the high-dimensional setting. However, the irregularity faced by Belloni et al. (2016) is not due to the continuity of the variable of interest. Consequently, Belloni et al. (2016) do not study the regularized estimator with localization as we do in this paper.

Estimation based on doubly robust moments is also related to the literature on semiparametric efficiency. The idea of doubly robust estimation can be traced back to the nonparametric efficiency theory for functional estimation developed by Begun et al. (1983), Pfanzagl (1990), Bickel et al. (1993), and Newey (1994). Robins and Rotnitzky (2001) and van der Laan and Robins (2003) study semiparametric doubly robust estimators by modeling both the treatment and outcome processes. van der Laan and Dudoit (2003) allow for nonparametric modeling in causal inference problems. When both processes are nonparametrically estimated, doubly robust methods can achieve faster rates of convergence than their nuisance estimators, making the estimator less sensitive to the curse of dimensionality and model selection bias. Their use in causal inference is also considered by Robins and Rotnitzky (1995), Hahn (1998), van der Laan and Robins (2003), Hirano et al. (2003), van der Laan and Rubin (2006), Firpo (2007), Tsiatis (2007), van der Laan and Rose (2011), Kennedy et al. (2017), and Robins et al. (2017), among others.

Among the works above, our paper is most closely related to Kennedy et al. (2017), who consider doubly robust estimation of the average treatment effect when the treatment variable is continuous. Our paper complements theirs in four aspects. First, the estimation procedures are different. Kennedy et al. (2017) first estimate the efficient influence function for the weighted average of the mean effect over all treatment levels, and then use kernel smoothing to estimate the mean effect at each treatment level. By contrast, we directly consider the doubly robust moment for the parameters of interest. Second, Kennedy et al. (2017) mainly focus on the mean effect, while we also consider quantile and marginal treatment effects. We obtain linear expansions for our estimators uniformly over both the quantile index and the treatment variable. Third, Kennedy et al. (2017) do not construct detailed estimators of their nuisance parameters, but instead impose high-level assumptions. Verifying such high-level assumptions in the high-dimensional setting is nontrivial. We instead provide valid estimators for our nuisance parameters via both regularization and localization, and derive their statistical properties. Fourth, we take into account the fact that the dimension of covariates may increase with the sample size, so that the complexity of our nuisance parameter estimator, measured by the uniform entropy, will diverge to infinity. Such a situation is ruled out by Kennedy et al. (2017).

To obtain uniformly valid results over values of the continuous treatment, we derive linear expansions of the rearrangement operator for a local process which is not tight, extending the existing results in Chernozhukov et al. (2010).

We study the finite sample performance of our estimation procedure via Monte Carlo simulations and an empirical application. The simulations suggest that the proposed estimators perform reasonably well in finite samples. In the empirical exercise, we estimate the distributional effect of parental income on son’s income and intergenerational elasticity using the 1979 National Longitudinal Survey of Youth (NLSY79). We control for a large dimension of demographic variables. The quantiles of son’s potential income are in general upward sloping with respect to parental income, but for the subsample of blacks, the intergenerational elasticities are not statistically significant.

The rest of this paper is organized as follows. Section 2 presents the model and the parameters of interest. Section 3 proposes an estimation method in the presence of high-dimensional covariates. Section 4 demonstrates the validity of a bootstrap inference procedure. Section 5 presents Monte Carlo simulations. Section 6 illustrates the proposed estimator using NLSY79. Section 7 concludes. Proofs of the main theorems and Lemma 3.1 are reported in the appendix. Proofs of the rest of the lemmas are collected in an online supplement.

Throughout this paper, we adopt the convention that capital letters, such as $A$, $Y$, and $X$, denote random elements, while the corresponding lowercase letters denote realizations. $C$ denotes an arbitrary positive constant that may not be the same in different contexts. For a sequence of random variables $\{U_{n}\}_{n=1}^{\infty}$ and a random variable $U$, $U_{n}\rightsquigarrow U$ indicates weak convergence in the sense of van der Vaart and Wellner (1996). When $U_{n}$ and $U$ are $k$-dimensional elements, the space of the sample path is $\Re^{k}$ equipped with the Euclidean norm. When $U_{n}$ and $U$ are stochastic processes, the space of the sample path is $L^{\infty}(\{v\in\Re^{k}:|v|<B\})$ for some positive $B$, equipped with the sup norm. The letters $\mathbb{P}_{n}$, $\mathbb{P}$, and $\mathcal{U}_{n}$ denote the empirical process, expectation, and U-process, respectively. In particular, $\mathbb{P}_{n}$ assigns probability $\frac{1}{n}$ to each observation and $\mathcal{U}_{n}$ assigns probability $\frac{1}{n(n-1)}$ to each pair of observations. $\mathbb{E}$ also denotes expectation. We use $\mathbb{P}$ and $\mathbb{E}$ interchangeably. For any positive (random) sequence $(u_{n},v_{n})$, if there exists a positive constant $C$ independent of $n$ such that $u_{n}\leq Cv_{n}$, then we write $u_{n}\lesssim v_{n}$. $||\cdot||_{Q,q}$ denotes the $L^{q}$ norm under measure $Q$, where $q=1,2,\infty$. If the measure $Q$ is omitted, the underlying measure is assumed to be the counting measure. For any vector $\theta$, $||\theta||_{0}$ denotes the number of its nonzero coordinates. $\text{Supp}(\theta)$, the support of a $p$-dimensional vector $\theta$, is defined as $\{j:\theta_{j}\neq 0\}$. For $T\subset\{1,2,\cdots,p\}$, let $|T|$ be the cardinality of $T$, $T^{c}$ be the complement of $T$, and $\theta_{T}$ be the vector in $\Re^{p}$ that has the same coordinates as $\theta$ on $T$ and zero coordinates on $T^{c}$. Last, let $a\vee b=\max(a,b)$.

2 Model and Parameters of Interest

Econometricians observe an outcome $Y$, a continuous treatment $T$, and a set of covariates $X$, which may be high-dimensional. They are connected by a measurable function $\Gamma(\cdot)$, i.e.,

$$Y=\Gamma(T,X,A),$$

where $A$ is an unobservable random vector and may not be weakly separable from observables $(T,X)$, and $\Gamma$ may not be monotone in either $T$ or $A$.

Let $Y(t)=\Gamma(t,X,A)$. We are interested in the average $\mathbb{E}Y(t)$, the marginal distribution $\mathbb{P}(Y(t)\leq u)$ for some $u\in\Re$, and the quantile $q_{\tau}(t)$, where we denote $q_{\tau}(t)$ as the $\tau$-th quantile of $Y(t)$ for some $\tau\in(0,1)$. We are also interested in the causal effect of moving $T$ from $t$ to $t^{\prime}$, i.e., $\mathbb{E}(Y(t)-Y(t^{\prime}))$ and $q_{\tau}(t)-q_{\tau}(t^{\prime})$. Last, we are interested in the average marginal effect $\mathbb{E}[\partial_{t}\Gamma(t,X,A)]$ and the quantile partial derivative $\partial_{t}q_{\tau}(t)$. Next, we specify conditions under which the above parameters are identified.

Assumption 1

The random variables $A$ and $T$ are conditionally independent given $X$.

Assumption 1 is known as the unconfoundedness condition, which is commonly assumed in the treatment effect literature. See Cattaneo (2010), Cattaneo and Farrell (2011), Hirano et al. (2003) and Firpo (2007) for the case of discrete treatment and Graham et al. (2014), Galvao and Wang (2015), and Hirano and Imbens (2004) for the case of continuous treatment. It is also called the conditional independence assumption in Hoderlein and Mammen (2007), which is weaker than the full joint independence between $A$ and $(T,X)$. Note that $X$ can be arbitrarily correlated with the unobservables $A$. This assumption is more plausible when we control for sufficiently many and potentially high-dimensional covariates.

Theorem 2.1

Suppose Assumption 1 holds and $\Gamma(\cdot)$ is differentiable in its first argument. Then the marginal distribution of $Y(t)$ and the average marginal effect $\partial_{t}\mathbb{E}Y(t)$ are identified. In addition, if Assumption 6 in the Appendix holds and $X$ is continuously distributed, then $\partial_{t}q_{\tau}(t)=\mathbb{E}_{\mu_{\tau,t}}[\partial_{t}\Gamma(t,X,A)]$, where, for $f_{(X,A)}$ denoting the joint density of $(X,A)$, $\mu_{\tau,t}$ is the probability measure on $\{(x,a):\Gamma(t,x,a)=q_{\tau}(t)\}$ with density $\frac{f_{(X,A)}}{c_{f}\|\nabla_{(x,a)}\Gamma(t,\cdot,\cdot)\|}$, where

$$c_{f}=\int_{\{(x,a):\Gamma(t,x,a)=q_{\tau}(t)\}}\frac{f_{(X,A)}(x,a)}{\|\nabla_{(x,a)}\Gamma(t,\cdot,\cdot)\|}dxda.$$

Several comments are in order. First, because the marginal distribution of $Y(t)$ is identified, so are its average, quantile, average marginal effect, and quantile partial derivative. As pointed out by Imbens and Newey (2009), a non-separable outcome with a general disturbance is equivalent to treatment effect models. Therefore, we can view $Y(t)$ as the potential outcome. Under unconfoundedness, the identification of the marginal distribution of the potential outcome with a continuous treatment has already been established in Hirano and Imbens (2004) and Galvao and Wang (2015). The first part of Theorem 2.1 just restates their results. Second, the second result indicates that the partial quantile derivative identifies the weighted average marginal effect for the subpopulation with the same potential outcome, i.e., $\{Y(t)=q_{\tau}(t)\}$. The result is closely related to, but different from, Sasaki (2015). We consider the unconditional quantile of $Y(t)$, whereas he considered the conditional quantile of $Y(t)$ given $X$. Note that $q_{\tau}(t)$ is not the average of the conditional quantile of $Y(t)$ given $X$. Third, we require $X$ to be continuous just for simplicity of derivation. If some elements of $X$ are discrete, a similar result can be established in a conceptually straightforward manner by focusing on the continuous covariates within samples homogeneous in the discrete covariates, at the expense of additional notation. Finally, we do not require $X$ to be continuous when establishing the estimation and inference results below.

3 Estimation

Let $f_{t}(x)=f_{T|X}(t|x)$ denote the conditional density of $T$ evaluated at $t$ given $X=x$ and $d_{t}(\cdot)$ denote the Dirac function such that for any function $g(\cdot)$,

$$\int g(s)d_{t}(s)ds=g(t).$$
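In estimation, the Dirac function $d_{t}(\cdot)$ is replaced by a scaled kernel $K((s-t)/h)/h$ with a small bandwidth $h$. The following is a minimal numerical sketch of that approximation, not part of the paper: the Gaussian kernel, the test function `g`, the integration grid, and the bandwidth values are all illustrative assumptions.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as an illustrative kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def smoothed_evaluation(g, t, h, lo=-10.0, hi=10.0, n_grid=20001):
    """Trapezoidal approximation of  int g(s) K((s - t)/h)/h ds  on [lo, hi]."""
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        s = lo + k * step
        weight = 0.5 if k in (0, n_grid - 1) else 1.0  # trapezoid end points
        total += weight * g(s) * gaussian_kernel((s - t) / h) / h
    return total * step


def g(s):
    """Hypothetical smooth test function."""
    return math.sin(s) + s * s


t = 0.7
# As h shrinks, the kernel-smoothed integral should approach g(t).
approx = {h: smoothed_evaluation(g, t, h) for h in (0.5, 0.1, 0.02)}
errors = {h: abs(value - g(t)) for h, value in approx.items()}
```

For a symmetric kernel the approximation error is of order $h^{2}$, which is why the entries of `errors` shrink quickly as the bandwidth decreases.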

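In addition, let $Y_{u}(t)=1\{Y(t)\leq u\}$ and $Y_{u}=1\{Y\leq u\}$ for some $u\in\Re$. Then $\mathbb{E}(Y(t))$ and $\mathbb{E}(Y_{u}(t))$ can be identified by the method of generalized propensity score as proposed in Hirano and Imbens (2004), i.e.,

$$\mathbb{E}(Y(t))=\mathbb{E}\biggl(\frac{Yd_{t}(T)}{f_{t}(X)}\biggr)\quad\text{and}\quad\mathbb{E}(Y_{u}(t))=\mathbb{E}\biggl(\frac{Y_{u}d_{t}(T)}{f_{t}(X)}\biggr).\tag{3.1}$$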

There is a direct analogy between (3.1) for the continuous treatment and $\mathbb{E}(Y_{u}(t))=\mathbb{E}(\frac{Y_{u}1\{T=t\}}{\mathbb{P}(T=t|X)})$ when the treatment $T$ is discrete: the indicator function shrinks to a Dirac function and the propensity score is replaced by the conditional density. Following this analogy, Hirano and Imbens (2004) called $f_{t}(X)$ the generalized propensity score.

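Belloni et al. (2017a) and Farrell (2015) considered the model with a discrete treatment and high-dimensional control variables, and proposed to use the doubly robust moment for inference. Following their lead, we propose the corresponding doubly robust moment when the treatment status is continuous. Let $\nu_{t}(x)=\mathbb{E}(Y|X=x,T=t)$ and $\phi_{t,u}(x)=\mathbb{E}(Y_{u}|X=x,T=t)$, then

$$\mathbb{E}(Y(t))=\mathbb{E}\biggl[\biggl(\frac{(Y-\nu_{t}(X))d_{t}(T)}{f_{t}(X)}\biggr)+\nu_{t}(X)\biggr]$$

and

$$\mathbb{E}(Y_{u}(t))=\mathbb{E}\biggl[\biggl(\frac{(Y_{u}-\phi_{t,u}(X))d_{t}(T)}{f_{t}(X)}\biggr)+\phi_{t,u}(X)\biggr].$$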

We propose the following three-stage procedure to estimate $\mu(t):=\mathbb{E}Y(t)$ , $\alpha(t,u):=\mathbb{P}(Y(t)\leq u)$ , $q_{\tau}(t)$ , and $\partial_{t}q_{\tau}(t)$ :

1. Estimate $\nu_{t}(x)$ , $\phi_{t,u}(x)$ , and $f_{t}(x)$ by $\widehat{\nu}_{t}(x)$ , $\widehat{\phi}_{t,u}(x)$ , and $\hat{f}_{t}(x)$ , respectively, using the first-stage bandwidth $h_{1}$ .

2. Estimate $\mu(t)$ and $\alpha(t,u)$ by

[TABLE]

and

[TABLE]

where $K(\cdot)$ and $h_{2}$ are a kernel function and the second-stage bandwidth, respectively. Then rearrange $\hat{\alpha}(t,u)$ to obtain $\hat{\alpha}^{r}(t,u)$ , which is monotone in $u$ .

3. Estimate $q_{\tau}(t)$ by inverting $\hat{\alpha}^{r}(t,u)$ with respect to (w.r.t.) $u$ , i.e., $\hat{q}_{\tau}(t)=\inf\{u:\hat{\alpha}^{r}(t,u)\geq\tau\};$ estimate $\partial_{t}\mu(t)=\mathbb{E}\partial_{t}\Gamma(t,X,A)$ by $\breve{\beta}^{1}(t)$ , which is the estimator of the slope coefficient in the local linear regression of $\hat{\mu}(T_{i})$ on $T_{i}$ ; estimate $\partial_{t}q_{\tau}(t)$ by $\hat{\beta}_{\tau}^{1}(t)$ , which is the estimator of the slope coefficient in the local linear regression of $\hat{q}_{\tau}(T_{i})$ on $T_{i}$ .
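To make the second stage concrete, the following is a minimal sketch under a stated assumption: the displayed estimator is not reproduced in this text, so the code assumes the usual kernel-smoothed doubly robust form $\hat{\mu}(t)=n^{-1}\sum_{i}[\widehat{\nu}_{t}(X_{i})+(Y_{i}-\widehat{\nu}_{t}(X_{i}))K((T_{i}-t)/h_{2})/(h_{2}\hat{f}_{t}(X_{i}))]$ . The Gaussian kernel and the names `gaussian_kernel` and `dr_mean` are illustrative choices, not part of the paper.

```python
import math


def gaussian_kernel(v):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def dr_mean(y, t_obs, nu_hat, f_hat, t, h2):
    """Assumed doubly robust estimate of mu(t).

    nu_hat[i] and f_hat[i] stand for first-stage fits of nu_t(X_i)
    and f_t(X_i); how they are obtained is outside this sketch.
    """
    n = len(y)
    total = 0.0
    for yi, ti, nui, fi in zip(y, t_obs, nu_hat, f_hat):
        # Kernel weight localizes around T = t; f_hat plays the role
        # of the generalized propensity score.
        weight = gaussian_kernel((ti - t) / h2) / (h2 * fi)
        total += nui + (yi - nui) * weight
    return total / n
```

Replacing `y` by the indicators $1\{Y_{i}\leq u\}$ and `nu_hat` by fits of $\phi_{t,u}(X_{i})$ would give the analogous sketch for $\hat{\alpha}(t,u)$ .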

3.1 The First Stage Estimation

In this section, we define the first stage estimators and derive their asymptotic properties. Since $\nu_{t}(x)$ , $\phi_{t,u}(x)$ , and $f_{t}(x)$ are local parameters w.r.t. $T=t$ , in addition to using $L_{1}$ penalty to select relevant covariates, we rely on a kernel function to implement the localization. In particular, we propose to estimate $\nu_{t}(x)$ , $\phi_{t,u}(x)$ , and $f_{t}(x)$ by a penalized local LS, a penalized local MLE, and numerical differentiation, respectively.

3.1.1 Penalized Local LS and MLE

Recall $\nu_{t}(x)=\mathbb{E}(Y|X=x,T=t)$ and $\phi_{t,u}(x)=\mathbb{E}(Y_{u}|X=x,T=t)$ where $Y_{u}=1\{Y\leq u\}$ . We approximate $\nu_{t}(x)$ and $\phi_{t,u}(x)$ by $b(x)^{\prime}\gamma_{t}$ and $\Lambda(b(x)^{\prime}\theta_{t,u})$ , respectively, where $\Lambda(\cdot)$ is the logistic CDF and $b(X)$ is a $p\times 1$ vector of basis functions with potentially large $p$ . In the case of high-dimensional covariates, $b(X)$ is just $X$ , while in the case of nonparametric sieve estimation, $b(X)$ is a series of bases of $X$ . The approximation errors for $\nu_{t}(x)$ and $\phi_{t,u}(x)$ are given by $r_{t}^{\nu}(x)=\nu_{t}(x)-b(x)^{\prime}\gamma_{t}$ and $r_{t,u}^{\phi}(x)=\phi_{t,u}(x)-\Lambda(b(x)^{\prime}\theta_{t,u}),$ respectively.

Note that we only approximate $\nu_{t}(x)$ and $\phi_{t,u}(x)$ by a linear regression and a logistic regression, respectively, with the approximation errors satisfying Assumption 2 below. Assumption 2 puts a sparsity structure on $\nu_{t}(x)$ and $\phi_{t,u}(x)$ so that the number of effective covariates that can affect them is much smaller than $p$ . If the effective covariates are a few discrete variables with a few categories each, then we can saturate the regressions with low-dimensional dummy variables so that there is no approximation error. If some of the effective covariates are continuous, then we can include sieve bases in the linear regression so that the approximation error can still satisfy Assumption 2. One scenario in which the approximate sparsity condition may fail is when there is a substantial number of discrete variables that are all on the same footing (e.g., job occupation dummies). In this case, it is hard to define a sparse approximation. [Footnote 4: We thank the Associate Editor for this point.] Last, the coefficients $\gamma_{t}$ and $\theta_{t,u}$ are both functional parameters that can vary with their indexes. This gives our setup additional flexibility against misspecification.

We estimate $\nu_{t}(x)$ and $\phi_{t,u}(x)$ by $\widehat{\nu}_{t}(x)=b(x)^{\prime}\hat{\gamma}_{t}$ and $\widehat{\phi}_{t,u}(x)=\Lambda(b(x)^{\prime}\hat{\theta}_{t,u})$ , respectively, where

[TABLE]

$\left\|\cdot\right\|_{1}$ denotes the $L_{1}$ norm, $h_{1}$ is the first-stage bandwidth, $\lambda=\ell_{n}(\log(p\vee nh_{1})nh_{1})^{1/2}$ for some slowly diverging sequence $\ell_{n}$ , and $M(y,x;g)=-[y\log(\Lambda(b(x)^{\prime}g))+(1-y)\log(1-\Lambda(b(x)^{\prime}g))]$ . Our penalty term $\lambda$ is different from the one used in Belloni et al. (2017a) and Belloni et al. (2018b), i.e., $\lambda^{*}=1.1\Phi^{-1}(1-\gamma/p)n^{1/2}$ , where $\gamma=o(1)$ is some user-supplied constant, and $\Phi(\cdot)$ is the standard normal CDF. Belloni et al. (2017a) suggest $\gamma=C/(n\log(n))$ , which implies that

[TABLE]

Therefore, our penalty term $\lambda$ is of the same order of magnitude as $\lambda^{*}$ if $nh_{1}$ is replaced with $n$ and $\ell_{n}$ is removed. We need to use $nh_{1}$ in our penalty due to the presence of the kernel function in our estimation procedure. In particular, the effective sample size is of the same order as $nh_{1}$ . [Footnote 5: Note that $\log(n)$ and $\log(nh_{1})$ are of the same order of magnitude. We will specify the order of magnitude of $h_{1}$ in Assumption 2.] The role played by $\ell_{n}$ in our penalty is similar to that of $\gamma$ in $\lambda^{*}$ , which is to control the selection error uniformly. We refer readers to Belloni et al. (2017a, Equation (6.4)) for a more detailed discussion on this point. Since we do not use the advanced technique of self-normalized processes as in Belloni et al. (2017a), we multiply $\sqrt{\log(p\vee n)}$ by the sequence $\ell_{n}$ , while in $\lambda^{*}$ , $\log(\gamma)$ is additive to $\log(pn)$ inside the square root. We propose a rule-of-thumb $\lambda$ in Section 5 and study the sensitivity of our inference method to the choice of $\lambda$ in Section D of the supplementary material.
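The two penalty levels can be compared numerically. The sketch below evaluates $\lambda=\ell_{n}(\log(p\vee nh_{1})nh_{1})^{1/2}$ and $\lambda^{*}=1.1\Phi^{-1}(1-\gamma/p)n^{1/2}$ as written above. The choice $\ell_{n}=\sqrt{\log(\log(nh_{1}))}$ is the one stated in Section 5; the function names and the example inputs are hypothetical, and this is not the rule-of-thumb penalty of Section 5 itself.

```python
import math
from statistics import NormalDist


def local_penalty(n, p, h1):
    """lambda = l_n * sqrt(log(max(p, n*h1)) * n*h1).

    Uses l_n = sqrt(log(log(n*h1))), so n*h1 must exceed e.
    """
    m = n * h1  # effective sample size under kernel localization
    ell_n = math.sqrt(math.log(math.log(m)))
    return ell_n * math.sqrt(math.log(max(p, m)) * m)


def global_penalty(n, p, gamma):
    """lambda* = 1.1 * Phi^{-1}(1 - gamma/p) * sqrt(n)."""
    return 1.1 * NormalDist().inv_cdf(1.0 - gamma / p) * math.sqrt(n)


# Hypothetical inputs: n = 1000 observations, p = 100 covariates.
lam = local_penalty(1000, 100, 0.2)
lam_star = global_penalty(1000, 100, 0.1)
```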

In (3.4) and (3.5), $\widehat{\Xi}_{t}=\text{diag}(\tilde{l}_{t,1},\cdots,\tilde{l}_{t,p})$ and $\widehat{\Psi}_{t,u}=\text{diag}(l_{t,u,1},\cdots,l_{t,u,p})$ are generic penalty loading matrices. The infeasible loading matrices we would like to use are $\widehat{\Xi}_{t,0}=\text{diag}(\tilde{l}_{t,0,1},\cdots,\tilde{l}_{t,0,p})$ and $\widehat{\Psi}_{t,u,0}=\text{diag}(l_{t,u,0,1},\cdots,l_{t,u,0,p})$ in which

[TABLE]

and

[TABLE]

respectively. Since $\nu_{t}(\cdot)$ and $\phi_{t,u}(\cdot)$ are not known, we follow Belloni et al. (2017a) and propose an iterative algorithm to obtain the feasible versions of the loading matrices. The statistical properties of the feasible loading matrices are summarized in Lemma A.8 in the Appendix.

Algorithm 3.1

1. Let $\widehat{\Xi}_{t}^{0}=\text{diag}(\tilde{l}_{t,1}^{0},\cdots,\tilde{l}_{t,p}^{0})$ and $\widehat{\Psi}_{t,u}^{0}=\text{diag}(l_{t,u,1}^{0},\cdots,l_{t,u,p}^{0})$ , where $\tilde{l}_{t,j}^{0}=||Yb_{j}(X)K(\frac{T-t}{h_{1}})h_{1}^{-1/2}||_{\mathbb{P}_{n},2}$ and $l_{t,u,j}^{0}=||Y_{u}b_{j}(X)K(\frac{T-t}{h_{1}})h_{1}^{-1/2}||_{\mathbb{P}_{n},2}.$ Using $\widehat{\Xi}_{t}^{0}$ and $\widehat{\Psi}_{t,u}^{0}$ , we can compute $\hat{\gamma}_{t}^{0}$ and $\hat{\theta}_{t,u}^{0}$ by (3.4) and (3.5). Let $\widehat{\nu}_{t}^{0}(x)=b(x)^{\prime}\hat{\gamma}_{t}^{0}$ and $\widehat{\phi}_{t,u}^{0}(x)=\Lambda(b(x)^{\prime}\hat{\theta}_{t,u}^{0})$ for $x=X_{1},...,X_{n}.$

2. For $k=1,\cdots,K$ for some fixed positive integer $K$ , we compute $\widehat{\Xi}_{t}^{k}=\text{diag}(\tilde{l}_{t,1}^{k},\cdots,\tilde{l}_{t,p}^{k})$ and $\widehat{\Psi}_{t,u}^{k}=\text{diag}(l_{t,u,1}^{k},\cdots,l_{t,u,p}^{k}),$ where

[TABLE]

and

[TABLE]

Using $\widehat{\Xi}_{t}^{k}$ and $\widehat{\Psi}_{t,u}^{k}$ , we can compute $\hat{\gamma}_{t}^{k}$ and $\hat{\theta}_{t,u}^{k}$ by (3.4) and (3.5). Let $\widehat{\nu}_{t}^{k}(x)=b(x)^{\prime}\hat{\gamma}_{t}^{k}$ and $\widehat{\phi}_{t,u}^{k}(x)=\Lambda(b(x)^{\prime}\hat{\theta}_{t,u}^{k})$ for $x=X_{1},...,X_{n}.$ The final penalty loading matrices $\widehat{\Xi}_{t}^{K}$ and $\widehat{\Psi}_{t,u}^{K}$ will be used for $\widehat{\Xi}_{t}$ and $\widehat{\Psi}_{t,u}$ in (3.4) and (3.5).
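The initialization in Algorithm 3.1 can be sketched directly from its formula for $\tilde{l}_{t,j}^{0}$ . The sketch assumes that $||\cdot||_{\mathbb{P}_{n},2}$ denotes the empirical $L_{2}$ norm, i.e., the square root of the sample mean of squares, and uses a Gaussian kernel; the function names are hypothetical. The update step for $k\geq 1$ depends on displays not reproduced here and is therefore omitted.

```python
import math


def gaussian_kernel(v):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def initial_loadings(y, b, t_obs, t, h1):
    """Initial loadings l^0_{t,j}, j = 1..p, for the local least squares.

    b is a list of n rows; row i holds the p basis values b_j(X_i).
    Assumes ||.||_{P_n,2} is the empirical L2 norm.
    """
    n = len(y)
    p = len(b[0])
    loadings = []
    for j in range(p):
        sum_sq = 0.0
        for yi, bi, ti in zip(y, b, t_obs):
            term = yi * bi[j] * gaussian_kernel((ti - t) / h1) / math.sqrt(h1)
            sum_sq += term * term
        loadings.append(math.sqrt(sum_sq / n))
    return loadings
```

Passing the indicators $1\{Y_{i}\leq u\}$ in place of `y` gives the corresponding initial loadings $l_{t,u,j}^{0}$ .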

Let $\widetilde{\mathcal{S}}_{t}^{\mu}$ and $\widetilde{\mathcal{S}}_{t,u}$ contain the supports of $\hat{\gamma}_{t}$ and $\hat{\theta}_{t,u}$ , respectively, such that $|\widetilde{\mathcal{S}}_{t}^{\mu}|\lesssim\sup_{t\in\mathcal{T}}||\widehat{\gamma}_{t}||_{0}$ , and $|\widetilde{\mathcal{S}}_{t,u}|\lesssim\sup_{(t,u)\in\mathcal{T}\mathcal{U}}||\widehat{\theta}_{t,u}||_{0}$ . For each $(t,u)\in\mathcal{T}\mathcal{U}:=\mathcal{T}\times\mathcal{U}$ where $\mathcal{T}$ and $\mathcal{U}$ are compact subsets of the supports of $T$ and $Y$ , respectively, the post-Lasso estimators of $\gamma_{t}$ and $\theta_{t,u}$ based on the sets of covariates $\widetilde{\mathcal{S}}_{t}^{\mu}$ and $\widetilde{\mathcal{S}}_{t,u}$ are defined as

[TABLE]

and

[TABLE]

The post-Lasso estimators of $\nu_{t}(X)$ and $\phi_{t,u}(X)$ are given by $\widetilde{\nu}_{t}(X)=b(X)^{\prime}\tilde{\gamma}_{t}$ and $\widetilde{\phi}_{t,u}(X)=\Lambda(b(X)^{\prime}\tilde{\theta}_{t,u})$ , respectively.

3.1.2 Conditional Density Estimation

Following Belloni et al. (2018b), we propose to first estimate $F_{t}(X)$ , the conditional CDF of $T$ given $X$ , by the (logistic) distributional lasso regression studied in Belloni et al. (2017a) and then take the numerical derivative. Following Belloni et al. (2017a), we approximate $F_{t}(X)$ by a logistic CDF $\Lambda(b(X)^{\prime}\beta_{t})$ , and the approximation error is denoted by $r_{t}^{F}(x)=F_{t}(x)-\Lambda(b(x)^{\prime}\beta_{t})$ . We estimate $\beta_{t}$ by $\hat{\beta}_{t}$ , which is computed as

[TABLE]

where $M(\cdot)$ is the logistic likelihood as defined previously, the penalty

[TABLE]

is slightly modified from but of the same order of magnitude as $\lambda^{*}$ used in Belloni et al. (2017a) and Belloni et al. (2018b), for some $\gamma\rightarrow 0$ specified in Section 5, and the penalty loading $\hat{\Psi}_{t}$ is estimated in Algorithm 3.2 below, which is also due to Belloni et al. (2017a):

Algorithm 3.2

1. Let $\widehat{\Psi}_{t}^{0}=\text{diag}(l_{t,1}^{0},\cdots,l_{t,p}^{0})$ where $l_{t,j}^{0}=||1\{T\leq t\}b_{j}(X)||_{\mathbb{P}_{n},2}.$ Using $\widehat{\Psi}_{t}^{0}$ , we can compute $\hat{\beta}_{t}^{0}$ and $\hat{F}_{t}(X)$ by the (logistic) distributional lasso regression.

2. For $k=1,\cdots,K$ , we compute $\widehat{\Psi}_{t}^{k}=\text{diag}(l_{t,1}^{k},\cdots,l_{t,p}^{k})$ where

[TABLE]

Using $\widehat{\Psi}_{t}^{k}$ , we can compute $\hat{\beta}_{t}^{k}$ and $\hat{F}_{t}^{k}(X)$ by the (logistic) distributional lasso regression. The final penalty loading matrix $\widehat{\Psi}_{t}^{K}$ will be used as $\widehat{\Psi}_{t}$ in (3.6).

Then, $f_{t}(X)$ , the conditional density of $T$ given $X$ evaluated at $t$ , is computed as

[TABLE]

where $h_{1}$ is the first-stage bandwidth.
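The numerical differentiation step can be illustrated as follows. The displayed formula is not reproduced in this text, so the sketch assumes a symmetric (central) difference of the fitted conditional CDF with step $h_{1}$ ; the function names are hypothetical, and a logistic CDF stands in for a fitted $s\mapsto\hat{F}_{s}(x)$ at a fixed $x$ .

```python
import math


def logistic_cdf(v):
    """Logistic CDF, Lambda(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def density_by_difference(cdf_at, t, h1):
    """Assumed central difference of s -> F_s(x) at s = t.

    cdf_at is a callable returning the fitted conditional CDF of T
    given X = x, evaluated at its argument.
    """
    return (cdf_at(t + h1) - cdf_at(t - h1)) / (2.0 * h1)


# Stand-in example: the logistic density at zero is 1/4.
approx_density = density_by_difference(logistic_cdf, 0.0, 1e-3)
```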

3.1.3 Asymptotic Properties of the First Stage Estimators

To study the asymptotic properties of the first stage estimators, we need some assumptions.

Assumption 2

Let $\mathcal{T}\mathcal{U}$ be a compact subset of the support of $(T,Y)$ and $\mathcal{X}$ be the support of $X$ .

1. The sample $\{Y_{i},T_{i},X_{i}\}_{i=1}^{n}$ is i.i.d.

2. $||\max_{j\leq p}|b_{j}(X)|||_{\mathbb{P},\infty}\leq\zeta_{n}$ and $\underline{C}\leq\mathbb{E}b_{j}(X)^{2}\leq 1/\underline{C}$ for $j=1,\cdots,p.$

3. $\sup_{(t,u)\in\mathcal{T}\mathcal{U}}\max(||\gamma_{t}||_{0},||\beta_{t}||_{0},||\theta_{t,u}||_{0})\leq s$ for some $s$ which possibly depends on the sample size $n$ .

4. $\sup_{t\in\mathcal{T}}||r_{t}^{F}(X)||_{\mathbb{P}_{n},2}=O_{p}((s\log(p\vee n)/(n))^{1/2})$ and

[TABLE]

5. $\sup_{t\in\mathcal{T}}||r_{t}^{F}(X)||_{\mathbb{P},\infty}=O((\log(p\vee n)s^{2}\zeta_{n}^{2}/(n))^{1/2})$ and

[TABLE]

6. $f_{t}(x)$ is second-order differentiable w.r.t. $t$ with bounded derivatives uniformly over $(t,x)\in\mathcal{TX}$ , where $\mathcal{T}$ is a compact subset of the support of $T$ and $\mathcal{X}$ is the support of $X$ .

7. $\zeta_{n}^{2}s^{2}\ell_{n}^{2}\log(p\vee n)/(nh_{1})\rightarrow 0$ , $nh_{1}^{5}/(\log(p\vee n))\rightarrow 0.$

Assumption 2.1 is common for cross-sectional observations. Assumption 2.2 is the same as Assumption 6.1(a) in Belloni et al. (2017a). Assumption 2.3 requires that $\nu_{t}(x)$ , $\phi_{t,u}(x)$ , and $F_{t}(x)$ are approximately sparse, i.e., they can be well-approximated by using at most $s$ elements of $b(x)$ . This approximate sparsity condition is common in the literature on high-dimensional data (see, e.g., Belloni et al. (2017a)). Assumptions 2.4 and 2.5 specify how good the approximations are in terms of the $L_{\mathbb{P}_{n},2}$ and $L_{\mathbb{P},\infty}$ norms. The exact rate for $r_{t}^{F}(X)$ follows Belloni et al. (2017a). The rates for $r_{t}^{\nu}(X)$ and $r_{t,u}^{\phi}(X)$ are different from that for $r_{t}^{F}(X)$ because their approximations are local in $T=t$ . If the models for $\nu_{t}(\cdot)$ , $\phi_{t,u}(\cdot)$ , and $F_{t}(\cdot)$ are correctly specified and exactly sparse, i.e., the coefficients for all but $s$ regressors are zero, then there are no approximation errors. This implies that $r_{t}^{F}(\cdot)$ , $r_{t}^{\nu}(\cdot)$ , and $r_{t,u}^{\phi}(\cdot)$ are equal to zero, so that Assumptions 2.4 and 2.5 hold automatically. In the sieve estimation, $X$ is finite dimensional and $b(X)$ is just a sequence of sieve bases of $X$ . Then $r_{t}^{F}(\cdot)$ , $r_{t}^{\nu}(\cdot)$ , and $r_{t,u}^{\phi}(\cdot)$ are the sieve approximation biases. Assumptions 2.3 and 2.4 can be verified under some smoothness conditions (see, e.g., Chen (2007)). Therefore, Assumptions 2.4 and 2.5 are in spirit close to the smoothness condition. Assumption 2.6 is the smoothness of the true density, which is needed for the theoretical analysis of the numerical derivative. Because $\mathcal{T}$ need not be the whole support of $T$ , this condition is plausible. In a simple case, if $T=\mu(X)+U$ , $|\mu(x)|$ is bounded uniformly over $x\in\mathcal{X}$ , and $U$ is independent of $X$ and logistically distributed, then this condition holds.
Assumption 2.7 imposes conditions on the rates at which $s$ , $\zeta_{n}$ , and $p$ grow with the sample size $n$ . It ensures that the first stage nuisance parameters are estimated with sufficient accuracy. In particular, we require $s^{2}/(nh_{1})\rightarrow 0$ . Compared with the condition $s^{2}/n\rightarrow 0$ imposed in Belloni et al. (2017a), our condition reflects the local nature of our estimation procedure in the sense that our effective sample size is of order of magnitude $nh_{1}$ .

Assumption 3

1. $K(\cdot)$ is a symmetric probability density function (PDF) with

[TABLE]

There exists a positive constant $\overline{C}_{K}$ such that $\sup_{u}u^{l}K\left(u\right)\leq\overline{C}_{K}$ for $l=0,1.$

2. There exists some positive constant $\underline{C}<1$ such that $\underline{C}\leq f_{t}(x)\leq 1/\underline{C}$ uniformly over $(t,x)\in\mathcal{T}\mathcal{X}$ .

3. $\nu_{t}(x)$ and $\phi_{t,u}(x)$ are three times differentiable w.r.t. $t$ , with all three derivatives being bounded uniformly over $(t,x,u)\in\mathcal{T}\mathcal{X}\mathcal{U}.$

4. For the same $\underline{C}$ as above, $\underline{C}\leq\mathbb{E}(Y_{u}(t)|X=x)\leq 1-\underline{C}$ uniformly over $(t,x,u)\in\mathcal{T}\mathcal{X}\mathcal{U}:=\mathcal{T}\mathcal{X}\times\mathcal{U}$ .

Assumption 3.1 holds for many kernel functions, e.g., uniform and Gaussian kernels. Since $f_{t}(X)$ was referred to as the generalized propensity score by Hirano and Imbens (2004), Assumption 3.2 is analogous to the overlapping support condition commonly assumed in the treatment effect literature; see, e.g., Hirano et al. (2003) and Firpo (2007). Since the conditional density also has the sparsity structure assumed in Assumption 2, at most $s$ elements of $X$ affect the conditional density, which makes Assumption 3.2 more plausible. Assumption 3.3 imposes some smoothness conditions that are widely assumed in the nonparametric kernel literature. Assumption 3.4 holds if $\mathcal{XU}$ is compact.

Assumption 4

There exists a sequence $\ell_{n}\rightarrow\infty$ such that, with probability approaching one,

[TABLE]

Assumption 4 is the restricted eigenvalue condition commonly assumed in the high-dimensional data literature. Based on Bickel et al. (2009),

[TABLE]

are the minimal and maximal eigenvalues of Gram submatrices formed by any $s\ell_{n}$ components of $b(X)$ . Because $p\gg n$ , the matrix $b(X)^{\prime}b(X)$ is not invertible. However, because $s\ell_{n}\ll n$ , Assumption 4 implies that the Gram submatrices can still be invertible. We refer interested readers to Bickel et al. (2009) for more details and Bühlmann and van de Geer (2011) for a textbook treatment.

Since there is a kernel in the Lasso objective functions in (3.4) and (3.5), the asymptotic properties of $\hat{\gamma}_{t}$ and $\hat{\theta}_{t,u}$ cannot be established by directly applying the results in Belloni et al. (2017a). The key missing piece is the following local version of the compatibility condition. Let $\mathcal{S}_{t,u}$ be an arbitrary subset of $\{1,\cdots,p\}$ such that $\sup_{(t,u)\in\mathcal{TU}}|\mathcal{S}_{t,u}|\leq s$ and $\Delta_{c,t,u}=\{\delta:||\delta_{\mathcal{S}_{t,u}^{c}}||_{1}\leq c||\delta_{\mathcal{S}_{t,u}}||_{1}\}$ for some $c<\infty$ independent of $(t,u)$ .

Lemma 3.1

If Assumptions 1–4 hold, then there exists $\underline{\kappa}=\kappa^{\prime}\underline{C}^{1/2}/4>0$ such that, w.p.a.1,

[TABLE]

Note $\mathcal{S}_{t,u}$ in Lemma 3.1 is either the support of $\theta_{t,u}$ or the support of $\gamma_{t}$ . For the latter case, the index $u$ is not needed. We refer to Lemma 3.1 as the local compatibility condition because (1) there is a kernel function implementing the localization; and (2) by the Cauchy inequality, Lemma 3.1 implies

[TABLE]

Bickel et al. (2009, Lemma 4.2) show that, under Assumption 4, we have the following compatibility condition:

[TABLE]

which is the key convertibility condition used in high-dimensional analysis. We refer interested readers to Bühlmann and van de Geer (2011, Equation 6.4), the remarks after that, and Bühlmann and van de Geer (2011, Section 6.13) for more detailed discussions and further references. Under Assumption 4 and some regularity conditions assumed in the paper, Lemma 3.1 establishes a local version of (3.7). Based on Lemma 3.1, we can establish the following asymptotic probability bounds for the first stage estimators.

Theorem 3.1

Suppose Assumptions 1–2, 3.1–3.3, and 4 hold. Then

[TABLE]

and $\sup_{t\in\mathcal{T}}||\hat{\gamma}_{t}||_{0}=O_{p}(s)$ . If in addition, Assumption 3.4 holds, then

[TABLE]

and $\sup_{(t,u)\in\mathcal{T}\mathcal{U}}||\hat{\theta}_{t,u}||_{0}=O_{p}(s).$

Several comments are in order. First, due to the nonlinearity of the logistic link function, Assumption 3.4 is needed for deriving the asymptotic properties of the penalized local MLE estimators $\widehat{\phi}_{t,u}(x)$ and $\widetilde{\phi}_{t,u}(x)$ . Second, the $L_{\mathbb{P}_{n},2}$ bounds in Theorem 3.1 are faster than $(nh_{1})^{-1/4}$ by Assumption 5 below. This implies the estimators are sufficiently accurate so that in the second stage, their second and higher order impacts are asymptotically negligible. Last, the numbers of nonzero coordinates of $\hat{\gamma}_{t}$ and $\hat{\theta}_{t,u}$ determine the complexity of our first stage estimators, which are uniformly controlled with a high probability.

For the conditional density estimation, we have the following results.

Theorem 3.2

Suppose Assumptions 1–2, 3.1–3.3, and 4 hold. Then

[TABLE]

and $\sup_{t\in\mathcal{T}}||\hat{\beta}_{t}||_{0}=O_{p}(s).$

The rates of convergence in Theorem 3.2 are the same as those derived in Belloni et al. (2018b, Section 8).

3.2 The Second Stage Estimation

Let $W=\{Y,T,X\}$ and $W_{u}=\{Y_{u},T,X\}$ . For three generic functions $\breve{\nu}(\cdot)$ , $\breve{\phi}(\cdot)$ and $\breve{f}(\cdot)$ of $X$ , denote

[TABLE]

and

[TABLE]

Then the estimators $\hat{\mu}(t)$ and $\hat{\alpha}(t,u)$ can be written as

[TABLE]

where $\overline{\nu}_{t}(\cdot)$ , $\overline{\phi}_{t,u}(\cdot)$ , and $\overline{f}_{t}(\cdot)$ are either the Lasso estimators (i.e., $\widehat{\nu}_{t}(\cdot)$ , $\widehat{\phi}_{t,u}(\cdot)$ , and $\hat{f}_{t}(\cdot)$ ) or the post-Lasso estimators (i.e., $\widetilde{\nu}_{t}(\cdot)$ , $\widetilde{\phi}_{t,u}(\cdot)$ , and $\tilde{f}_{t}(\cdot)$ ) as defined in Section 3.1.

Assumption 5

Let $h_{2}=C_{2}n^{-H_{2}}$ for some positive constant $C_{2}$ .

1. $H_{2}\in[1/5,1/3)$ , $\log^{2}(n)s^{2}\log^{2}(p\vee n)/(nh_{2})\rightarrow 0$ , $\ell_{n}^{2}s^{2}\log^{2}(p\vee n)/(nh_{1}^{2})\rightarrow 0$ , and $\ell_{n}^{2}s^{2}\log^{2}(p\vee n)h_{2}/(nh_{1}^{3})\rightarrow 0$ .

2. $H_{2}\in(1/4,1/3)$ , $\log^{2}(n)s^{2}\log^{2}(p\vee n)/(nh_{2}^{2})\rightarrow 0$ , $\ell_{n}^{2}s^{2}\log^{2}(p\vee n)/(nh_{1}^{2}h_{2})\rightarrow 0$ , and $\ell_{n}^{2}s^{2}\log^{2}(p\vee n)/(nh_{1}^{3})\rightarrow 0$ .

Theorem 3.3

Suppose Assumptions 1–4 and 5.1 hold. Then

[TABLE]

and

[TABLE]

where

[TABLE]

$\kappa_{2}=\int u^{2}K(u)du$ , $\sup_{t\in\mathcal{T}}|R_{n}^{\prime}(t)|=o_{p}((nh_{2})^{-1/2})$ and $\sup_{(t,u)\in\mathcal{T}\mathcal{U}}|R_{n}(t,u)|=o_{p}((nh_{2})^{-1/2}).$ If Assumption 5.1 is replaced by Assumption 5.2, then

[TABLE]

Theorem 3.3 presents the Bahadur representations of the nonparametric estimators $\hat{\mu}(t)$ and $\hat{\alpha}(t,u)$ with a uniform control on the remainder terms. For most purposes (e.g., to obtain the asymptotic distributions of these intermediate estimators or to obtain the results below), Assumption 5.1 is sufficient. Occasionally, one needs to impose Assumption 5.2 to have a better control on the remainder terms, say, when one conducts an $L_{2}$ -type specification test. See the remark after Theorem 3.4 below.

3.3 The Third Stage Estimation

Recall that $q_{\tau}(t)$ denotes the $\tau$ -th quantile of $Y(t)$ , which is the inverse of $\alpha(t,u)$ w.r.t. $u$ . We propose to estimate $q_{\tau}(t)$ by $\hat{q}_{\tau}(t)$ where $\hat{q}_{\tau}(t)=\inf\{u:\hat{\alpha}^{r}(t,u)\geq\tau\}$ and $\hat{\alpha}^{r}(t,u)$ is the rearrangement of $\hat{\alpha}(t,u)$ .

We rearrange $\hat{\alpha}(t,u)$ to make it monotonically increasing in $u\in\mathcal{U}$ . Following Chernozhukov et al. (2010), for a generic function $Q(\cdot)$ , we define $\overline{Q}=Q\circ\psi^{\leftarrow}$ where $\psi$ can be any increasing bijective mapping: $\mathcal{U}\mapsto[0,1]$ and $\psi^{\leftarrow}$ is the inverse of $\psi$ . Then the rearrangement $\overline{Q}^{r}$ of $\overline{Q}$ is defined as

[TABLE]

where $F(y)=\int_{0}^{1}1\{\overline{Q}(u)\leq y\}du$ . Then the rearrangement $Q^{r}$ for $Q$ is $Q^{r}=\overline{Q}^{r}\circ\psi(u).$
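In practice the rearrangement and the inversion are computed on a grid. The sketch below relies on the fact that, for function values on an equally spaced grid, the monotone rearrangement amounts to sorting the values in ascending order; the grid, the function names, and the example values are hypothetical.

```python
def rearrange(values):
    """Monotone rearrangement of values on an equally spaced grid."""
    return sorted(values)


def invert_cdf(grid, cdf_values, tau):
    """Return inf{u in grid : cdf(u) >= tau}, or None if never reached.

    Assumes grid is increasing and cdf_values is non-decreasing.
    """
    for u, a in zip(grid, cdf_values):
        if a >= tau:
            return u
    return None


# A non-monotone estimate of alpha(t, u) on a grid of u values.
u_grid = [0.0, 1.0, 2.0, 3.0]
alpha_hat = [0.1, 0.5, 0.4, 0.9]
alpha_rearranged = rearrange(alpha_hat)
median_hat = invert_cdf(u_grid, alpha_rearranged, 0.5)
```

Without the rearrangement, the inversion would return the first crossing at $u=1$ ; after sorting, the crossing moves to $u=2$ .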

The rearrangement and inverse are two functionals operating on the process

[TABLE]

and are shown to be Hadamard differentiable by Chernozhukov et al. (2010) and van der Vaart and Wellner (1996), respectively. However, by Theorem 3.3,

[TABLE]

which is not asymptotically tight. Therefore, the standard functional delta method used in Chernozhukov et al. (2010) and van der Vaart and Wellner (1996) is not directly applicable. The next theorem overcomes this difficulty and establishes the linear expansion of the quantile estimator. Denote $\mathcal{TI}$ , $\{q_{\tau}(t):\tau\in\mathcal{I}\}^{\varepsilon}$ , $\overline{\{q_{\tau}(t):\tau\in\mathcal{I}\}^{\varepsilon}}$ , and $\mathcal{U}_{t}$ as $\mathcal{T}\times\mathcal{I}$ , the $\varepsilon$ -enlarged set of $\{q_{\tau}(t):\tau\in\mathcal{I}\}$ , the closure of $\{q_{\tau}(t):\tau\in\mathcal{I}\}^{\varepsilon}$ , and the projection of $\mathcal{TU}$ on $T=t$ , respectively.

Theorem 3.4

Suppose that Assumptions 1–4 and 5.1 hold. If $\overline{\{q_{\tau}(t):\tau\in\mathcal{I}\}^{\varepsilon}}\subset\mathcal{U}_{t}$ for any $t\in\mathcal{T}$ , then

[TABLE]

where $f_{Y(t)}$ is the density of $Y(t)$ , $\mathcal{\beta}_{q}(t,\tau)=\frac{\mathcal{\beta}_{\alpha}(t,q_{\tau}(t))}{f_{Y(t)}(q_{\tau}(t))}$ , and $\sup_{(t,\tau)\in\mathcal{T}\mathcal{I}}R_{n}^{q}(t,\tau)=o_{p}((nh_{2})^{-1/2}).$ If Assumption 5.1 is replaced by Assumption 5.2, then

[TABLE]

Under Assumption 5.2, the remainder term $R_{n}^{q}(t,\tau)$ is $o_{p}(n^{-1/2})$ uniformly in $(t,\tau)\in\mathcal{T}\mathcal{I}$ . This result is needed if one wants to establish an $L_{2}$ -type specification test of $q_{\tau}(t)$ . For example, one may be interested in testing the null hypothesis that the quantile partial derivative is homogeneous across treatment levels. In this case, the null hypothesis can be written as

[TABLE]

and the alternative hypothesis is the negation of $H_{0}$ . One way to conduct a consistent test for the above hypothesis is to employ the residuals of the linear regression of $\hat{q}_{\tau}(T_{i})$ on $T_{i}$ to construct the test statistic $\Upsilon_{n}(\tau)$ , i.e.,

[TABLE]

where $(\hat{\beta}_{0},\hat{\beta}_{1})$ are the linear coefficient estimators. This type of specification test has been previously studied by Su and Chen (2013), Lewbel et al. (2015), Su et al. (2015), Hoderlein et al. (2016), and Su and Hoshino (2016) in various contexts. One can follow them and apply the results in Theorem 3.4 to study the asymptotic distribution of $\Upsilon_{n}(\tau)$ for each $\tau.$ In addition, one can also consider either an integrated or a sup-version of $\Upsilon_{n}(\tau)$ and then study its asymptotic properties. For brevity we do not study such a specification test in this paper.

Given the estimators $\hat{\mu}(t)$ and $\hat{q}_{\tau}(t)$ , we can run local linear regressions of $\hat{\mu}(T_{i})$ and $\hat{q}_{\tau}(T_{i})$ on $\left(1,T_{i}-t\right)$ and obtain estimators $\breve{\beta}^{1}(t)$ and $\hat{\beta}_{\tau}^{1}(t)$ of $\partial_{t}\mu(t)$ and $\partial_{t}q_{\tau}(t)$ , respectively, as estimators of the linear coefficients in the local linear regression. [Footnote 6: Alternatively, one can consider the local quadratic or cubic regression.] Specifically, we define

[TABLE]

and

[TABLE]

where $h_{2}$ is the second-stage bandwidth. It is possible to use a third bandwidth $h_{3}$ in this step. Results similar to Theorem 3.5 below still hold if $h_{3}/h_{2}=O(1)$ . Note that the usual optimal bandwidth for the kernel estimator of the derivative is $O(n^{-1/7})$ . However, because $h_{2}=O(n^{-1/5})$ , the requirement that $h_{3}/h_{2}=O(1)$ implies that the optimal bandwidth is not achievable. The key reason is that, unlike the usual local linear regression, we need to plug in the estimates of $\mu(\cdot)$ and $q_{\tau}(\cdot)$ . For simplicity, we just take $h_{3}=h_{2}.$
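The local linear step is a kernel-weighted least squares regression of the plugged-in values on $(1,T_{i}-t)$ . The following sketch computes the slope in closed form; it assumes a Gaussian kernel, and the function names and the toy data are hypothetical.

```python
import math


def gaussian_kernel(v):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def local_linear_slope(t_obs, z, t, h):
    """Slope of the kernel-weighted regression of z_i on (1, T_i - t).

    z_i stands for a plugged-in value such as mu_hat(T_i) or
    q_hat_tau(T_i).
    """
    s0 = sx = sxx = sz = sxz = 0.0
    for ti, zi in zip(t_obs, z):
        d = ti - t
        w = gaussian_kernel(d / h)
        s0 += w
        sx += w * d
        sxx += w * d * d
        sz += w * zi
        sxz += w * d * zi
    # Closed-form weighted least squares solution for the slope.
    return (s0 * sxz - sx * sz) / (s0 * sxx - sx * sx)


# Toy check: if the plugged-in values are exactly linear in T,
# the local slope recovers the linear coefficient.
t_points = [0.1, 0.3, 0.5, 0.7, 0.9]
values = [2.0 + 3.0 * ti for ti in t_points]
slope = local_linear_slope(t_points, values, 0.5, 0.2)
```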

The following theorem shows the asymptotic properties of $\breve{\beta}^{1}(t)$ and $\hat{\beta}_{\tau}^{1}(t)$ .

Theorem 3.5

Suppose Assumptions 1–4 and 5.1 hold. If $\overline{\{q_{\tau}(t):\tau\in\mathcal{I}\}^{\varepsilon}}\subset\mathcal{U}_{t}$ for any $t\in\mathcal{T}$ , then

[TABLE]

and

[TABLE]

where $\sup_{t\in\mathcal{T}}|\breve{R}_{n}^{1}(t)|+\sup_{(t,\tau)\in\mathcal{T}\mathcal{I}}|R_{n}^{1}(t,\tau)|=o_{p}((nh_{2}^{3})^{-1/2})$ and $\overline{K}(v)=\int wK(v-w)K(w)dw$ .

Theorem 3.5 presents the Bahadur representations for $\breve{\beta}^{1}(t)$ and $\hat{\beta}_{\tau}^{1}(t).$ Since they are estimators for the first order derivatives $\partial_{t}\mu(t)$ and $\partial_{t}q_{\tau}(t),$ respectively, we can show that they converge to the true values at the $\left(nh_{2}^{3}\right)^{1/2}$ -rate. Such a rate is common for kernel estimators of the first-order derivative of the conditional expectation; see, e.g., Li and Racine (2007, Theorem 2.10).

4 Inference

In this section, we study the inference for $\mu(t),$ $q_{\tau}(t),$ and $\partial_{t}q_{\tau}(t).$ We follow the lead of Belloni et al. (2017a) and consider the weighted-bootstrap inference. Let $\{\eta_{i}\}_{i=1}^{n}$ be a sequence of i.i.d. random variables generated from the distribution of $\eta$ such that it has sub-exponential tails and unit mean and variance. [Footnote 7: A random variable $\eta$ has sub-exponential tails if $P(|\eta|>x)\leq K\exp(-Cx)$ for every $x$ and some constants $K$ and $C$ .] For example, $\eta$ can be a standard exponential random variable or a normal random variable with unit mean and standard deviation. We conduct the bootstrap inference based on the following procedure.

1. Obtain $\widehat{\nu}_{t}(x)$ , $\widehat{\phi}_{t,u}(x)$ , $\hat{f}_{t}(x)$ , $\widetilde{\nu}_{t}(x)$ , $\widetilde{\phi}_{t,u}(x)$ and $\tilde{f}_{t}(x)$ from the first stage.

2. For the $b$ -th bootstrap sample:

• Generate $\{\eta_{i}\}_{i=1}^{n}$ from the distribution of $\eta$ .

• Compute

[TABLE]

and

[TABLE]

where $(\overline{\phi}_{t,u}(\cdot),\overline{f}_{t}(\cdot))$ are either $(\widehat{\phi}_{t,u}(\cdot),\hat{f}_{t}(\cdot))$ or $(\widetilde{\phi}_{t,u}(\cdot),\tilde{f}_{t}(\cdot))$ .

• Rearrange $\hat{\alpha}^{b}(t,u)$ and obtain $\hat{\alpha}^{br}(t,u)$ .

• Invert $\hat{\alpha}^{br}(t,u)$ w.r.t. $u$ and obtain $\hat{q}^{b}_{\tau}(t)=\inf\{u:\hat{\alpha}^{br}(t,u)\geq\tau\}$ .

• Compute $\breve{\beta}^{b1}(t)$ and $\hat{\beta}_{\tau}^{b1}(t)$ as the slope coefficients of local linear regressions of $\eta_{i}\hat{\mu}^{b}(T_{i})$ on $(\eta_{i},\eta_{i}(T_{i}-t))$ and $\eta_{i}\hat{q}^{b}_{\tau}(T_{i})$ on $(\eta_{i},\eta_{i}(T_{i}-t))$ , respectively.

3. Repeat the above step for $b=1,\cdots,B$ and obtain a bootstrap sample of

[TABLE]

4. Obtain $\widehat{Q}^{\mu}(\alpha)$ , $\widehat{Q}^{0}(\alpha)$ , $\widehat{Q}^{\mu 1}(\alpha)$ , and $\widehat{Q}^{1}(\alpha)$ as the $\alpha$ -th quantile of the sequences $\{\hat{\mu}^{b}(t)-\hat{\mu}(t)\}_{b=1}^{B}$ , $\{\hat{q}_{\tau}^{b}(t)-\hat{q}_{\tau}(t)\}_{b=1}^{B}$ , $\{\breve{\beta}^{b1}(t)-\breve{\beta}^{1}(t)\}_{b=1}^{B}$ , and $\{\hat{\beta}_{\tau}^{b1}(t)-\hat{\beta}_{\tau}^{1}(t)\}_{b=1}^{B}$ , respectively.

The standard $100(1-\alpha)\%$ percentile bootstrap confidence interval for $q_{\tau}(t)$ is

[TABLE]

However, in our simulation study, we find that it slightly undercovers. Instead, we use the fact that the normal distribution is symmetric and propose the following modified percentile bootstrap confidence interval:

[TABLE]

where $\hat{Q}^{*0}(\alpha/2)=(-\hat{Q}^{0}(\alpha/2))\vee\hat{Q}^{0}(1-\alpha/2)$ . We define $\widehat{Q}^{*\mu}(\alpha/2)$ , $\widehat{Q}^{*\mu 1}(\alpha/2)$ , and $\widehat{Q}^{*1}(\alpha/2)$ in the same manner. The following theorem summarizes the main results in this section.
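The symmetrized critical value $\hat{Q}^{*0}(\alpha/2)$ is simple to compute from the bootstrap draws. The sketch below assumes that the modified interval is the estimate plus and minus this critical value (the displayed interval is not reproduced in this text) and uses one particular empirical quantile convention; both choices, and the function names, are assumptions for illustration.

```python
import math


def empirical_quantile(draws, prob):
    """Order-statistic quantile: smallest draw with ECDF >= prob."""
    s = sorted(draws)
    k = max(0, min(len(s) - 1, math.ceil(prob * len(s)) - 1))
    return s[k]


def modified_percentile_ci(estimate, boot_estimates, alpha):
    """Assumed symmetric interval estimate -/+ Q*(alpha/2).

    Q*(alpha/2) = max(-Q(alpha/2), Q(1 - alpha/2)), where Q is the
    quantile function of the centered bootstrap draws.
    """
    diffs = [b - estimate for b in boot_estimates]
    q_low = empirical_quantile(diffs, alpha / 2.0)
    q_high = empirical_quantile(diffs, 1.0 - alpha / 2.0)
    q_star = max(-q_low, q_high)
    return estimate - q_star, estimate + q_star


# Toy draws with a longer left tail: the interval widens on both sides.
ci = modified_percentile_ci(1.0, [-2.0, 0.0, 2.0, 3.0], 0.5)
```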

Theorem 4.1

Suppose that Assumptions 1–4 and 5.1 hold and $nh_{2}^{5}\rightarrow 0$ . Then

[TABLE]

and

[TABLE]

Theorem 4.1 implies that, via under-smoothing, the $100(1-\alpha)\%$ bootstrap confidence intervals for $\mu(t),$ $q_{\tau}(t),$ $\partial_{t}\mu(t)$ , and $\partial_{t}q_{\tau}(t)$ have the correct asymptotic coverage probability $1-\alpha.$ We need to under-smooth because, regardless of under-smoothing, the bootstrap estimator is always centered around the original estimator without the asymptotic bias. With more complicated notation and the strong approximation arguments in Chernozhukov et al. (2014b) and Chernozhukov et al. (2014a), one can show that the validity of bootstrap inference holds uniformly over $\left(t,\tau\right).$ One of the key ingredients to verify Chernozhukov et al. (2014a, Condition H1) is the linear expansions of the estimators with a uniform control of the remainder terms, which have already been established in Theorems 3.4 and 3.5.

5 Monte Carlo Simulations

This section presents the results of Monte Carlo simulations, which demonstrate the finite sample performance of the estimation and inference procedure. Let $Y$ be generated as

[TABLE]

while $T$ is generated as

[TABLE]

where $U$ and $V$ are two standard logistic random variables such that $U\perp V$ and $(U,V)\perp X$ , $\Lambda(\cdot)$ and $\Phi(\cdot)$ are the logistic and normal CDFs, respectively, $p=100$ , $X$ is a $p$ -dimensional random vector whose distribution is the Gaussian copula with covariance parameter $[{0.5^{|j-k|}}]_{jk}$ , and $b(X)$ is a vector of basis functions constructed from $X$ . Note that $T$ ranges from [math] to $1$ . The parameters of interest are $q_{\tau}(t)$ and $\partial_{t}q_{\tau}(t)$ , where $t=0.25,0.5,0.75$ and $\tau\in(0.2,0.8)$ . We consider the following three designs:

1. (Exact sparsity) $\beta_{j}=\frac{\pi^{2}}{24}$ for $j=1,\cdots,4$ , $\beta_{j}=0$ for $j\geq 5$ , and $b(X_{j})=X_{j}$ , $j=1,\cdots,100$ ;

2. (Approximate sparsity) $\beta_{j}=\frac{1}{j^{2}}$ for $j=1,\cdots,100$ and $b(X_{j})=X_{j}$ , $j=1,\cdots,100$ ;

3. (Sieve basis) $\beta_{1}=\beta_{2}=\frac{\pi^{2}}{12}$ and $\beta_{j}=0$ , $j\geq 3$ . We construct $b(X)$ as the cubic spline basis functions of $(X_{1},X_{2})$ :

[TABLE]

where $q^{(j)}(\tau)$ denotes the $\tau$ -th empirical quantile of $X_{j}$ , $j=1,2$ . This results in 169 basis functions. We further remove the basis functions with variance less than $10^{-4}$ . We end up with about 128 basis functions on average. (Footnote 8: The number of basis functions varies slightly across simulations.)

Note that the sum of the coefficients is (approximately) $\pi^{2}/6$ for all three designs. We normalize the basis functions $b(X)$ by their sample means and standard errors.
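The covariate design and the coefficient sums can be checked with a short sketch. It assumes that the Gaussian copula is obtained by applying the standard normal CDF elementwise to a Gaussian vector with covariance $[0.5^{|j-k|}]_{jk}$, which yields uniform margins; the helper `draw_gaussian_copula` is a hypothetical name, and the outcome and treatment equations are not reproduced here.

```python
import math
import numpy as np

def draw_gaussian_copula(n, p, rho=0.5, seed=0):
    """Draw n observations of a p-vector with Gaussian copula dependence.

    Assumption: margins are Uniform(0, 1), obtained by applying the
    standard normal CDF to N(0, Sigma) draws with Sigma_jk = rho^{|j-k|}.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

X = draw_gaussian_copula(n=500, p=100)

# Coefficient vectors for the three designs (only the nonzero entries matter).
beta_exact = np.r_[np.full(4, math.pi ** 2 / 24), np.zeros(96)]
beta_approx = 1.0 / np.arange(1, 101) ** 2
beta_sieve_nonzero = np.full(2, math.pi ** 2 / 12)

target = math.pi ** 2 / 6
sums = {
    "exact": beta_exact.sum(),
    "approximate": beta_approx.sum(),
    "sieve": beta_sieve_nonzero.sum(),
}
```

The exact-sparsity and sieve sums equal $\pi^{2}/6$ exactly, while the approximate-sparsity sum is the partial sum of $\sum_{j}1/j^{2}$ up to $j=100$ and falls short of $\pi^{2}/6$ by roughly $0.01$.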

We use the Gaussian kernel in all three stages. We have four tuning parameters: $\lambda$ , $\tilde{\lambda}$ , $h_{1}$ , and $h_{2}$ . As discussed in Section 3.1, we use

[TABLE]

where $\ell_{n}=\sqrt{\log(\log(nh_{1}))}$ and $\gamma=1/\log(n)$ . We use the rule-of-thumb bandwidth for $h_{1}$ , i.e., $h_{1}=h^{*}=1.06\times sd(T)\times n^{-1/5}$ . Last, we build $h_{2}$ based on the rule-of-thumb bandwidth for the local quantile regression suggested by Yu and Jones (1998). In particular, Yu and Jones (1998) propose the bandwidths $h_{RoT}(\tau)=C(\tau)\times h_{mean}$ , where $C(\tau)$ is a constant that depends only on $\tau$ , with $C(0.5)=1.095$ and $C(0.25)=C(0.75)=1.13$ , and $h_{mean}$ is the bandwidth for the kernel estimation of $\mathbb{E}(Y|T)$ . (Footnote 9: We refer interested readers to Yu and Jones (1998, Table 1) for more details on $C(\tau)$ .) In our simulation studies, as $C(\tau)$ is nearly constant over $\tau\in[0.25,0.75]$ , we simply use $C(0.5)=1.095$ for all quantile indices $\tau$ . We use leave-one-out cross-validation to search for the optimal value of $h_{mean}$ over a grid in $(0.8h^{*},1.2h^{*})$ . The resulting bandwidth is denoted as $h_{mean}^{*}$ . In order to achieve under-smoothing, we define $h_{2}=n^{-1/10}\times C(\tau)\times h_{mean}^{*}$ , where our choice of the factor $n^{-1/10}$ follows Cai and Xiao (2012, p.418).
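The bandwidth recipe can be sketched as follows. The sketch assumes a Nadaraya-Watson estimator with a Gaussian kernel for the cross-validated regression of $Y$ on $T$ (the type of kernel regression is an assumption here), and the data-generating process, grid size, and function names are hypothetical.

```python
import numpy as np

def rule_of_thumb_h1(t):
    """h* = 1.06 * sd(T) * n^(-1/5)."""
    t = np.asarray(t, dtype=float)
    return 1.06 * t.std(ddof=1) * t.size ** (-1 / 5)

def loo_cv_score(y, t, h):
    """Leave-one-out squared error of a Gaussian-kernel Nadaraya-Watson fit."""
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)  # drop each point from its own fit
    fitted = w @ y / w.sum(axis=1)
    return np.mean((y - fitted) ** 2)

def undersmoothed_h2(n, h_mean_star, c_tau=1.095):
    """h2 = n^(-1/10) * C(tau) * h_mean*, with C(0.5) = 1.095 as the default."""
    return n ** (-1 / 10) * c_tau * h_mean_star

# Hypothetical data used only to exercise the functions.
rng = np.random.default_rng(0)
n = 500
t = rng.uniform(size=n)
y = np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal(n)

h_star = rule_of_thumb_h1(t)
# Interior grid points of the open interval (0.8 h*, 1.2 h*).
grid = np.linspace(0.8 * h_star, 1.2 * h_star, 11)[1:-1]
cv_scores = [loo_cv_score(y, t, h) for h in grid]
h_mean_star = grid[int(np.argmin(cv_scores))]
h2 = undersmoothed_h2(n, h_mean_star)
```

With $n=500$ the factor $n^{-1/10}\times 1.095$ is roughly $0.59$, so $h_{2}$ is noticeably smaller than the cross-validated bandwidth, which is the intended under-smoothing.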

We repeat the bootstrap inference 500 times and all the results are based on 500 Monte Carlo simulations. The sample size is $n=500$ . Although the sample size is large compared to $p$ , in this DGP, the first-stage bandwidth is as small as $0.09$ . The effective sample size for the first-stage estimation is of order of magnitude of $nh_{1}\approx 45<100$ . In fact, we obtained warning signs of potential multi-collinearity and were unable to estimate the model when implementing the traditional estimation procedures without variable selection (i.e., without penalization).

The upper-left subplots in Figures 1, 4, 7 and 2, 5, 8 report the true functions of $q_{\tau}(t)$ and $\partial_{t}q_{\tau}(t)$ for $t=0.25,0.5,0.75$ , $\tau\in(0.2,0.8)$ and DGP 1, 2, and 3, respectively. Both $q_{\tau}(t)$ and $\partial_{t}q_{\tau}(t)$ are heterogeneous across $\tau$ and $t$ , which imposes difficulties for estimation and inference. The rest of the subplots in the above Figures show the estimation biases and standard errors. We observe that all the biases of our estimators are of smaller order of magnitude than the standard error (std) and the root mean squared error (rMSE), which indicates the doubly robust moments effectively remove the selection bias induced by the Lasso method. The estimators of the quantile functions are very accurate. The estimators of the quantile partial derivatives are less so because they have slower convergence rates. Figures 3, 6, and 9 show that the 90% point-wise modified percentile bootstrap confidence intervals have reasonable performance for both the quantile functions and their derivatives, across all $\tau$ and $t$ values considered, with slight over-coverage for the quantile derivative functions. The results of variable selection depend on the values of $t$ and $(t,u)$ for conditional density estimation and penalized local MLE, respectively, and are tedious to report; they are therefore omitted for brevity. Overall, 2 to 4 covariates are selected.

In Section D in the Appendix, we report the performance of oracle estimators for the three designs, in which oracle estimators are computed using the true conditional CDF and density functions. We also report the finite-sample performance of our mean potential outcome (i.e., $\mathbb{E}(Y(t))$ ) estimators, which is similar to that of the quantile effect estimates reported here. Last, we consider an extra design in which the approximate sparsity condition may be violated and show that our method breaks down. We use this design to illustrate the limitation of our method.

6 Empirical Illustration

To investigate our proposed estimation and inference procedures, we use the 1979 National Longitudinal Survey of Youth (NLSY79) and consider the effect of father’s income on son’s income in the presence of many control variables. Our analysis is based on Bhattacharya and Mazumder (2011). The data consist of a nationally representative sample of individuals aged 14 to 22 as of 1979. We use only white and black males and discard the individuals with missing values in the covariates we use. The resulting sample size is 1,795, out of which 1,302 individuals are white and 493 individuals are black.

The treatment variable of interest is the logarithm of father’s income, in which father’s income is computed as the average family income for 1978, 1979, and 1980. The outcome variable is the logarithm of son’s income, in which son’s income is computed as the average family income for 1997, 1999, 2001 and 2003. We create control variables by interacting a list of demographic variables with the cubic splines of the AFQT score and the years of education. (Footnote 10: The cubic splines for the AFQT score are constructed based on the normalized value by scaling the raw AFQT score into [0,1], where the knots are taken at the quantiles of the normalized AFQT score at $10\%,20\%,\ldots,90\%$ . The cubic splines for the years of education are constructed in the same way. In this exercise, we do not interact the cubic splines for the AFQT score and the years of education.) The list includes the age, the mother’s education level, the father’s education level, the indicators of (i) living in urban areas at age 14, (ii) living in the south, (iii) speaking a foreign language at childhood, and (iv) being born outside the U.S. We drop the variables whose variance is less than $10^{-4}$ . The resulting numbers of control variables are 120 for whites and 145 for blacks.

We apply the proposed estimation and inference procedures for black and white individuals separately. We use the same tuning parameter choices as in the previous section. (Footnote 11: In Section E in the Appendix, we investigate the sensitivity of our estimation method with respect to the tuning parameters.) As a result, our effective sample sizes are of orders of magnitude $nh_{1}\approx 462$ and $175$ for whites and blacks, respectively. Figures 10 and 11 show the estimated unconditional quantile functions and the estimated derivative, as well as the point-wise 90% confidence bands for $\tau\in[0.2,0.8]$ and $t$ taking values at the $25\%$ , $50\%$ , and $75\%$ quantiles of the empirical distribution of $T_{i}$ . In the context of intergenerational income mobility, the unconditional quantile and its derivative represent the quantile of son’s potential log income indexed by father’s log income and the intergenerational elasticity, respectively. The unconditional quantile functions have a slight upward trend and the estimated derivative is positive in most parts of father’s log income. The confidence bands for the unconditional quantile functions are quite narrow for both black and white individuals. For white individuals with the values of father’s log income at the $50\%$ or $75\%$ quantile, we can reject the (locally) zero intergenerational elasticity for most of the values of $\tau\in[0.2,0.8]$ . For the other cases, we cannot reject the (locally) zero intergenerational elasticity for almost all values of $\tau$ . This can be regarded as the cost of our fully nonparametric specification.

It is worthwhile to mention the variable selection in this application. The years of education, the AFQT score, the age, the father’s education level, and the mother’s education level are the leading control variables selected. (Footnote 12: More precisely, for whites, $dad\_educ*afqt$ and $mom\_educ$ are the two most selected control variables for the density estimations. $age*educ$ and $age*afqt$ are the two most selected control variables for the penalized local MLE. For blacks, $mom\_educ$ and $dad\_educ*educ$ are the two most selected control variables for the density estimations. $educ$ and $age*afqt$ are the two most selected control variables for the penalized local MLE.)

7 Conclusion

This paper studies non-separable models with a continuous treatment and high-dimensional control variables. It extends the existing results on the causal inference in non-separable models to the case with both continuous treatment and high-dimensional covariates. It develops a method based on localized $L_{1}$ -penalization to select covariates at each value of the continuous treatment. It then proposes a multi-stage estimation and inference procedure for average, quantile, and marginal treatment effects. The simulation and empirical exercises support the theoretical findings in finite samples.

Appendix

Appendix A Proof of the Main Results in the Paper

Before proving the theorem, we first introduce some additional notation and Assumption 6, which is a restatement of Sasaki (2015, Assumptions 1 and 2) in our framework. Denote by $\dim_{X}$ (resp. $\dim_{A}$ ) the dimensionality of $X$ (resp. $A$ ). We define $\partial V(y,t)=\{(x,a):\Gamma(t,x,a)=y\}$ and $\partial V(y,t)$ can be parametrized as a mapping from a $(\dim_{X}+\dim_{A}-1)$ -dimensional rectangle, denoted by $\Sigma$ , to $\partial V(y,t)$ . $H^{\dim_{X}+\dim_{A}-1}$ is the $(\dim_{X}+\dim_{A}-1)$ -dimensional Hausdorff measure restricted from $\mathbb{R}^{\dim_{X}+\dim_{A}}$ to $(\partial V(y,t),\mathcal{B}(y,t))$ , where $\mathcal{B}(y,t)$ is the set of the intersections between $\partial V(y,t)$ and a Borel set in $\mathbb{R}^{\dim_{X}+\dim_{A}}$ . $\partial v(y,\cdot;u)/\partial y$ (resp. $\partial v(\cdot,t;u)/\partial t$ ) is the velocity of $\partial V(y,t)$ at $u$ with respect to $y$ (resp. $t$ ).

Assumption 6

1. $\Gamma$ is continuously differentiable.

2. $\|\nabla_{(x,a)}\Gamma(t,\cdot,\cdot)\|\neq 0$ on $\partial V(y,t)$ .

3. The conditional distribution of $(X,A)$ given $T$ is absolutely continuous with respect to the Lebesgue measure, and $f_{(X,A)\mid T}$ is a continuously differentiable function of $\mathcal{T}$ to $L^{1}(\mathbb{R}^{\dim_{X}+\dim_{A}})$ .

4. $\int_{\partial V(y,t)}f_{(X,A)\mid T}(x,a\mid t)dH^{\dim_{X}+\dim_{A}-1}(x,a)>0$ .

5. $t\mapsto\partial V(y,t)$ is a continuously differentiable function of $\Sigma\times\mathcal{T}$ to $\mathbb{R}^{\dim_{X}+\dim_{A}}$ for every $y$ and $y\mapsto\partial V(y,t)$ is a continuously differentiable function of $\Sigma\times\mathcal{Y}$ to $\mathbb{R}^{\dim_{X}+\dim_{A}}$ for every $t$ .

6. The mapping $\partial v(y,\cdot;\cdot)/\partial t$ is a continuously differentiable function of $\mathcal{T}$ to $\mathbb{R}^{\dim_{X}+\dim_{A}}$ and $\partial v(\cdot,t;\cdot)/\partial y$ is a continuously differentiable function of $\mathcal{Y}$ to $\mathbb{R}^{\dim_{X}+\dim_{A}}$ .

7. There exist $p,q\geq 1$ with $\frac{1}{p}+\frac{1}{q}=1$ such that the mapping $(x,a)\mapsto\|\nabla_{(x,a)}\Gamma(t,x,a)\|^{-1}$ is bounded in $L^{p}(\partial V(y,t),H^{\dim_{X}+\dim_{A}-1})$ and that the mapping $(x,a)\mapsto f_{(X,A)}(x,a)$ is bounded in $L^{q}(\partial V(y,t),H^{\dim_{X}+\dim_{A}-1})$ .

Assumption 6 is a combination of Assumptions 1 and 2 in Sasaki (2015). We refer the readers to that paper for a detailed explanation.

Proof of Theorem 2.1. For the marginal distribution of $Y(t)$ , we note that, by Assumption 1, $\mathbb{P}(Y(t)\leq u)=\mathbb{E}[\mathbb{E}(1\{Y(t)\leq u\}|X)]=\mathbb{E}[\mathbb{E}(1\{Y(t)\leq u\}|X,T=t)]=\mathbb{E}[\mathbb{E}(1\{Y\leq u\}|X,T=t)].$ The first result follows as $\mathbb{E}(1\{Y\leq u\}|X,T=t)$ is identified.

For the second result, consider a random variable $T^{\ast}$ which has the same marginal distribution as $T$ and is independent of $(X,A)$ . Define

[TABLE]

Note that (i) $(X,A)$ and $T^{\ast}$ are independent, and (ii) the $\tau$ -th quantile of $Y^{\ast}$ given $T^{\ast}=t$ is $q_{\tau}(t)$ for all $t$ , because $\mathbb{P}(Y^{\ast}\leq q_{\tau}(t)\mid T^{\ast}=t)=\mathbb{P}(\Gamma(t,X,A)\leq q_{\tau}(t))=\tau$ . Assumption 6 implies Assumptions 1 and 2 in Sasaki (2015) for $(Y^{\ast},T^{\ast},U^{\ast})$ with $U^{\ast}=(X,A)$ , and then his Theorem 1 implies that the derivative of the $\tau$ -th quantile of $Y^{\ast}$ given $T^{\ast}=t$ is equal to $\mathbb{E}_{\mu_{\tau,t}}[\partial_{t}\Gamma(t,X,A)]$ . Therefore, $\partial_{t}q_{\tau}(t)=\mathbb{E}_{\mu_{\tau,t}}[\partial_{t}\Gamma(t,X,A)]$ . Note that Theorem 1 in Sasaki (2015) does not apply directly to $(Y,T,U^{\ast})$ , because our assumptions do not imply that $T$ and $U^{\ast}$ are independent.

Lemma 3.1 is the local version of the compatibility condition, which is one of the key building blocks for Lemma A.1. Then, Lemma A.1 is used to prove Theorem 3.1.

Proof of Lemma 3.1. By Assumption 4, we can work on the set

[TABLE]

We use the same partition as in Bickel et al. (2009). Let $\mathcal{S}_{0}=\mathcal{S}_{t,u}$ and $m\geq s$ be an integer which will be specified later. Partition $\mathcal{S}_{t,u}^{c}$ , the complement of $\mathcal{S}_{t,u}$ , as $\sum_{l=1}^{L}\mathcal{S}_{l}$ such that $|\mathcal{S}_{l}|=m$ for $1\leq l<L$ , $|\mathcal{S}_{L}|\leq m$ , where $\mathcal{S}_{l}$ , for $l<L$ , contains the indexes corresponding to $m$ largest coordinates (in absolute value) of $\delta$ outside $\cup_{j=0}^{l-1}\mathcal{S}_{j}$ , and $\mathcal{S}_{L}$ collects the remaining indexes. Further denote $\delta_{j}=\delta_{\mathcal{S}_{j}}$ and $\delta_{01}=\delta_{\mathcal{S}_{0}\cup\mathcal{S}_{1}}$ . Then

[TABLE]

For the first term on the right hand side (r.h.s.) of (A.2), we have

[TABLE]

where the second inequality holds because

[TABLE]

We next bound the last term on the r.h.s. of (A.2). The second term can be bounded in the same manner. Let $\tilde{\delta}_{01}=\delta_{01}/||\delta_{01}||_{2}$ . Then we have

[TABLE]

Let $\{\eta_{i}\}_{i=1}^{n}$ be a sequence of Rademacher random variables which is independent of the data and $\mathcal{F}=\{b(X)^{\prime}\delta K(\frac{T-t}{h_{1}})^{1/2}:||\delta||_{0}=m+s,||\delta||_{2}=1,t\in\mathcal{T}\}$ with envelope $F=\overline{C}_{K}\zeta_{n}(m+s)^{1/2}$ . Denote $\pi_{1n}$ as $(\frac{\log(p\vee n)(s+m)^{2}\zeta_{n}^{2}}{nh_{1}})^{1/2}$ with $m=s\ell_{n}^{1/2}$ . Then,

[TABLE]

where the first inequality is by van der Vaart and Wellner (1996, Lemma 2.3.1), the second inequality is by Ledoux and Talagrand (2013, Theorem 4.12) and the remark thereafter, and the third one is by applying Corollary 5.1 of Chernozhukov et al. (2014b) with $\sigma^{2}=\sup_{f\in\mathcal{F}}\mathbb{E}f^{2}\lesssim h_{1}$ and, for some $A\geq e$ ,

[TABLE]

By Assumption 2, $\pi_{1n}\rightarrow 0.$ Then we have, w.p.a.1.,

[TABLE]

By the same token we can show that

[TABLE]

Therefore, we have, w.p.a.1.,

[TABLE]

Combining (A.3), (A.4), and (A.5) yields that w.p.a.1.,

[TABLE]

Analogously, we can show that, w.p.a.1,

[TABLE]

Following (A.2), we have, w.p.a.1,

[TABLE]

where the second inequality holds because, by construction, $||\delta_{l}||_{2}^{2}\leq||\delta_{l-1}||_{1}||\delta_{l}||_{1}/\sqrt{m}.$ Since $m=s\ell_{n}^{1/2}$ , $s/m=\ell_{n}^{-1/2}\rightarrow 0$ , and thus, for $n$ large enough, the constant inside the brackets is greater than $\kappa^{\prime}\underline{C}^{1/2}/4$ which is independent of $(t,u,n)$ . Therefore, we can conclude that, for $n$ large enough,

[TABLE]

This completes the proof of the lemma.
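The partition used in the proof of Lemma 3.1 can be illustrated numerically. The sketch below sorts the coordinates of $\delta$ outside the support by absolute value and cuts them into blocks of size $m$, then checks the standard shifting inequality $||\delta_{l}||_{2}\leq||\delta_{l-1}||_{1}/\sqrt{m}$ for consecutive blocks outside the support. The vector `delta`, the support, and the value of `m` are hypothetical.

```python
import numpy as np

def sorted_block_partition(delta, support, m):
    """Split indices outside `support` into blocks of size m, ordered by
    decreasing |delta_j|; only the last block may be shorter than m."""
    delta = np.asarray(delta, dtype=float)
    outside = np.setdiff1d(np.arange(delta.size), support)
    order = outside[np.argsort(-np.abs(delta[outside]), kind="stable")]
    return [order[i:i + m] for i in range(0, order.size, m)]

rng = np.random.default_rng(1)
p, s, m = 100, 4, 6
delta = rng.standard_normal(p)
support = np.arange(s)  # plays the role of S_0
blocks = sorted_block_partition(delta, support, m)  # S_1, ..., S_L

l2_norms = [np.linalg.norm(delta[b]) for b in blocks]
l1_norms = [np.abs(delta[b]).sum() for b in blocks]

# Every entry of block l is no larger than the average entry of block l-1,
# which gives ||delta_l||_2 <= ||delta_{l-1}||_1 / sqrt(m).
shifting_ok = all(
    l2_norms[l] <= l1_norms[l - 1] / np.sqrt(m) + 1e-12
    for l in range(1, len(blocks))
)
```

The inequality is what allows the tail blocks to be controlled by the $\ell_{1}$ norm of $\delta$ outside the support, which the restricted set in turn bounds by the $\ell_{1}$ norm on the support.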

We aim to prove the results with regard to $\widehat{\phi}_{t,u}(X)$ and $\hat{\theta}_{t,u}$ in Theorem 3.1. The derivations for the results regarding $\widetilde{\phi}_{t,u}(X)$ and $\widetilde{\theta}_{t,u}$ are exactly the same. We do not need to deal with the nonlinear logistic link function when deriving the results regarding $\widehat{\nu}_{t}(X)$ , $\widetilde{\nu}_{t}(X)$ , $\hat{\gamma}_{t}$ , and $\tilde{\gamma}_{t}$ . Therefore, the corresponding results can be shown by following the same proof strategy as below and treating $\omega_{t,u}$ defined below as $1$ . The proofs for results regarding $\widehat{\nu}_{t}(X)$ , $\widetilde{\nu}_{t}(X)$ , $\hat{\gamma}_{t}$ , and $\tilde{\gamma}_{t}$ are omitted for brevity.

Let $\tilde{r}_{t,u}^{\phi}=\Lambda^{-1}(\mathbb{E}(Y_{u}|X,T=t))-b(X)^{\prime}\theta_{t,u}$ , $\delta_{t,u}=\hat{\theta}_{t,u}-\theta_{t,u}$ , $\hat{s}_{t,u}=||\hat{\theta}_{t,u}||_{0}$ , $\omega_{t,u}=\mathbb{E}(Y_{u}(t)|X)(1-\mathbb{E}(Y_{u}(t)|X))$ , and $\widehat{\mathcal{S}}_{t,u}$ be the support of $\widehat{\theta}_{t,u}$ . We need the following four lemmas, whose proofs are relegated to the online supplement.

Lemma A.1

If Assumptions 1–4 hold, then

[TABLE]

and

[TABLE]

Lemma A.2

Suppose Assumptions 1–4 hold. Let $\xi_{t,u}=Y_{u}-\phi_{t,u}(X).$ Then

[TABLE]

Lemma A.3

If the assumptions in Theorem 3.1 hold, then there exists a constant $C_{\psi}\in(0,1)$ such that w.p.a.1,

[TABLE]

For any $k=0,1,\cdots,K$ and $\widehat{\Psi}_{t,u}^{k}$ defined in Algorithm 2, there exists a constant $C_{k}\in(0,1)$ such that, w.p.a.1,

[TABLE]

In addition, for any $k=0,1,\cdots,K$ and $\widehat{\Psi}_{t,u}^{k}$ defined in Algorithm 2, there exist constants $l<1<L$ independent of $n$ , $(t,u)$ , and $k$ such that, element-wise and w.p.a.1,

[TABLE]

Lemma A.4

If the assumptions in Theorem 3.1 hold, then w.p.a.1,

[TABLE]

Proof of Theorem 3.1. By the mean value theorem, there exist $\underline{\theta}_{t,u}\in(\theta_{t,u},\hat{\theta}_{t,u})$ and $\overline{r}_{t,u}^{\phi}\in(0,\tilde{r}_{t,u}^{\phi})$ such that

[TABLE]

where $\delta_{t,u}=\hat{\theta}_{t,u}-\theta_{t,u}.$ By the proof of Lemma A.1, we have, w.p.a.1,

[TABLE]

Therefore, by Lemma A.1 and Assumptions 4 and 5, we have

[TABLE]

where the last equality is because $\sup_{(t,u)\in\mathcal{T}\mathcal{U}}||\delta_{t,u}||_{1}=O_{p}((\log(p\vee n)s^{2})^{1/2}(nh_{1})^{-1/2})$ by Lemma A.1 and $\log(p\vee n)s^{2}\zeta_{n}^{2}/(nh_{1})\rightarrow 0$ by Assumption 5. In addition, under Assumption 3.4 we have

[TABLE]

Hence, there exist some positive constants $c$ and $c^{\prime}$ only depending on $\underline{C}$ such that, w.p.a.1,

[TABLE]

and uniformly over $(t,u)\in\mathcal{T}\mathcal{U}$ ,

[TABLE]

By Assumptions 3.3, 3.4, Lemma A.1, and the fact that $\omega_{t,u}$ is bounded and bounded away from zero uniformly over $\mathcal{TU}$ , we have, w.p.a.1,

[TABLE]

and

[TABLE]

Next, recall that $\lambda=\ell_{n}(\log(p\vee n)nh_{1})^{1/2}$ . By the first order conditions (FOC), for any $j\in\widehat{\mathcal{S}}_{t,u}$ , we have

[TABLE]

Denote $\xi_{t,u}=Y_{u}-\phi_{t,u}(X)$ . By Lemmas A.1, A.2 and A.8, for any $\varepsilon>0$ , with probability greater than $1-\varepsilon$ , there exist positive constants $C_{\lambda}$ and $C$ , which only depend on $\varepsilon$ and are independent of $(t,u,n)$ , such that

[TABLE]

where $\phi_{max}(s)=\sup_{||\theta||_{0}\leq s,||\theta||_{2}=1}||b(X)^{\prime}\theta K(\frac{T-t}{h_{1}})^{1/2}||_{\mathbb{P}_{n},2}^{2}$ and $r_{t,u}^{\phi}=r_{t,u}^{\phi}(X)$ . This implies that there exists a constant $C$ only depending on $\varepsilon$ , such that, with probability greater than $1-\varepsilon$ ,

[TABLE]

Let $\mathcal{M}=\{m\in\mathbb{Z}:m>2Cs\phi_{max}(m)/h_{1}\}$ . We claim that, for any $m\in\mathcal{M}$ , $\hat{s}_{t,u}\leq m$ . Suppose not and there exists $m_{0}\in\mathcal{M}$ such that $m_{0}<\hat{s}_{t,u}$ . Then,

[TABLE]

where the second inequality holds because of Belloni and Chernozhukov (2011, Lemma 23), the third inequality holds because $\lceil a\rceil\leq 2a$ for any $a>1$ , and the last inequality holds because $m_{0}\in\mathcal{M}$ . Therefore we reach a contradiction. In addition, by Lemma A.4, we can choose $C_{s}>4C\underline{C}^{-1}(\kappa^{\prime\prime})^{2}$ , which is independent of $(t,u,n)$ , such that

[TABLE]

This implies $C_{s}s\in\mathcal{M}$ and thus with probability greater than $1-\varepsilon$ , $\hat{s}_{t,u}\leq C_{s}s$ . This result holds uniformly over $(t,u)\in\mathcal{T}\mathcal{U}$ .

Last, we show that

[TABLE]

Let $\varepsilon_{n}=(\log(p\vee n)s/(nh_{1}))^{1/2}$ , $\delta_{n}=(\log(p\vee n)s^{2}\zeta_{n}^{2}/(nh_{1}))^{1/2}$ , and

[TABLE]

By (A), (A), and (A.13), for any $\varepsilon>0$ , there exists a constant $M$ such that, with probability greater than $1-\varepsilon$ , $\widehat{\phi}_{t,u}(\cdot)\in\mathcal{J}_{t,u}$ uniformly in $(t,u)\in\mathcal{T}\mathcal{U}$ . Therefore, with probability greater than $1-\varepsilon$ ,

[TABLE]

where $\mathcal{F}=\left\{(J(X)-\phi_{t,u}(X))^{2}\biggl{[}K(\frac{T-t}{h_{1}})-\mathbb{E}(K(\frac{T_{i}-t}{h_{1}})|X)\biggr{]}:J\in\mathcal{J}_{t,u},(t,u)\in\mathcal{TU}\right\}$ with bounded envelope. Note that,

[TABLE]

In addition, we note that $\mathcal{F}$ is nested by

[TABLE]

where

[TABLE]

Therefore, by Chernozhukov et al. (2014b, Corollary 5.1), we have

[TABLE]

Therefore,

[TABLE]

where the last equality holds due to (A) and (A.14). Canceling the $h_{1}$ ’s on both sides, we obtain the desired result.

Proof of Theorem 3.2. By Belloni et al. (2017a, Theorem 6.2), we have

[TABLE]

and

[TABLE]

Then, we have

[TABLE]

and similarly,

[TABLE]

Proof of Theorem 3.3. Let $\hat{\alpha}^{\dagger}(t,u)=\mathbb{P}_{n}\eta\Pi_{t,u}(W_{u},\widehat{\phi}_{t,u},\hat{f}_{t})$ where either $\eta=1$ or $\eta$ is a random variable that has sub-exponential tails with unit mean and variance. When $\eta=1$ , $\hat{\alpha}^{\dagger}(t,u)=\hat{\alpha}(t,u)$ , which is our original estimator. When $\eta$ is random, for $\bar{\eta}=\sum_{i=1}^{n}\eta_{i}/n$ ,

[TABLE]

is the bootstrap estimator. In the following, we establish the linear expansion of $\hat{\alpha}^{\dagger}(t,u)$ .
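The multiplier construction can be sketched for a generic sample average. The sketch assumes that the bootstrap estimator is the $\eta$-weighted average normalized by $\bar{\eta}$, uses standard exponential weights as one example of a distribution with unit mean, unit variance, and sub-exponential tails, and treats the array `scores` as a hypothetical stand-in for the moment function $\Pi_{t,u}$ evaluated at each observation with the first-stage estimates held fixed.

```python
import numpy as np

def weighted_mean_bootstrap(scores, n_boot=500, seed=0):
    """Multiplier bootstrap draws of a sample mean.

    Assumptions: weights eta_i are i.i.d. Exponential(1) (mean 1, variance 1)
    and each draw is (mean of eta_i * score_i) / (mean of eta_i).
    Setting all eta_i = 1 recovers the original estimator scores.mean().
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    eta = rng.exponential(scale=1.0, size=(n_boot, n))
    return (eta @ scores / n) / eta.mean(axis=1)

# Hypothetical scores used only to exercise the function.
rng = np.random.default_rng(42)
scores = rng.normal(loc=2.0, scale=1.0, size=400)
original = scores.mean()
draws = weighted_mean_bootstrap(scores)
boot_sd = draws.std(ddof=1)
```

Because the weights have unit mean, the draws are centered at the original estimator; this is the reason the bootstrap does not reproduce the smoothing bias.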

Recall $\varepsilon_{n}=(\log(p\vee n)s/(nh_{1}))^{1/2}$ and $\delta_{n}=(\log(p\vee n)s^{2}\zeta_{n}^{2}/(nh_{1}))^{1/2}.$ By Theorems 3.1 and 3.2, for any $\varepsilon>0$ , there exists a constant $M$ such that, with probability greater than $1-\varepsilon$ , $\hat{f}_{t}(\cdot)\in\mathcal{G}_{t}$ uniformly in $t\in\mathcal{T}$ and $\widehat{\phi}_{t,u}(\cdot)\in\mathcal{J}_{t,u}$ uniformly in $(t,u)\in\mathcal{T}\mathcal{U}$ . Here, we denote

[TABLE]

and

[TABLE]

We focus on the case in which $(\widehat{\phi}_{t,u},\hat{f}_{t})\in\mathcal{J}_{t,u}\times\mathcal{G}_{t}$ . Then

[TABLE]

where $(\overline{\phi},\overline{f})=(\widehat{\phi}_{t,u},\hat{f}_{t}).$

Below we fix $(\overline{\phi},\overline{f})\in\mathcal{J}_{t,u}\times\mathcal{G}_{t}.$ First,

[TABLE]

where the $o(h_{2}^{2})$ term holds uniformly in $(t,u)\in\mathcal{TU}$ . For term $III$ , uniformly over $(t,u)\in\mathcal{T}\mathcal{U}$ , we have

[TABLE]

The second equality of (A.15) follows because there exists a constant $c$ independent of $n$ such that

[TABLE]

and then

[TABLE]

The third equality of (A.15) holds because $\mathbb{E}(Y_{u}|X,T)=\phi_{T,u}(X)$ . The fourth equality of (A.15) holds by the Cauchy inequality and the facts that $||\overline{f}_{t}(X)-f_{t}(X)||_{\mathbb{P},\infty}=O(\delta_{n}h_{1}^{-1/2})=o(1)$ and that $f_{t}(x)$ is assumed to be bounded away from zero uniformly over $t,\tau$ . The fifth inequality of (A.15) holds because

[TABLE]

and for some constant $c>0$ independent of $(t,u,n)$ ,

[TABLE]

For the term $II$ , we have

[TABLE]

where

[TABLE]

Note $\mathcal{F}$ has envelope $|\frac{\eta}{h_{2}}|$ ,

[TABLE]

The second last inequality in the above display holds because $f_{t}(x)$ is bounded away from zero uniformly in $(t,x)$ , where $t=T+h_{2}v$ belongs to some compact enlargement of $\mathcal{T}$ . Furthermore, $\mathcal{F}$ is nested by

[TABLE]

where

[TABLE]

In addition, we claim $||\max_{1\leq i\leq n}|\eta_{i}/h_{2}|||_{\mathbb{P},2}\lesssim\log(n)h_{2}^{-1}$ . When $\eta=1$ , the claim holds trivially. When $\eta$ has sub-exponential tails, the claim holds by van der Vaart and Wellner (1996, Lemma 2.2.2). Therefore, by Chernozhukov et al. (2014b, Corollary 5.1), we have

[TABLE]

Combining the bounds for $II$ , $III$ , and $IV$ , we have

[TABLE]

and

[TABLE]

Then, when $\eta=1$ ,

[TABLE]

Then, Assumption 5 implies that $\sup_{(t,u)\in\mathcal{T}\mathcal{U}}|R_{n}(t,u)|=o_{p}((nh_{2})^{-1/2})$ . For the bootstrap estimator, we have

[TABLE]

where $\sup_{(t,u)\in\mathcal{TU}}|R^{b}_{n}(t,u)|=O_{p}(\varepsilon_{n}^{2}(h_{2}^{-1/2}+\ell_{n}h_{1}^{-1/2})+\log(n)s\log(p\vee n)(nh_{2})^{-1})+o_{p}(h_{2}^{2}).$ This is because of the fact that

[TABLE]

and the collection of functions

[TABLE]

satisfies

[TABLE]

Therefore,

[TABLE]

where

[TABLE]

Proof of Theorem 3.4. Let $\hat{\alpha}^{\ast}(t,u)$ be either the original or the bootstrap estimator of $\alpha(t,u)$ . We first derive the linear expansion of the rearrangement of $\hat{\alpha}^{\ast}(t,u)$ defined in the proof of Theorem 3.3. For $z\in(0,1)$ , let

[TABLE]

where $\psi(\cdot)$ is defined in Section 3.3. Then, by Lemma B.2 in the online supplement, we have

[TABLE]

and

[TABLE]

where $s_{n}=(nh_{2})^{-1/2}$ , $d_{n}(t,v)=(nh_{2})^{1/2}(\hat{\alpha}^{\ast}(t,\psi^{\leftarrow}(v))-\alpha(t,\psi^{\leftarrow}(v))),$ $f_{Y(t)}(\cdot)$ is the density of $Y(t)$ , $q_{z}(t)$ is the $z$ -th quantile of $Y(t)$ , and $\delta_{n}$ equals either $1$ or $h_{2}^{1/2}$ , depending on whether Assumption 5.1 or 5.2 is in place.

Combining (A.17) and (A.18), we have

[TABLE]

uniformly over $(t,u)\in\mathcal{T}\mathcal{U}$ .

We can apply Lemma B.2 to $\hat{\alpha}^{*r}(t,u)$ again with $J_{n}(t,u)=(nh_{2})^{1/2}(\hat{\alpha}^{*r}(t,u)-\alpha(t,u))$ , $F(t,u)=P(Y(t)\leq u)=\alpha(t,u)$ , $f(t,u)=f_{Y(t)}(u)$ , and $F^{\leftarrow}(t,\tau)=q_{\tau}(t)$ . Then, with $\delta_{n}$ equal to $1$ or $h_{2}^{1/2}$ under Assumption 5.1 or 5.2, respectively, we have,

[TABLE]

uniformly over $(t,\tau)\in\mathcal{T}\mathcal{I}.$
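The two operations analyzed in this proof, monotone rearrangement of the estimated distribution function followed by inversion, can be sketched on a grid. The sketch assumes that rearrangement on an increasing grid amounts to sorting the estimated values and that the quantile is the generalized inverse on that grid; the perturbed normal CDF is a hypothetical stand-in for a non-monotone first-stage estimate.

```python
import math
import numpy as np

def rearrange_cdf(cdf_values):
    """Monotone rearrangement on an increasing grid: sort the values."""
    return np.sort(np.asarray(cdf_values, dtype=float))

def invert_cdf(u_grid, cdf_monotone, tau):
    """Generalized inverse: smallest grid point u with F(u) >= tau."""
    idx = np.searchsorted(cdf_monotone, tau, side="left")
    return u_grid[min(idx, u_grid.size - 1)]

u_grid = np.linspace(-3.0, 3.0, 601)
true_cdf = 0.5 * (1.0 + np.vectorize(math.erf)(u_grid / math.sqrt(2.0)))
# Hypothetical non-monotone estimate: the true CDF plus a small wiggle.
noisy_cdf = true_cdf + 0.02 * np.sin(40.0 * u_grid)

rearranged = rearrange_cdf(noisy_cdf)
q50 = invert_cdf(u_grid, rearranged, 0.5)
err_before = np.abs(noisy_cdf - true_cdf).mean()
err_after = np.abs(rearranged - true_cdf).mean()
```

Sorting never increases the average distance to a monotone target, so the rearranged estimate is at least as accurate as the original one while also being invertible.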

When $\eta=1$ , combining (A.19), (A.20), and Theorem 3.3, we have

[TABLE]

By taking $\delta_{n}=1$ and $\delta_{n}=h_{2}^{1/2}$ under Assumptions 5.1 and 5.2, respectively, we have established the desired results. For the bootstrap estimator, by (A), we have

[TABLE]

Then,

[TABLE]

By taking $\delta_{n}=1$ and $\delta_{n}=h_{2}^{1/2}$ under Assumptions 5.1 and 5.2, respectively, we have established the linear expansion of the bootstrap estimator as well. Last, note that the bootstrap estimator cannot preserve the asymptotic bias term. For the validity of bootstrap inference, we need to under-smooth and require $nh^{5}_{2}\rightarrow 0$ . This condition is assumed in Theorem 4.1.

Proof of Theorem 3.5. We consider the general case in which the observations are weighted by $\{\eta_{i}\}_{i=1}^{n}$ as above. For brevity, denote $\hat{\delta}:=(\hat{\delta}_{0},\hat{\delta}_{1})^{\prime}=(\hat{\beta}_{\tau}^{\ast 0}(t),\hat{\beta}_{\tau}^{\ast 1}(t))^{\prime}$ and $\delta:=(\delta_{0},\delta_{1})^{\prime}=(\beta_{\tau}^{0}(t),\beta_{\tau}^{1}(t))^{\prime}.$ For any variable $R_{n}:=R_{n}(\tau,t)$ and some deterministic sequence $r_{n}$ , we write $R_{n}=O_{p}^{*}(r_{n})$ (resp. $o_{p}^{*}(r_{n})$ ) if $\sup_{(t,\tau)\in\mathcal{TI}}|R_{n}(\tau,t)|=O_{p}(r_{n})$ (resp. $o_{p}(r_{n})$ ). Then $\hat{\delta}=\widehat{\Sigma}_{2}^{-1}\widehat{\Sigma}_{1},$ where

[TABLE]

and

[TABLE]

Let $\Sigma_{2}=\begin{pmatrix}f(t)&0\\ \kappa_{2}f^{(1)}(t)&\kappa_{2}f(t)\end{pmatrix}$ and $G=\begin{pmatrix}h_{2}^{-1}&0\\ 0&h_{2}^{-3}\end{pmatrix}$ . Then we have

[TABLE]

In addition, note

[TABLE]

and

[TABLE]

Therefore,

[TABLE]

Let $E(t,\tau)=\mathbb{E}\frac{Y_{q_{\tau}(t),j}-\phi_{t,q_{\tau}(t)}(X_{j})}{f_{t}(X_{j})h_{2}}K(\frac{T_{j}-t}{h_{2}})+\tau$ . By Theorem 3.4, we have

[TABLE]

Let $\Upsilon_{i}=(Y_{i},T_{i},X_{i},\eta_{i})$ . Then, by plugging (A.22) in (A.21) and noticing that

[TABLE]

we have

[TABLE]

where $\Gamma(\Upsilon_{i},\Upsilon_{j};t,\tau)=(\Gamma_{0}(\Upsilon_{i},\Upsilon_{j};t,\tau),\Gamma_{1}(\Upsilon_{i},\Upsilon_{j};t,\tau))^{\prime}$ , and

[TABLE]

for $\ell=0,1.$ Let $\Gamma^{s}(\Upsilon_{i},\Upsilon_{j};t,\tau)=(\Gamma(\Upsilon_{i},\Upsilon_{j};t,\tau)+\Gamma(\Upsilon_{j},\Upsilon_{i};t,\tau))/2$ . Because $nh_{2}^{7}\rightarrow 0$ , we have

[TABLE]

where $e_{2}=(0,1)^{\prime}$ and $U_{n}(t,\tau)=(C_{n}^{2})^{-1}\sum_{1\leq i<j\leq n}\eta_{i}\eta_{j}\Gamma^{s}(\cdot,\cdot;t,\tau)$ is a U-process indexed by $(t,\tau)$ . By Lemma B.3 in the online supplement,

[TABLE]

Combining (A.23) and (A.24), we have

[TABLE]
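The symmetrization step in the proof of Theorem 3.5 can be checked numerically: a weighted double sum over ordered pairs $i\neq j$ equals the U-statistic built from the symmetrized kernel $\Gamma^{s}$ over pairs $i<j$. In the sketch below the matrix `kernel_matrix` is a hypothetical stand-in for $\Gamma(\Upsilon_{i},\Upsilon_{j};t,\tau)$ at fixed $(t,\tau)$, and the weights are drawn from a standard exponential distribution as an assumption.

```python
from itertools import combinations

import numpy as np

def u_statistic(kernel_matrix, eta):
    """(n choose 2)^{-1} * sum_{i<j} eta_i eta_j Gamma^s(i, j),
    where Gamma^s(i, j) = (Gamma(i, j) + Gamma(j, i)) / 2."""
    n = eta.size
    sym = 0.5 * (kernel_matrix + kernel_matrix.T)
    total = sum(
        eta[i] * eta[j] * sym[i, j] for i, j in combinations(range(n), 2)
    )
    return total / (n * (n - 1) / 2)

def double_sum(kernel_matrix, eta):
    """(n (n - 1))^{-1} * sum_{i != j} eta_i eta_j Gamma(i, j)."""
    n = eta.size
    weighted = np.outer(eta, eta) * kernel_matrix
    return (weighted.sum() - np.trace(weighted)) / (n * (n - 1))

rng = np.random.default_rng(3)
n = 30
kernel_matrix = rng.standard_normal((n, n))  # asymmetric on purpose
eta = rng.exponential(size=n)

u_value = u_statistic(kernel_matrix, eta)
ds_value = double_sum(kernel_matrix, eta)
```

Working with the symmetric kernel is what makes standard U-process results applicable without changing the value of the statistic.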

Proof of Theorem 4.1. By the proofs of Theorems 3.4 and 3.5, we have

[TABLE]

and

[TABLE]

Then, it is straightforward to show that $\sqrt{nh_{2}}(\hat{q}_{\tau}^{b}(t)-\hat{q}_{\tau}(t))$ and $(nh_{2}^{3})^{1/2}(\hat{\beta}_{\tau}^{1b}(t)-\hat{\beta}_{\tau}^{1}(t))$ converge weakly to the limiting distribution of $\sqrt{nh_{2}}(\hat{q}_{\tau}(t)-q_{\tau}(t))$ and $(nh_{2}^{3})^{1/2}(\hat{\beta}_{\tau}^{1}(t)-\beta_{\tau}^{1}(t))$ , respectively, conditional on data in the sense of van der Vaart and Wellner (1996, Section 2.9). The desired results then follow.

Appendix B Proofs of the Technical Lemmas

Lemma A.1 and Lemma B.1 below are closely related to Lemmas J.6 and O.2 in Belloni et al. (2017a) with one major difference: we have an additional kernel function which affects the rate of convergence. We follow the proof strategies in Belloni et al. (2017a) in general, but use the local compatibility condition established in Lemma 3.1 when needed. We include these proofs mainly for completeness. Lemma A.2 is proved without referring to the theory of moderate deviations for self-normalized sums, in contrast to the proof of Lemma J.1 in Belloni et al. (2017a). Consequently, we have the additional $\ell_{n}$ term but avoid one constraint on the rates of $p$ , $s$ , and $n$ , as well.

Proof of Lemma A.1. We define the following three events:

[TABLE]

and

[TABLE]

where $l$ , $L$ , and $C_{\psi}$ are defined in the statement of Lemma A.8 and the generic penalty loading matrix is $\widehat{\Psi}_{t,u}=\widehat{\Psi}_{t,u}^{k}$ for $k=0,\cdots,K$ .

By Assumption 2.4, for an arbitrary $\varepsilon>0$ , we can choose $C_{r}$ and $n$ sufficiently large so that $\mathbb{P}(E_{1})\geq 1-\varepsilon.$ By Lemma A.2 below and the fact that $\ell_{n}\rightarrow\infty$ , for any $\varepsilon>0$ and any $C_{\lambda}>0$ , for $n$ sufficiently large, we have $\mathbb{P}(E_{2})\geq 1-\varepsilon.$ In particular, we choose $C_{\lambda}$ such that $C_{\lambda}l>1$ . Last, by Lemma A.8 below, $\mathbb{P}(E_{3})>1-\varepsilon_{n}$ for some deterministic sequence $\varepsilon_{n}\downarrow 0$ .

From now on we assume $E_{1}$ , $E_{2}$ , and $E_{3}$ hold with constants $C_{r}$ , $C_{\lambda}$ , $l$ , and $L$ , which occurs with probability greater than $1-2\varepsilon-\varepsilon_{n}$ . Let $\delta_{t,u}=\hat{\theta}_{t,u}-\theta_{t,u}$ and $\mathcal{S}^{0}_{t,u}=\text{Supp}(\theta_{t,u})$ . Let

[TABLE]

and

[TABLE]

Then, under $E_{3}$ ,

[TABLE]

Let $Q_{t,u}(\theta)=\mathbb{P}_{n}M(Y_{u},X;\theta)K(\frac{T-t}{h_{1}})$ . By the fact that $\hat{\theta}_{t,u}$ solves the minimization problem in (3.5), we have

[TABLE]

Because the kernel function $K(\cdot)$ is nonnegative, $Q_{t,u}(\theta)$ is convex in $\theta$ . It follows that $Q_{t,u}(\hat{\theta}_{t,u})-Q_{t,u}(\theta_{t,u})\geq\partial_{\theta}Q_{t,u}(\theta_{t,u})^{\prime}\delta_{t,u}.$
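The convexity argument can be checked numerically. The sketch assumes that $M(Y_{u},X;\theta)$ is the logistic negative log-likelihood, $\log(1+\exp(b(X)^{\prime}\theta))-Y_{u}b(X)^{\prime}\theta$, and uses a Gaussian kernel; the simulated data, the bandwidth, and the function names are hypothetical.

```python
import numpy as np

def local_logistic_loss(theta, y, b, kernel_weights):
    """Kernel-weighted logistic negative log-likelihood (sample average)."""
    index = b @ theta
    # log(1 + exp(index)) - y * index, computed in a numerically stable way.
    per_obs = np.logaddexp(0.0, index) - y * index
    return np.mean(kernel_weights * per_obs)

def local_logistic_grad(theta, y, b, kernel_weights):
    """Gradient of the kernel-weighted loss with respect to theta."""
    prob = 1.0 / (1.0 + np.exp(-(b @ theta)))
    return b.T @ (kernel_weights * (prob - y)) / y.size

rng = np.random.default_rng(7)
n, p, t0, h1 = 300, 5, 0.5, 0.1
b = rng.standard_normal((n, p))
t = rng.uniform(size=n)
y = (rng.uniform(size=n) < 0.5).astype(float)
kernel_weights = np.exp(-0.5 * ((t - t0) / h1) ** 2)  # nonnegative

theta_0 = rng.standard_normal(p)
theta_1 = rng.standard_normal(p)
lhs = (local_logistic_loss(theta_1, y, b, kernel_weights)
       - local_logistic_loss(theta_0, y, b, kernel_weights))
rhs = local_logistic_grad(theta_0, y, b, kernel_weights) @ (theta_1 - theta_0)
```

Each per-observation loss is convex in $\theta$ and the kernel weights are nonnegative, so the weighted average is convex and lies above its tangent plane, which is the inequality `lhs >= rhs` used in the proof.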

Let $D_{t,u}=-\mathbb{P}_{n}b(X)\xi_{t,u}K(\frac{T-t}{h_{1}})$ and $\xi_{t,u}=Y_{u}-\phi_{t,u}(X)$ . Then,

[TABLE]

where $r_{t,u}^{\phi}=r_{t,u}^{\phi}(X)$ . Combining (B.1) and (B.2), we have

[TABLE]

Then

[TABLE]

We will consider two cases: $\delta_{t,u}\notin\Delta_{2\tilde{c},t,u}$ and $\delta_{t,u}\in\Delta_{2\tilde{c},t,u}.$

First, if $\delta_{t,u}\notin\Delta_{2\tilde{c},t,u}$ , i.e., $||(\delta_{t,u})_{\mathcal{S}_{t,u}^{0c}}||_{1}\geq 2\tilde{c}||(\delta_{t,u})_{\mathcal{S}^{0}_{t,u}}||_{1}$ , then

[TABLE]

Noting that $\tilde{c}\geq 1$ , we have

[TABLE]

Now, we consider the case where $\delta_{t,u}\in\Delta_{2\tilde{c},t,u}$ . By Lemma 3.1, we have

[TABLE]

In addition, $\omega_{t,u}\in(\underline{C}(1-\underline{C}),1/4)$ . If $\delta_{t,u}\in\Delta_{2\tilde{c},t,u}$ , then

[TABLE]

In this case, $||\delta_{t,u}||_{1}\leq(1+2\tilde{c})II_{t,u}$ .

In sum, we have

[TABLE]

and $\delta_{t,u}\in A_{t,u}:=\Delta_{2\tilde{c},t,u}\cup\{\delta:||\delta||_{1}\leq I_{t,u}\}.$

Recall $\tilde{r}_{t,u}^{\phi}=\Lambda^{-1}(\Lambda(b(X)^{\prime}\theta_{t,u})+r_{t,u}^{\phi})-b(X)^{\prime}\theta_{t,u}$ and denote

[TABLE]

Then, w.p.a.1., for some $\overline{r}_{t,u}^{\phi}$ between [math] and $r_{t,u}^{\phi}$ ,

[TABLE]

where the second line holds because $\sup_{(t,u)\in\mathcal{T}\mathcal{U}}||r_{t,u}^{\phi}||_{\mathbb{P},\infty}\overset{p}{\longrightarrow}0$ . In addition, by Lemma B.1 below and equations (B.1)–(B.3), we have

[TABLE]

where the last inequality holds because $|r_{t,u}^{\phi}|\leq|\tilde{r}_{t,u}^{\phi}|$ . If

[TABLE]

then

[TABLE]

and

[TABLE]

Since $E_{1}$ holds,

[TABLE]

Further note that $\lambda=\ell_{n}(\log(p\vee n)nh_{1})^{1/2}$ . Hence, if (B.4) holds, then (B.5) and (B.6) imply that

[TABLE]

with $C_{\Gamma}=3(9\tilde{c}[\underline{C}/2(1-\underline{C}/2)]^{-1}C_{r}+(LC_{\lambda}+1)2C_{\psi}(1+2\tilde{c})/\underline{\kappa})$ and

[TABLE]

with $C_{1}=\frac{2(1+2\tilde{c})}{\underline{\kappa}}C_{\Gamma}$ , which are the desired results.

Last, we verify (B.4). By Lemma B.1, since $\ell_{n}^{2}\log(p\vee n)s^{2}\zeta_{n}^{2}/(nh_{1})\rightarrow 0$ ,

[TABLE]

This concludes the proof.

Proof of Lemma A.2. By Lemma A.8 below, $\widehat{\Psi}_{t,u}^{-1}$ is bounded away from zero w.p.a.1, uniformly over $(t,u)$ . Therefore, we can just focus on bounding

[TABLE]

For the $j$-th element, $1\leq j\leq p$ ,

[TABLE]

where $c$ is a universal constant independent of $(j,t,u,n)$ . In addition,

[TABLE]

Therefore,

[TABLE]

Next, we turn to the centered term $\sup_{g\in\mathcal{G}}|(\mathbb{P}_{n}-\mathbb{P})g|,$ where $\mathcal{G}=\{\xi_{t,u}b_{j}(X)K(\frac{T-t}{h_{1}}):(t,u)\in\mathcal{T}\mathcal{U},1\leq j\leq p\}$ with envelope $G=\overline{C}_{K}\zeta_{n}$ . Note that $\sup_{g\in\mathcal{G}}\mathbb{E}g^{2}\lesssim h_{1}$ and $\sup_{Q}N(\mathcal{G},e_{Q},\varepsilon||G||)\leq p\left(\frac{A}{\varepsilon}\right)^{v}$ for some $A>e$ and $v>0$ . So by Corollary 5.1 of Chernozhukov et al. (2014b), we have

[TABLE]

because $\log(p\vee n)\zeta_{n}^{2}/(nh_{1})\rightarrow 0$ .

Proof of Lemma A.8. For the first result, we have

[TABLE]

Let $\kappa_{1}=\int K(u)^{2}du$ . Then,

[TABLE]

Similarly,

[TABLE]

In addition, denote $\mathcal{F}=\{\frac{1}{h_{1}}K(\frac{T-t}{h_{1}})^{2}(Y_{u}-\phi_{t,u}(X))^{2}b_{j}^{2}(X):(t,u)\in\mathcal{TU},j=1,\cdots,p\}$ with envelope $C\zeta_{n}^{2}/h_{1}$ . The entropy of $\mathcal{F}$ is bounded by $p(\frac{A}{\varepsilon})^{v}$ . In addition, $\sup_{f\in\mathcal{F}}\mathbb{E}f^{2}\lesssim\zeta_{n}^{2}/h_{1}$ . Therefore,

[TABLE]

Therefore, w.p.a.1,

[TABLE]

For $k=0$ , we let $\mathcal{F}=\{\frac{1}{h_{1}}K(\frac{T-t}{h_{1}})^{2}Y^{2}_{u}b_{j}^{2}(X):(t,u)\in\mathcal{TU},j=1,\cdots,p\}$ with envelope $C\zeta_{n}^{2}/h_{1}$ . By the same argument as above, we can show that, w.p.a.1,

[TABLE]

For $k\geq 1$ , we have, w.p.a.1,

[TABLE]

Similarly, we can show that, w.p.a.1,

[TABLE]

This concludes the second result with $C_{k}=C_{\psi}$ for $k=1,\cdots,K$ . The last result holds with $l=\min(C_{0}C_{\psi}/4,\cdots,C_{k}C_{\psi}/4,1)$ and $L=\max(4/(C_{0}C_{\psi}),\cdots,4/(C_{k}C_{\psi}),1).$

Proof of Lemma A.4. Following the same arguments as used in the proof of Lemma 3.1 and by Assumption 5, we have, w.p.a.1,

[TABLE]

where the second inequality holds because

[TABLE]

Lemma B.1

Recall that $Q_{t,u}(\theta)=\mathbb{P}_{n}M(Y_{u},X;\theta)K(\frac{T-t}{h_{1}})$ . Let $\overline{q}_{A_{t,u}}=\inf_{\delta\in A_{t,u}}\frac{[\mathbb{P}_{n}\omega_{t,u}|b(X)^{\prime}\delta|^{2}K(\frac{T-t}{h_{1}})]^{3/2}}{\mathbb{P}_{n}\omega_{t,u}|b(X)^{\prime}\delta|^{3}K(\frac{T-t}{h_{1}})},$ $\Gamma_{t,u}^{\delta}=||\omega_{t,u}^{1/2}b(X)^{\prime}\delta K(\frac{T-t}{h_{1}})^{1/2}||_{\mathbb{P}_{n},2}$ , and $s_{t,u}=||\theta_{t,u}||_{0}$ . Let events $E_{1}$ , $E_{2}$ , and $E_{3}$ defined in the proof of Lemma A.1 hold. Then, for any $(t,u)\in\mathcal{T}\mathcal{U}$ and $\delta\in A_{t,u}$ , we have

[TABLE]

and w.p.a.1,

[TABLE]

Proof. The proof follows closely from that of Lemma O.2 in Belloni et al. (2017a). Note that

[TABLE]

where $\tilde{g}_{t,u}(s)=\log[1+\exp(b(X)^{\prime}(\theta_{t,u}+s\delta))]K(\frac{T-t}{h_{1}})$ . Let $g_{t,u}(s)=\log[1+\exp(b(X)^{\prime}(\theta_{t,u}+s\delta)+\tilde{r}_{t,u}^{\phi})]K(\frac{T-t}{h_{1}})$ . Then

[TABLE]

and

[TABLE]

By Lemmas O.3 and O.4 in Belloni et al. (2017a),

[TABLE]

Let $\Upsilon_{t,u}(s)=\tilde{g}_{t,u}(s)-g_{t,u}(s).$ Then

[TABLE]

It follows that

[TABLE]

and

[TABLE]

We consider two cases: $\Gamma_{t,u}^{\delta}\leq\overline{q}_{A_{t,u}}$ and $\Gamma_{t,u}^{\delta}>\overline{q}_{A_{t,u}}.$

First, if $\Gamma_{t,u}^{\delta}\leq\overline{q}_{A_{t,u}}$ , we have

[TABLE]

and

[TABLE]

When $\Gamma_{t,u}^{\delta}>\overline{q}_{A_{t,u}}$ , we let $\tilde{\delta}=\delta\overline{q}_{A_{t,u}}/\Gamma_{t,u}^{\delta}\in A_{t,u}$ . Then by the convexity of $F_{t,u}(\delta)$ and the fact that $F_{t,u}(0)=0$ , we have

[TABLE]

Consequently, we have $F_{t,u}(\delta)\geq\min(\frac{1}{3}(\Gamma_{t,u}^{\delta})^{2},\frac{\overline{q}_{A_{t,u}}}{3}\Gamma_{t,u}^{\delta}).$
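The scaling argument in the second case can be spelled out as follows. This is a sketch of the omitted step, assuming that the first case delivers $F_{t,u}(\tilde{\delta})\geq\frac{1}{3}(\Gamma_{t,u}^{\tilde{\delta}})^{2}$ whenever $\Gamma_{t,u}^{\tilde{\delta}}\leq\overline{q}_{A_{t,u}}$ , as the final bound suggests, and using that $\Gamma_{t,u}^{\delta}$ is homogeneous of degree one in $\delta$ .

```latex
\begin{align*}
\Gamma_{t,u}^{\tilde{\delta}}
  &= \frac{\overline{q}_{A_{t,u}}}{\Gamma_{t,u}^{\delta}}\,\Gamma_{t,u}^{\delta}
   = \overline{q}_{A_{t,u}}, \\
F_{t,u}(\tilde{\delta})
  &= F_{t,u}\!\left(\frac{\overline{q}_{A_{t,u}}}{\Gamma_{t,u}^{\delta}}\,\delta
     + \Bigl(1-\frac{\overline{q}_{A_{t,u}}}{\Gamma_{t,u}^{\delta}}\Bigr)\cdot 0\right)
   \leq \frac{\overline{q}_{A_{t,u}}}{\Gamma_{t,u}^{\delta}}\,F_{t,u}(\delta), \\
F_{t,u}(\delta)
  &\geq \frac{\Gamma_{t,u}^{\delta}}{\overline{q}_{A_{t,u}}}\,F_{t,u}(\tilde{\delta})
   \geq \frac{\Gamma_{t,u}^{\delta}}{\overline{q}_{A_{t,u}}}\cdot\frac{1}{3}\,\overline{q}_{A_{t,u}}^{2}
   = \frac{\overline{q}_{A_{t,u}}}{3}\,\Gamma_{t,u}^{\delta}.
\end{align*}
```

The second line uses the convexity of $F_{t,u}$ together with $F_{t,u}(0)=0$ , and the weight $\overline{q}_{A_{t,u}}/\Gamma_{t,u}^{\delta}$ lies in $(0,1)$ precisely because $\Gamma_{t,u}^{\delta}>\overline{q}_{A_{t,u}}$ .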

For the second result, note that

[TABLE]

If $\delta\in\Delta_{2\tilde{c},t,u}$ , then by Lemma 3.1

[TABLE]

If $||\delta||_{1}\leq I_{t,u}$ , where $I_{t,u}$ is defined in the proof of Lemma A.1, then

[TABLE]

Combining the above two results, we obtain that

[TABLE]

Lemma B.2

Let $q_{y}(t)$ be the $y$ -th quantile of $Y(t)$ , $f_{Y(t)}(\cdot)$ the unconditional density of $Y(t)$ ,

[TABLE]

$s_{n}=(nh_{2})^{-1/2}$ , $d_{n}(t,v)=(nh_{2})^{1/2}(\hat{\alpha}^{\ast}(t,\psi^{\leftarrow}(v))-\alpha(t,\psi^{\leftarrow}(v))),$ and $J_{n}(t,y)=\frac{F(t,y|d_{n})-F(t,y)}{s_{n}}.$ Then, for $\delta_{n}$ being either $1$ or $h_{2}^{1/2}$ , depending on whether Assumption 5.1 or 5.2 is in place,

[TABLE]

and

[TABLE]

uniformly over $(t,y)\in\{(t,y):y=\alpha(t,\psi^{\leftarrow}(v)),(t,v)\in\mathcal{T}\times[0,1]\}.$

Proof. Let $Q(t,v)=\alpha(t,\psi^{\leftarrow}(v))$ for $v\in[0,1]$ . Then, we have

[TABLE]

We prove the lemma by applying Propositions C.1 and C.2 in Appendix C.

First, we verify Assumption 7 with $(\delta_{n},\varepsilon_{n})=(1,(nh_{2})^{-1/2}\log(n))$ and $(\delta_{n},\varepsilon_{n})=(h_{2}^{1/2},(nh_{2})^{-1/2}\log(n))$ under Assumptions 5.1 and 5.2, respectively, in order to apply Proposition C.1 to prove (B.7). We only consider the case in which $\delta_{n}=h_{2}^{1/2}$ as the $\delta_{n}=1$ case can be studied similarly. Note that $Q(t,v)=\alpha(t,\psi^{\leftarrow}(v))$ , $\partial_{u}\alpha(t,u)=f_{Y(t)}(u)>0$ uniformly over $(t,u)\in\mathcal{T}\mathcal{U}$ , and $\psi(\cdot)$ can be chosen such that $\partial_{v}\psi^{\leftarrow}(v)>0$ uniformly over $v\in[0,1]$ . This verifies Assumption 7.1.

For Assumption 7.2, by Theorem 3.3, $\sup_{(t,v)\in\mathcal{T}\times[0,1]}|d_{n}(t,v)|=O_{p}(\log^{1/2}(n))$ . So we can take $\varepsilon_{n}=(nh_{2})^{-1/2}\log(n)$ . In addition, $\sup_{(t,v)\in\mathcal{T}\times[0,1]}|d_{n}^{2}(t,v)|s_{n}=O_{p}(\log(n)(nh_{2})^{-1/2})=o_{p}(h_{2}^{1/2})$ because $nh_{2}^{2}/\log^{2}(n)\rightarrow\infty$ . So we only need to show

[TABLE]

Let

[TABLE]

with envelope $c\eta h_{2}^{-1}$ . By Theorem 3.3, we have

[TABLE]

$\sup_{(t,v)\in\mathcal{T}\times[0,1]}R_{n}(t,\psi^{\leftarrow}(v))=o_{p}(\delta_{n})$ . So we only have to show that

[TABLE]

We know that $\mathcal{G}$ is VC-type with fixed VC index and that $\sup_{g\in\mathcal{G}}\mathbb{E}g^{2}\leq\varepsilon_{n}h_{2}^{-1}.$ In addition, as shown in the proof of Theorem 3.4, $||\max_{1\leq i\leq n}|\eta_{i}h_{2}^{-1}|||_{P,2}\leq\log(n)/h_{2}$ . Therefore, by Corollary 5.1 of Chernozhukov et al. (2014b), we have

[TABLE]

Given $\varepsilon_{n}=(nh_{2})^{-1/2}\log(n)$ , $(\log(n)\varepsilon_{n})^{1/2}=o(h_{2}^{1/2})$ because $h_{2}=C_{2}n^{-H_{2}}$ for some $H_{2}<1/3$ . This establishes (B.9). Then (B.7) follows by Proposition C.1.

To prove (B.8), we apply Proposition C.2 by verifying Assumption 8. We note that $\hat{\alpha}^{\ast r}(t,u)=F^{\leftarrow}(t,\psi(u)|d_{n})$ and $J_{n}(t,y)=\frac{F(t,y|d_{n})-F(t,y)}{s_{n}}$ . Furthermore, notice that $\alpha^{\ast r}(t,u)=\alpha(t,u)=F^{\leftarrow}(t,\psi(u))$ , $F^{\leftarrow}(t,v)=\alpha(t,\psi^{\leftarrow}(v))$ ,

[TABLE]

and

[TABLE]

Because $f_{Y(t)}(q_{y}(t))$ is bounded and bounded away from zero uniformly over $(t,y)\in\mathcal{TY}$ , so is $\partial_{y}F(t,y)$ . In addition,

[TABLE]

which is bounded because $f_{Y(t)}^{\prime}(q_{y}(t))$ is bounded. This verifies Assumption 8.2.

For Assumption 8.3, we note that

[TABLE]

where the $o_{p}(\delta_{n})$ is uniform over $(t,y)\in\mathcal{TY}$ . In addition, by definition, $(t,q_{y}(t))\in\mathcal{T}\mathcal{U}$ , $f_{Y(t)}(q_{y}(t))$ is bounded away from zero, and we can choose $\psi$ such that $\psi^{\prime}(q_{y}(t))$ is bounded. Therefore, by Theorem 3.3 ,

[TABLE]

We can choose $\varepsilon_{n}=s_{n}\log(n)$ . In addition, $\sup_{(t,y)\in\mathcal{TY}}|J_{n}(t,y)|^{2}s_{n}=o_{p}(h_{2}^{1/2})$ because $nh_{2}^{3}\rightarrow\infty$ . So we only need to show that

[TABLE]

Note that, for $v=\psi(Q_{Y_{t}}(y))$ and $v^{\prime}=\psi(Q_{Y_{t}}(y^{\prime}))$

[TABLE]

In addition, $\psi(Q_{Y_{t}}(y))$ is Lipschitz uniformly over $(t,y)\in\mathcal{TY}$ . Thus,

[TABLE]

given that $h_{2}=C_{2}n^{-H_{2}}$ for some $H_{2}<1/3$ . This completes the verification of Assumption 8.3.

Last, it is essentially the same as above to verify Assumption 8 for $J_{n}(t,u)=(nh_{2})^{1/2}(\hat{\alpha}^{\ast r}(t,u)-\alpha(t,u))$ . The proof is omitted.

Lemma B.3

Suppose the conditions in Theorem 3.5 hold. Then

[TABLE]

Proof. Note that

[TABLE]

where $\mathcal{U}_{n}$ assigns probability $\frac{1}{n(n-1)}$ to each pair of observations and

[TABLE]

Let $\mathcal{H}=\{H(\cdot,\cdot;t,\tau),(t,\tau)\in\mathcal{T}\mathcal{I}\}.$ Note that $\mathcal{H}$ is nested by a VC-class with fixed VC-index and has envelope $(C\sup_{i\neq j}|\eta_{i}\eta_{j}|h_{2}^{-2},C\sup_{i\neq j}|\eta_{i}\eta_{j}|h_{2}^{-3})^{\prime}$ for some large constant $C$ . Then, by Chen and Kato (2017, Corollary 5.6), there exist some constants $A\geq e$ and $v\geq 1$ such that

[TABLE]

which implies that

[TABLE]

Now we compute $\frac{2}{n}\sum_{j=1}^{n}\eta_{j}\mathbb{P}\Gamma^{s}(\cdot,\Upsilon_{j};t,\tau)$ , whose first and second elements are

[TABLE]

and

[TABLE]

respectively. By the usual maximal inequality,

[TABLE]

For the second element in $\mathbb{P}\Gamma^{s}(\cdot,\Upsilon_{j};t,\tau)$ , we first note that

[TABLE]

and

[TABLE]

Therefore, by the usual maximal inequality,

[TABLE]

Next, we turn to

[TABLE]

which has zero mean. Note that

[TABLE]

Therefore, by Chernozhukov et al. (2014b, Corollary 5.1), we have

[TABLE]

and

[TABLE]

In addition, note that

[TABLE]

Therefore, by Chernozhukov et al. (2014b, Corollary 5.1),

[TABLE]

and

[TABLE]

Combining the above results and denoting $\overline{K}(u)=\int vK(u-v)K(v)dv$ , we have

[TABLE]

and

[TABLE]

Combining (B.10), (B.11), and (B.12), we have the desired results.

Appendix C Rearrangement Operator on A Local Process

The rearrangement operator has been previously studied by Chernozhukov et al. (2010), who required the underlying process to be tight in order to apply the continuous mapping theorem. However, the local processes encountered in our paper are not tight due to the presence of the kernel function. Therefore, the original results on the rearrangement operator cannot be applied directly to our case. Instead, in this section, we extend the results in Chernozhukov et al. (2010) to the case in which the underlying process is not tight.

Let $Q(t,v)$ be a generic monotonic function in $v\in[0,1]$ . The functional $\Psi$ maps $Q(t,v)$ to $F(t,y)$ as follows:

[TABLE]

We want to derive a linear expansion of $\Psi(Q+s_{n}d_{n})-\Psi(Q)$ where $s_{n}\downarrow 0$ as the sample size $n\rightarrow\infty$ and $d_{n}(t,v)$ is some perturbation function.
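To make the functional concrete, the following is a minimal sketch that assumes $\Psi$ takes the form $F(t,y)=\int_{0}^{1}1\{Q(t,v)\leq y\}dv$ , which matches the definition of $F(t,y|d_{n})$ in Proposition C.1, and that suppresses the argument $t$ . The helper `psi_map`, the grid size, and the test functions are hypothetical and chosen only for illustration.

```python
import numpy as np

def psi_map(q_vals, y_grid):
    """Approximate F(y) = int_0^1 1{Q(v) <= y} dv by an average over the v-grid."""
    return np.array([(q_vals <= y).mean() for y in y_grid])

n_grid = 20000
v = (np.arange(n_grid) + 0.5) / n_grid       # midpoint grid on [0, 1]
y_grid = np.array([0.04, 0.25, 0.64])

# Monotone example: Q(v) = v^2, so F(y) = sqrt(y) on [0, 1].
q = v**2
F = psi_map(q, y_grid)

# Perturbed, non-monotone example: sorting the values gives the monotone
# rearrangement, which has the same image under psi_map.
q_pert = q + 0.05 * np.sin(8 * np.pi * v)
q_rearranged = np.sort(q_pert)
F_pert = psi_map(q_pert, y_grid)
F_rearranged = psi_map(q_rearranged, y_grid)
```

Since the integral depends on $Q$ only through the distribution of its values over $v\in[0,1]$ , `F_pert` and `F_rearranged` coincide, while `q_rearranged` is nondecreasing by construction.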

Assumption 7

1. $Q(t,v)$ is twice differentiable w.r.t. $v$ with both derivatives bounded. In addition, $\partial_{v}Q(t,v)>c$ for some positive constant $c$ , uniformly over $(t,v)\in\mathcal{T}\times[0,1]$ .

2. There exist two vanishing sequences $\varepsilon_{n}$ and $\delta_{n}$ such that

[TABLE]

The following proposition extends the first part of Proposition 2 in Chernozhukov et al. (2010).

Proposition C.1

Let $(t,y)\in\mathcal{TY}:=\{(t,y):y=Q(t,v),(t,v)\in\mathcal{T}\times[0,1]\}$ , $F(t,y|d_{n})=\int_{0}^{1}1\{Q(t,v)+s_{n}d_{n}(t,v)\leq y\}dv$ , and $y=Q(t,v^{y})$ . If Assumption 7 holds, then

[TABLE]

uniformly over $(t,y)\in\mathcal{TY}$ .
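A numerical check of this type of expansion may help fix ideas. The sketch below assumes that the leading term of $(F(t,y|d_{n})-F(t,y))/s_{n}$ is $-d_{n}(t,v^{y})/\partial_{v}Q(t,v^{y})$ , which is the form of the first part of Proposition 2 in Chernozhukov et al. (2010); the functions `Q`, `dQ`, and `d`, the fixed step `s`, and the grid size are hypothetical, and $t$ is suppressed.

```python
import numpy as np

N = 1_000_000
v = (np.arange(N) + 0.5) / N          # midpoint grid on [0, 1]

def Q(v):
    return v + 0.5 * v**2             # increasing, with derivative 1 + v >= 1

def dQ(v):
    return 1.0 + v

def d(v):
    return np.cos(3.0 * v)            # smooth, deterministic perturbation (assumption)

s = 1e-3                              # plays the role of s_n
v_y = 0.4
y = Q(v_y)                            # so that y = Q(v^y)

F0 = np.mean(Q(v) <= y)               # F(y)
F1 = np.mean(Q(v) + s * d(v) <= y)    # F(y | d)
finite_diff = (F1 - F0) / s
first_order = -d(v_y) / dQ(v_y)       # assumed leading term
```

In this toy case the perturbation is smooth and non-random, so it does not capture the non-tightness that motivates Assumption 7; it only illustrates the shape of the leading term.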

Proof. Consider $(t_{n},y_{n})\rightarrow(t_{0},y_{0})$ and denote $v_{n}$ as $y_{n}=Q(t_{n},v_{n})$ . Note that

[TABLE]

Let $\mathbb{B}_{\varepsilon}(v)=\{v^{\prime}:\left|v-v^{\prime}\right|\leq\varepsilon\}$ . For fixed $n$ , if $v\in\mathbb{B}_{\varepsilon_{n}}(v_{n})\cap[0,1]$ , by Assumption 7,

[TABLE]

Then for any $\delta>0$ , there exists $n_{1}$ such that if $n\geq n_{1}$ , $|d_{n}(t_{n},v)-d_{n}(t_{n},v_{n})|\leq\delta\delta_{n}$ and

[TABLE]

If $v\notin\mathbb{B}_{\varepsilon_{n}}(v_{n})$ , then there exists $n_{2}$ such that for $n\geq n_{2}$ ,

[TABLE]

Furthermore, by Assumption 7,

[TABLE]

Therefore,

[TABLE]

and

[TABLE]

where the equality follows by the change of variables: $y=Q(t_{n},v)$ , $v_{n}(y)=Q^{\leftarrow}(t_{n},\cdot)(y)$ , and $\mathbb{J}_{n}$ is the image of $\mathbb{B}_{\varepsilon_{n}}(v_{n})$ . By (C.1) and Assumption 7.2, $[y_{n},y_{n}-s_{n}(d_{n}(t_{n},v_{n})-\delta\delta_{n})]$ is nested by $\mathbb{J}_{n}$ for $n$ sufficiently large. In addition, since $\partial_{v}Q(t,v)>c$ uniformly over $\mathcal{T}\times[0,1]$ , for $y\in[y_{n},y_{n}-s_{n}(d_{n}(t_{n},v_{n}))]$ ,

[TABLE]

Then the r.h.s. of (C.2) is bounded from above by

[TABLE]

where $\tilde{y}\in(y_{n}-s_{n}d_{n}(t_{n},v_{n}),y_{n}-s_{n}(d_{n}(t_{n},v_{n})-\delta\delta_{n}))$ . Since $\delta$ is arbitrary, by letting $\delta\rightarrow 0$ , we obtain that

[TABLE]

Similarly, we can show that

[TABLE]

Therefore, we have proved that

[TABLE]

Since the above result holds for any sequence $(t_{n},y_{n})$ , by Lemma 1 of Chernozhukov et al. (2010), we have that, uniformly over $(t,y)\in\mathcal{TY}$ ,

[TABLE]

This completes the proof of the proposition.

Let $F(t,y)$ and $F^{\leftarrow}(t,u)$ be a monotonic function and its inverse w.r.t. $y$ , respectively. Next, we consider the linear expansion of the inverse functional:

[TABLE]

where $s_{n}\downarrow 0$ as the sample size $n\rightarrow\infty$ and $J_{n}(t,y)$ is some perturbation function.

Assumption 8

1. $F(t,y)$ has a compact support $\mathcal{TY}=\{(t,y):y=Q(t,v),(t,v)\in\mathcal{TV}:=\mathcal{T}\times\mathcal{V}\}$ . Denote $\mathcal{V}_{\varepsilon}$ , $\mathcal{TY}_{\varepsilon}$ , $\mathcal{Y}_{t\varepsilon}$ , and $\underline{y}_{t}$ as a compact subset of $\mathcal{V}$ , $\{(t,y):y=Q(t,v),(t,v)\in\mathcal{T}\times\mathcal{V}_{\varepsilon}\}$ , the projection of $\mathcal{TY}_{\varepsilon}$ on $T=t$ , and the lower bound of $\overline{(\mathcal{Y}_{\varepsilon t})^{\varepsilon}}$ , respectively. Then for any $t\in\mathcal{T}$ , $\underline{y}_{t}>-\infty$ and $\overline{(\mathcal{Y}_{\varepsilon t})^{\varepsilon}}\subset\mathcal{Y}_{t}$ .

2. $F(t,y)$ is monotonic and twice continuously differentiable w.r.t. $y$ . The first and second derivatives are denoted as $f(t,y)$ and $f^{\prime}(t,y)$ respectively. Then both $f(t,y)$ and $f^{\prime}(t,y)$ are bounded and $f(t,y)$ is also bounded away from zero, uniformly over $\mathcal{TY}$ .

3. Let $\mathcal{T}\mathcal{Y}\mathcal{Y}=\{(t,y,y^{\prime}):y=Q(t,v),y^{\prime}=Q(t,v^{\prime}),(t,v,v^{\prime})\in\mathcal{T}\times\mathcal{V}\times\mathcal{V}\}$ . Then, there exist two vanishing sequences $\varepsilon_{n}$ and $\delta_{n}$ such that

[TABLE]

Proposition C.2

If Assumption 8 holds, then

[TABLE]

uniformly over $(t,v)\in\mathcal{TV}_{\varepsilon}$ .
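The inverse expansion can also be checked on a toy example. The sketch below assumes the leading term of $(F+s_{n}J_{n})^{\leftarrow}(t,v)-F^{\leftarrow}(t,v)$ is $-s_{n}J_{n}(t,F^{\leftarrow}(t,v))/f(t,F^{\leftarrow}(t,v))$ , in line with the inverse-map result in Chernozhukov et al. (2010); the functions `F`, `f`, and `J`, the bisection helper `inverse`, and the constants are hypothetical, and $t$ is suppressed.

```python
import math

def F(y):
    return 0.5 * (y + y**2)           # increasing on [0, 1], F(0) = 0, F(1) = 1

def f(y):
    return 0.5 + y                    # derivative of F, bounded away from zero

def J(y):
    return math.sin(2.0 * y)          # smooth perturbation (assumption)

def inverse(G, v, lo=0.0, hi=1.0, iters=60):
    """Invert an increasing function G on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if G(mid) < v:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s = 1e-3                              # plays the role of s_n
v = 0.5
xi = inverse(F, v)                                   # F^{<-}(v)
xi_n = inverse(lambda y: F(y) + s * J(y), v)         # (F + s J)^{<-}(v)
scaled_diff = (xi_n - xi) / s
first_order = -J(xi) / f(xi)          # assumed leading term
```

Here $F^{\leftarrow}(0.5)$ solves $y^{2}+y-1=0$ , that is, $y=(\sqrt{5}-1)/2$ , which gives a closed-form check on the bisection step.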

Proof. Without loss of generality, we assume $F(t,y)$ is monotonically increasing in $y$ . Let $\xi(t,v)=F^{\leftarrow}(t,v)$ and $\xi_{n}(t,v)=(F+s_{n}J_{n})^{\leftarrow}(t,v).$ Since for $n$ sufficiently large, $\sup_{(t,v)\in\mathcal{TV}_{\varepsilon}}s_{n}|J_{n}^{\leftarrow}(t,v)|<\varepsilon$ and by the definition of $\mathcal{V}_{\varepsilon}$ , we can choose $\xi(t,v)\in\mathcal{Y}_{t}$ and $\xi_{n}(t,v)\in\mathcal{Y}_{t}$ . In addition, since $F$ is differentiable, we have $F(t,\xi(t,v))=v$ . Denote $\eta_{n}(t,v)=\min(s_{n}\delta_{n}^{2},\xi_{n}(t,v)-\underline{y}_{t})$ . Then, the definition of the inverse function implies that

[TABLE]

Since $f(t,y)$ is bounded uniformly in $(t,y)\in\mathcal{TY}$ , we have

[TABLE]

and

[TABLE]

Therefore, (C.3) implies that

[TABLE]

Since $f(t,y)$ is bounded and bounded away from zero, we have

[TABLE]

Then,

[TABLE]

where the supremum in the second line is taken over $(t,y,y^{\prime})\in\mathcal{T}\mathcal{Y}\mathcal{Y}$ , $|y-y^{\prime}|\leq\max(\varepsilon_{n},s_{n}\delta_{n})$ , and the third line is because $f^{\prime}(t,y)$ is bounded uniformly in $(t,y)\in\mathcal{TY}$ .

On the other hand, by (C.3),

[TABLE]

Therefore, we have

[TABLE]

Similarly, we can show that

[TABLE]

The r.h.s. of (C.3) implies that

[TABLE]

Therefore,

[TABLE]

(C.4) and (C.5) imply that

[TABLE]

uniformly over $(t,v)\in\mathcal{TV}$ .

Appendix D Additional Simulation Results

This section investigates the sensitivity of the bootstrap confidence intervals to the tuning parameters $h_{1}$ , $\tilde{\lambda}$ , and $\lambda$ , reports the finite sample performance of the oracle estimator and of the estimator for the mean potential outcome, and illustrates a limitation of our method.

D.1 Sensitivity Analysis

We check the sensitivity of our estimation method with respect to three tuning parameters: $h_{1}$ , $\tilde{\lambda}$ , and $\lambda$ . We focus on the first design in Section 5. Figures 12 and 13 show the coverage probabilities of $q_{\tau}(t)$ and $\beta_{\tau}^{1}(t)$ with $h_{1}^{\prime}=0.8h_{1}$ and $h_{1}^{\prime}=1.2h_{1}$ , respectively. Figures 14 and 15 show the coverage probabilities of $q_{\tau}(t)$ and $\beta_{\tau}^{1}(t)$ with $\tilde{\lambda}^{\prime}=0.8\tilde{\lambda}$ and $\tilde{\lambda}^{\prime}=1.2\tilde{\lambda}$ , respectively, where $\tilde{\lambda}$ is the penalty used to estimate the conditional density $f_{t}(X)$ . Last, Figures 16 and 17 show the coverage probabilities of $q_{\tau}(t)$ and $\beta_{\tau}^{1}(t)$ with $\lambda^{\prime}=0.8\lambda$ and $\lambda^{\prime}=1.2\lambda$ , respectively, where $\lambda$ is the penalty used to estimate the conditional CDF $\phi_{t,u}(X)$ . We observe that the coverage probabilities are in general not sensitive to the choice of tuning parameters.

D.2 Oracle Estimators

Next, we show the coverage probabilities for the oracle estimators in which the infinite-dimensional nuisance parameters are assumed to be known.

We see that the coverage rates for the oracle estimators are conservative, which is due to the way we construct the confidence intervals. However, we can also see that for some values of $t$ , the coverage rates are still very close to the nominal level 90% and most coverage rates do not exceed 95%.

D.3 The Mean of the Potential Outcome

We report the finite sample performance for the estimators for the mean of the potential outcome for $t\in[0.25,0.75]$ .

We observe that the estimators are quite accurate in terms of bias and variance. The coverage rates are reasonable for $t\in[0.25,0.75]$ in general. However, they are below the nominal rate $90\%$ when $t$ is close to $0.25$ and $0.75$ . Comparing with the oracle results reported below, we see that the drop of coverage rates is mainly due to the variable selection, which has a larger effect for $t$ that is closer to the boundary.

D.4 An Additional Design

Last, we consider a design that violates the approximate sparsity condition. The outcome and treatment equations are the same as (5.1) and (5.2), respectively. We let $\beta_{j}=\frac{\pi^{2}}{24}$ for $j=1,\cdots,10$ , $\beta_{j}=0$ , $j=11,\cdots,100$ , and $b(X)=X$ . In this case, $s=10$ . Recall that we have $nh_{1}\approxeq 47$ . However, our theory requires that $s/\sqrt{nh_{1}}\rightarrow 0$ . Such a condition is violated in this design.
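The size of the violation can be read off with a short calculation. The sketch below uses only the numbers stated above ($s=10$ , $nh_{1}\approx 47$ , and the coefficient values); the variable names are hypothetical.

```python
import math

p = 100
# beta_j = pi^2 / 24 for j = 1, ..., 10 and beta_j = 0 for j = 11, ..., 100.
beta = [math.pi**2 / 24 if j < 10 else 0.0 for j in range(p)]

s = sum(b != 0.0 for b in beta)       # number of nonzero coefficients
nh1 = 47.0                            # effective sample size n * h_1, as stated
ratio = s / math.sqrt(nh1)            # the theory requires this ratio to vanish
```

With these values the ratio is roughly $10/\sqrt{47}\approx 1.46$ , which is not small, so the requirement $s/\sqrt{nh_{1}}\rightarrow 0$ is far from being met in this design.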

We see that the coverage rates when $t=0.5$ are satisfactory. For $t=0.25$ and $t=0.75$ , the coverage rates are below the nominal 90%. On the other hand, the coverage rates for the oracle estimators reported below are quite good. This implies that the drop in coverage rates for our estimators is mainly due to the variable selection, which may have a larger effect when $t$ is away from the center. (Again, the cross-fitting technique promoted in Chernozhukov et al. (2018) may be helpful for eliminating the variable selection bias.)

Appendix E Additional Empirical Illustration Results

This section investigates the sensitivity of our empirical application results with respect to three tuning parameters: $h_{1}$ , $\tilde{\lambda}$ , and $\lambda$ . We use the same model and dataset as in Section 6. Figures 33-38 are about the white individuals, and Figures 39-44 are about the black individuals. The captions for these figures are the same as in Figures 10 and 11. Figures 33 and 34 show the estimation results for $q_{\tau}(t)$ and $\beta_{\tau}^{1}(t)$ with $h_{1}^{\prime}=0.8h_{1}$ and $h_{1}^{\prime}=1.2h_{1}$ , respectively. Figures 35 and 36 show the estimation results for $q_{\tau}(t)$ and $\beta_{\tau}^{1}(t)$ with $\tilde{\lambda}^{\prime}=0.8\tilde{\lambda}$ and $\tilde{\lambda}^{\prime}=1.2\tilde{\lambda}$ , respectively, where $\tilde{\lambda}$ is the penalty used to estimate the conditional density $f_{t}(X)$ . Last, Figures 37 and 38 show the estimation results for $q_{\tau}(t)$ and $\beta_{\tau}^{1}(t)$ with $\lambda^{\prime}=0.8\lambda$ and $\lambda^{\prime}=1.2\lambda$ , respectively, where $\lambda$ is the penalty used to estimate the conditional CDF $\phi_{t,u}(X)$ .

E.1 Sensitivity results for the white individuals

E.2 Sensitivity results for the black individuals

Bibliography


Altonji, J. G., Matzkin, R. L., 2005. Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica 73 (4), 1053–1102.
Athey, S., Imbens, G., 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113 (27), 7353–7360.
Begun, J. M., Hall, W., Huang, W.-M., Wellner, J. A., 1983. Information and asymptotic efficiency in parametric-nonparametric models. The Annals of Statistics 11 (2), 432–452.
Belloni, A., Chen, D., Chernozhukov, V., Hansen, C., 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 (6), 2369–2429.
Belloni, A., Chen, M., Chernozhukov, V., 2016. Quantile graphical models: prediction and conditional independence with applications to financial risk management. arXiv:1607.00286.
Belloni, A., Chernozhukov, V., 2011. $\ell_{1}$-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39 (1), 82–130.
Belloni, A., Chernozhukov, V., Chetverikov, D., Wei, Y., 2018a. Uniformly valid post-regularization confidence regions for many functional parameters in Z-estimation framework. The Annals of Statistics 46 (6B), 3643–3675.
Belloni, A., Chernozhukov, V., Fernández-Val, I., Hansen, C., 2017a. Program evaluation with high-dimensional data. Econometrica 85 (1), 233–298.