Non-separable Models with High-dimensional Data
Liangjun Su, Takuya Ura, Yichong Zhang

First draft: February, 2017. We are grateful to Alex Belloni, Xavier D'Haultfœuille, Michael Qingliang Fan, Bryan Graham, Yu-Chin Hsu, Yuya Sasaki, and seminar participants at Academia Sinica, Duke, Asian Meeting of the Econometric Society, China Meeting of the Econometric Society, and the 7th Shanghai Workshop of Econometrics. Su acknowledges the funding support provided by the Lee Kong Chian Fund for Excellence.
Liangjun Su
School of Economics, Singapore Management University, 90 Stamford Road, Singapore 178903. E-mail: [email protected].
Takuya Ura
Department of Economics, University of California, Davis. One Shields Avenue, Davis, CA 95616. E-mail: [email protected].
Yichong Zhang
School of Economics, Singapore Management University, 90 Stamford Road, Singapore 178903. E-mail: [email protected].
Abstract
This paper studies non-separable models with a continuous treatment when the dimension of the control variables is high and potentially larger than the effective sample size. We propose a three-step estimation procedure to estimate the average, quantile, and marginal treatment effects. In the first stage we estimate the conditional mean, distribution, and density objects by penalized local least squares, penalized local maximum likelihood estimation, and numerical differentiation, respectively, where control variables are selected via a localized method of L1-penalization at each value of the continuous treatment. In the second stage we estimate the average and marginal distribution of the potential outcome via the plug-in principle. In the third stage, we estimate the quantile and marginal treatment effects by inverting the estimated distribution function and using the local linear regression, respectively. We study the asymptotic properties of these estimators and propose a weighted-bootstrap method for inference. Using simulated and real datasets, we demonstrate that the proposed estimators perform well in finite samples.
Keywords: Average treatment effect, High dimension, Least absolute shrinkage and selection operator (Lasso), Nonparametric quantile regression, Nonseparable models, Quantile treatment effect, Unconditional average structural derivative
JEL codes: C21, J62
1 Introduction
Non-separable models without additivity appear frequently in econometric analyses, because economic theory motivates a nonlinear role of the unobserved individual heterogeneity (Altonji and Matzkin, 2005) and its multi-dimensionality (Browning and Carro, 2007; Carneiro et al., 2003; Cunha et al., 2010). A large fraction of the previous literature on non-separable models has used control variables to achieve the unconfoundedness condition (Rosenbaum and Rubin, 1983), that is, the conditional independence between a regressor of interest (or a treatment) and the unobserved individual heterogeneity given the control variables. Although including high-dimensional control variables makes unconfoundedness more plausible, it also makes estimation and inference more challenging. It remains an open question how to select control variables among potentially very many variables and conduct proper statistical inference for parameters of interest in non-separable models with a continuous treatment.
This paper proposes estimation and inference for unconditional parameters [1], including unconditional means of the potential outcomes, the unconditional cumulative distribution function, the unconditional quantile function, and the unconditional quantile partial derivative, in the presence of both a continuous treatment and high-dimensional covariates [2]. The proposed method estimates the parameters of interest in three stages. The first stage selects controls by the method of least absolute shrinkage and selection operator (Lasso) and predicts reduced-form parameters such as the conditional expectation and distribution of the outcome given the control variables and the treatment level, and the conditional density of the treatment given the control variables. We allow for different control variables to be selected at different values of the continuous treatment. The second stage recovers the average and the marginal distribution of the potential outcome by plugging the reduced-form parameters into doubly robust moment conditions. The last stage recovers the quantile of the potential outcome and its derivative with respect to the treatment by inverting the estimated distribution function and using the local linear regression, respectively. The inference is implemented via a weighted bootstrap without recalculating the first-stage variable selections, which saves considerable computation time.
Footnote 1: To be more specific, the parameters of interest are unconditional on covariates but conditional on the treatment level.
Footnote 2: We focus on unconditional parameters, in which (potentially high-dimensional) covariates are employed to achieve the unconfoundedness but the parameters of interest are unconditional on the covariates. Unconditional parameters are simple to display and the simplicity is crucial especially when the covariates are high dimensional. As emphasized in Frölich and Melly (2013) and Powell (2010), unconditional parameters have two additional attractive features. First, by definition, they capture all the individuals in the sample at the same time instead of investigating the underlying structure separately for each subgroup defined by the covariates. The treatment effect for the whole population is more policy-relevant. Second, an estimator for unconditional parameters can have better finite/large sample properties.
To motivate our parameters of interest, we relate our estimands (the population objects that our procedure aims to recover) to the structural outcome function. Notably, we extend Hoderlein and Mammen (2007) and Sasaki (2015) to demonstrate that the unconditional derivative of the quantile of the potential outcome with respect to the treatment is equal to the weighted average of the marginal effects over individuals with the same outcomes and treatments.
This paper contributes to two important strands of the econometric literature. The first is the literature on non-separable models with a continuous treatment, in which previous analyses have focused on a fixed and small number of control variables; see, e.g., Chesher (2003), Chernozhukov et al. (2007), Hoderlein and Mammen (2007), Imbens and Newey (2009), Matzkin (1994) and Matzkin (2003). The second is a growing literature on recovering the causal effect from high-dimensional data; see, e.g., Belloni et al. (2012), Belloni et al. (2014a), Chernozhukov et al. (2015a), Chernozhukov et al. (2015b), Farrell (2015), Athey and Imbens (2016), Chernozhukov et al. (2017), Belloni et al. (2014b), Wager and Athey (2018), Belloni et al. (2017a), and Belloni et al. (2017b). Our paper complements the previous works by studying both the variable selection and post-selection inference of causal parameters in a non-separable model with a continuous treatment. Recently, Cattaneo et al. (2016), Cattaneo et al. (2018a), and Cattaneo et al. (2018b) have considered the semiparametric estimation of the causal effect in a setting with many included covariates and proposed novel bias-correction methods to conduct valid inference. Compared with them, we deal with the fully nonparametric model with an ultra-high dimension of potential covariates, and rely on approximate sparsity to reduce dimensionality.
The treatment variable being continuous imposes difficulties in both variable selection and post-selection inference. To address the former, we use penalized local Maximum Likelihood and Least Squares estimations (hereafter, MLE and LS, respectively) to select control variables for each value of the continuous treatment. The penalized local LS was previously studied by Kong et al. (2015) and Lee and Mammen (2016) [3]. The local MLE complements the LS method by estimating a nonlinear and high-dimensional model with varying coefficients indexed by not only the continuous treatment variable but also a location variable. Our approach directly extends the distribution regression proposed in Chernozhukov et al. (2013) to the high-dimensional varying coefficient setting. Because we rely on kernel smoothing, we require a different penalty loading than the traditional Lasso method. Chu et al. (2011) and Ning and Liu (2017) develop general theories of estimation, inference, and hypothesis testing of penalized (pseudo) MLE. We complement their results by considering the local likelihood with an L1 penalty term. Belloni et al. (2018a) construct uniformly valid confidence bands for the Z-estimators of unconditional moment equalities. Our results are not covered by theirs either, as our parameters are defined based on conditional moment equalities. To prove the statistical properties of the penalized local MLE, we establish a local version of the compatibility condition (Bühlmann and van de Geer, 2011), which itself is new to the best of our knowledge.
Footnote 3: We thank the referee for the reference.
For the post-selection inference, we establish doubly robust moment conditions for the continuous treatment effect model. Our parameters of interest are irregularly identified by the definition in Khan and Tamer (2010), as they are identified by a thin set. Therefore, because we average observations only when their treatment levels are close to the one of interest, the convergence rates of our estimators are nonparametric, in contrast with the √n-rate obtained in Belloni et al. (2017a) and Farrell (2015). Albeit motivated by distinct models, Belloni et al. (2016) also estimate irregularly identified parameters in the high-dimensional setting. However, the irregularity faced by Belloni et al. (2016) is not due to the continuity of the variable of interest. Consequently, Belloni et al. (2016) do not study the regularized estimator with localization as we do in this paper.
Estimation based on doubly robust moments is also related to the literature on semiparametric efficiency. The idea of doubly robust estimation can be traced back to the nonparametric efficiency theory for functional estimation developed by Begun et al. (1983), Pfanzagl (1990), Bickel et al. (1993), and Newey (1994). Robins and Rotnitzky (2001) and van der Laan and Robins (2003) study the semiparametric doubly robust estimators by modeling both the treatment and outcome processes. van der Laan and Dudoit (2003) allow for nonparametric modeling in causal inference problems. When both processes are nonparametrically estimated, the doubly robust methods can achieve faster rates of convergence than their nuisance estimators, making the estimator less sensitive to the curse of dimensionality and model selection bias. Their use in causal inference is also considered by Robins and Rotnitzky (1995), Hahn (1998), van der Laan and Robins (2003), Hirano et al. (2003), van der Laan and Rubin (2006), Firpo (2007), Tsiatis (2007), van der Laan and Rose (2011), Kennedy et al. (2017), and Robins et al. (2017), among others.
Among the works above, our paper is most closely related to Kennedy et al. (2017), who consider the doubly robust estimation for the average treatment effect when the treatment variable is continuous. Our paper complements theirs in four aspects. First, the estimation procedures are different. Kennedy et al. (2017) first estimate the efficient influence function for the weighted average of the mean effect over all treatment levels, and then use kernel smoothing to estimate the mean effect at each treatment level. In contrast, we directly consider the doubly robust moment for the parameters of interest. Second, Kennedy et al. (2017) mainly focus on the mean effect, while we also consider quantile and marginal treatment effects. We obtain linear expansions for our estimators uniformly over both the quantile index and the treatment variable. Third, Kennedy et al. (2017) do not construct detailed estimators of their nuisance parameters, but instead impose high-level assumptions. Verifying such high-level assumptions in the high-dimensional setting is nontrivial. In contrast, we provide valid estimators for our nuisance parameters via both regularization and localization, and derive their statistical properties. Fourth, we take into account the fact that the dimension of covariates may increase with the sample size so that the complexity of our nuisance parameter estimator measured by the uniform entropy will diverge to infinity. Such a situation is ruled out by Kennedy et al. (2017).
To obtain uniformly valid results over values of the continuous treatment, we derive linear expansions of the rearrangement operator for a local process which is not tight, extending the existing results in Chernozhukov et al. (2010).
We study the finite sample performance of our estimation procedure via Monte Carlo simulations and an empirical application. The simulations suggest that the proposed estimators perform reasonably well in finite samples. In the empirical exercise, we estimate the distributional effect of parental income on son's income and intergenerational elasticity using the 1979 National Longitudinal Survey of Youth (NLSY79). We control for a large dimension of demographic variables. The quantiles of son's potential income are in general upward sloping with respect to parental income, but for the subsample of blacks, the intergenerational elasticities are not statistically significant.
The rest of this paper is organized as follows. Section 2 presents the model and the parameters of interest. Section 3 proposes an estimation method in the presence of high-dimensional covariates. Section 4 demonstrates the validity of a bootstrap inference procedure. Section 5 presents Monte Carlo simulations. Section 6 illustrates the proposed estimator using NLSY79. Section 7 concludes. Proofs of the main theorems and Lemma 3.1 are reported in the appendix. Proofs of the rest of the lemmas are collected in an online supplement.
Throughout this paper, we adopt the convention that the capital letters, such as $A$, $X$, $Y$, denote random elements while their corresponding lower cases denote realizations. $C$ denotes an arbitrary positive constant that may not be the same in different contexts. For a sequence of random variables $\{U_n\}_{n=1}^{\infty}$ and a random variable $U$, $U_n \rightsquigarrow U$ indicates weak convergence in the sense of van der Vaart and Wellner (1996). When $U_n$ and $U$ are $k$-dimensional elements, the space of the sample path is $\Re^k$ equipped with Euclidean norm. When $U_n$ and $U$ are stochastic processes, the space of sample path is $\ell^{\infty}(\{v \in \Re^k : |v| \le B\})$ for some positive $B$ equipped with sup norm. The letters $\mathbb{P}_n$, $\mathbb{P}$, and $\mathbb{U}_n$ denote the empirical process, expectation, and U-process, respectively. In particular, $\mathbb{P}_n$ assigns probability $1/n$ to each observation and $\mathbb{U}_n$ assigns probability $1/(n(n-1))$ to each pair of observations. $\mathbb{E}$ also denotes expectation. We use $\mathbb{P}f$ and $\mathbb{E}f$ exchangeably. For any positive (random) sequence $(u_n, v_n)$, if there exists a positive constant $C$ independent of $n$ such that $u_n \le C v_n$, then we write $u_n \lesssim v_n$. $\|\cdot\|_{Q,q}$ denotes the $L^q$ norm under measure $Q$, where $q = 1, 2, \infty$. If measure $Q$ is omitted, the underlying measure is assumed to be the counting measure. For any vector $\theta$, $\|\theta\|_0$ denotes the number of its nonzero coordinates. $\mathrm{Supp}(\theta)$, the support of a $p$-dimensional vector $\theta$, is defined as $\{j : \theta_j \neq 0\}$. For $S \subset \{1, \dots, p\}$, let $|S|$ be the cardinality of $S$, $S^c$ be the complement of $S$, and $\theta_S$ be the vector in $\Re^p$ that has the same coordinates as $\theta$ on $S$ and zero coordinates on $S^c$. Last, let $a \vee b = \max(a, b)$.
2 Model and Parameters of Interest
Econometricians observe an outcome $Y$, a continuous treatment $T$, and a set of covariates $X$, which may be high-dimensional. They are connected by a measurable function $\Gamma(\cdot)$, i.e.,
$$Y = \Gamma(T, X, A),$$
where $A$ is an unobservable random vector that may not be weakly separable from the observables $(T, X)$, and $\Gamma$ may not be monotone in either $T$ or $A$.
Let $Y(t) = \Gamma(t, X, A)$. We are interested in the average $E Y(t)$, the marginal distribution $P(Y(t) \le u)$ for some $u \in \Re$, and the quantile $q_\tau(t)$, where we denote $q_\tau(t)$ as the $\tau$-th quantile of $Y(t)$ for some $\tau \in (0, 1)$. We are also interested in the causal effect of moving $T$ from $t$ to $t'$, i.e., $E(Y(t) - Y(t'))$ and $q_\tau(t) - q_\tau(t')$. Last, we are interested in the average marginal effect $E[\partial_t \Gamma(t, X, A)]$ and the quantile partial derivative $\partial_t q_\tau(t)$. Next, we specify conditions under which the above parameters are identified.
Assumption 1
The random variables $A$ and $T$ are conditionally independent given $X$.
Assumption 1 is known as the unconfoundedness condition, which is commonly assumed in the treatment effect literature. See Cattaneo (2010), Cattaneo and Farrell (2011), Hirano et al. (2003) and Firpo (2007) for the case of discrete treatment and Graham et al. (2014), Galvao and Wang (2015), and Hirano and Imbens (2004) for the case of continuous treatment. It is also called the conditional independence assumption in Hoderlein and Mammen (2007), which is weaker than the full joint independence between $A$ and $(T, X)$. Note that $X$ can be arbitrarily correlated with the unobservables $A$. This assumption is more plausible when we control for sufficiently many and potentially high-dimensional covariates.
Theorem 2.1
Suppose Assumption 1 holds and is differentiable in its first argument. Then the marginal distribution of and the average marginal effect are identified. In addition, if Assumption 6 in the Appendix holds and is continuously distributed, then , where, for denoting the joint density of , is the probability measure on with density , where
[TABLE]
Several comments are in order. First, because the marginal distribution of is identified, so are its average, quantile, average marginal effect, and quantile partial derivative. As pointed out by Imbens and Newey (2009), a non-separable outcome with a general disturbance is equivalent to treatment effect models. Therefore, we can view as the potential outcome. Under unconfoundedness, the identification of the marginal distribution of the potential outcome with a continuous treatment has already been established in Hirano and Imbens (2004) and Galvao and Wang (2015). The first part of Theorem 2.1 just restates their results. Second, the second result indicates that the partial quantile derivative identifies the weighted average marginal effect for the subpopulation with the same potential outcome, i.e., The result is closely related to, but different from, Sasaki (2015). We consider the unconditional quantile of , whereas he considered the conditional quantile of given . Note that is not the average of the conditional quantile of given . Third, we require to be continuous just for the simplicity of derivation. If some elements of are discrete, a similar result can be established in a conceptually straightforward manner by focusing on the continuous covariates within samples homogeneous in the discrete covariates, at the expense of additional notation. Finally, we do not require to be continuous when establishing the estimation and inference results below.
3 Estimation
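Let $f_t(X)$ denote the conditional density of $T$ evaluated at $t$ given $X$ and $\delta_t(\cdot)$ denote the Dirac function such that for any function $g(\cdot)$,
$$\int g(s)\,\delta_t(s)\,ds = g(t).$$
In addition, let $Y_u = 1\{Y \le u\}$ and $Y_u(t) = 1\{Y(t) \le u\}$ for some $u \in \Re$. Then $E(Y(t))$ and $E(Y_u(t))$ can be identified by the method of generalized propensity score as proposed in Hirano and Imbens (2004), i.e.,
$$E(Y(t)) = E\left[\frac{Y\,\delta_t(T)}{f_t(X)}\right] \quad \text{and} \quad E(Y_u(t)) = E\left[\frac{Y_u\,\delta_t(T)}{f_t(X)}\right]. \tag{3.1}$$
There is a direct analogy between (3.1) for the continuous treatment and $E(Y(t)) = E\left[Y\,1\{T = t\}/P(T = t \mid X)\right]$ when the treatment is discrete: the indicator function shrinks to a Dirac function and the propensity score is replaced by the conditional density. Following this analogy, Hirano and Imbens (2004) called $f_t(X)$ the generalized propensity.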
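The Dirac weight in the generalized propensity score moment cannot be evaluated in a sample, so a natural sample analogue replaces it with a kernel weight. The following is a minimal sketch under assumptions that are not in the text: a Gaussian kernel, a conditional density of the treatment that is known rather than estimated, a simulated design in which the mean potential outcome at t equals t, and the hypothetical names `gauss_kernel` and `gps_mean`.

```python
import math
import random

def gauss_kernel(v):
    """Standard normal density, used here as the smoothing kernel (an assumption)."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def gps_mean(y, t_obs, dens_at_t, t, h):
    """Kernel analogue of E[Y * delta_t(T) / f_t(X)].

    dens_at_t[i] is the conditional density of T at t given X_i. It is treated
    as known in this sketch; in practice it would come from a first-stage fit.
    """
    n = len(y)
    total = 0.0
    for yi, ti, fi in zip(y, t_obs, dens_at_t):
        total += yi * gauss_kernel((ti - t) / h) / (h * fi)
    return total / n

# Simulated design (hypothetical): X ~ U(-1, 1), T | X ~ N(X, 1),
# Y = T + X + noise, so the mean potential outcome at t is t + E[X] = t.
random.seed(0)
n, t0, h = 20000, 0.5, 0.2
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
t_obs = [xi + random.gauss(0.0, 1.0) for xi in x]
y = [ti + xi + random.gauss(0.0, 1.0) for ti, xi in zip(t_obs, x)]
dens = [gauss_kernel(t0 - xi) for xi in x]  # N(x, 1) density evaluated at t0
est = gps_mean(y, t_obs, dens, t0, h)
print(round(est, 3))
```

The bounded support of X keeps the conditional density away from zero, which mirrors the overlap requirement; with an unbounded X the inverse-density weights in this design would have infinite variance.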
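Belloni et al. (2017a) and Farrell (2015) considered the model with a discrete treatment and high-dimensional control variables, and proposed to use the doubly robust moment for inference. Following their lead, we propose the corresponding doubly robust moment when the treatment status is continuous. Let $\nu_t(X) = E(Y \mid X, T = t)$ and $\phi_{t,u}(X) = E(Y_u \mid X, T = t)$, then
$$E(Y(t)) = E\left[\frac{\left(Y - \nu_t(X)\right)\delta_t(T)}{f_t(X)} + \nu_t(X)\right] \tag{3.2}$$
and
$$E(Y_u(t)) = E\left[\frac{\left(Y_u - \phi_{t,u}(X)\right)\delta_t(T)}{f_t(X)} + \phi_{t,u}(X)\right]. \tag{3.3}$$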
We propose the following three-stage procedure to estimate , , , and :
1. Estimate , , and by , and , respectively, using the first-stage bandwidth .
2. Estimate and by
[TABLE]
and
[TABLE]
where and are a kernel function and the second-stage bandwidth, respectively. Then rearrange to obtain , which is monotone in .
3. Estimate by inverting with respect to (w.r.t.) , i.e., estimate by , which is the estimator of the slope coefficient in the local linear regression of on ; estimate by , which is the estimator of the slope coefficient in the local linear regression of on .
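The second and third stages can be sketched as follows. This is an illustration under stated assumptions rather than the exact estimator: the first-stage objects are replaced by their true values in a simulated design, the kernel is Gaussian, the rearrangement is implemented by sorting the estimated distribution values on an equally spaced grid, and the names `dr_cdf`, `rearrange`, and `invert_cdf` are hypothetical.

```python
import math
import random

def gauss_kernel(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def norm_cdf(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def dr_cdf(y, weights, phi_u, u):
    """Doubly robust plug-in estimate of P(Y(t) <= u) at a single u.

    weights[i] = K((T_i - t) / h) / (h * f_t(X_i)); phi_u[i] stands for the
    first-stage estimate of P(Y <= u | X_i, T = t).
    """
    total = 0.0
    for yi, wi, pi in zip(y, weights, phi_u):
        total += ((1.0 if yi <= u else 0.0) - pi) * wi + pi
    return total / len(y)

def rearrange(values):
    """Monotone rearrangement on a grid: sort the values in ascending order."""
    return sorted(values)

def invert_cdf(grid, cdf_vals, tau):
    """Smallest grid point at which the monotone CDF estimate reaches tau."""
    for u, a in zip(grid, cdf_vals):
        if a >= tau:
            return u
    return grid[-1]

# Simulated design (hypothetical): X ~ U(-1, 1), T | X ~ N(X, 1),
# Y = T + X + e with e ~ N(0, 1), so the median of Y(t) is t by symmetry.
random.seed(1)
n, t0, h = 4000, 0.5, 0.3
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
t_obs = [xi + random.gauss(0.0, 1.0) for xi in x]
y = [ti + xi + random.gauss(0.0, 1.0) for ti, xi in zip(t_obs, x)]
# Oracle nuisance functions are plugged in where estimates would normally go.
weights = [gauss_kernel((ti - t0) / h) / (h * gauss_kernel(t0 - xi))
           for ti, xi in zip(t_obs, x)]
grid = [-3.0 + 0.1 * j for j in range(71)]
raw = [dr_cdf(y, weights, [norm_cdf(u - t0 - xi) for xi in x], u) for u in grid]
cdf_hat = rearrange(raw)                      # monotone in u after sorting
median_hat = invert_cdf(grid, cdf_hat, 0.5)   # third stage: invert the CDF
print(median_hat)
```

The raw plug-in values need not be monotone in u in finite samples, which is why the inversion is applied to the rearranged values. The derivative estimates via local linear regression are not covered by this sketch.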
3.1 The First Stage Estimation
In this section, we define the first stage estimators and derive their asymptotic properties. Since , , and are local parameters w.r.t. , in addition to using an L1 penalty to select relevant covariates, we rely on a kernel function to implement the localization. In particular, we propose to estimate , , and by a penalized local LS, a penalized local MLE, and numerical differentiation, respectively.
3.1.1 Penalized Local LS and MLE
Recall and where . We approximate and by and , respectively, where is the logistic CDF and is a vector of basis functions with potentially large . In the case of high-dimensional covariates, is just , while in the case of nonparametric sieve estimation, is a series of bases of . The approximation errors for and are given by and respectively.
Note that we only approximate and by a linear regression and a logistic regression, respectively, with the approximation errors satisfying Assumption 2 below. Assumption 2 below puts a sparsity structure on and so that the number of effective covariates that can affect them is much smaller than . If the effective covariates are a few discrete variables that have a few categories, then we can saturate the regressions by low-dimensional dummy variables so that there is no approximation error. If some of the effective covariates are continuous, then we can include sieve bases in the linear regression so that the approximation error can still satisfy Assumption 2. One possible scenario in which the approximate sparsity condition may fail is when there is a substantial number of discrete variables that are all on the same footing (e.g., job occupation dummies). In this case, it is hard to define a sparse approximation [4]. Last, the coefficients and are both functional parameters that can vary with their indexes. This provides additional flexibility of our setup against misspecification.
Footnote 4: We thank the Associate Editor for this point.
We estimate and by and , respectively, where
[TABLE]
[TABLE]
denotes the L1 norm, is the first-stage bandwidth, for some slowly diverging sequence , and . Our penalty term is different from the one used in Belloni et al. (2017a) and Belloni et al. (2018b), i.e., , where is some user-supplied constant, and is the standard normal CDF. Belloni et al. (2017a) suggest , which implies that
[TABLE]
Therefore, our penalty term is of the same order of magnitude as if is replaced with and is removed. We need to use in our penalty due to the presence of the kernel function in our estimation procedure. In particular, the effective sample size is of the same order as [5]. We will specify the order of magnitude of in Assumption 2. The role played by in our penalty is similar to that of in , which is to control the selection error uniformly. We refer readers to Belloni et al. (2017a, Equation (6.4)) for a more detailed discussion on this point. Since we do not use the advanced technique of self-normalized process as in Belloni et al. (2017a), we multiply the sequence with while in , is additive to inside the square root. We propose a rule-of-thumb in Section 5 and study the sensitivity of our inference method against the choice of in Section D of the supplementary material.
Footnote 5: Note that and are of the same order of magnitude.
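To make the penalized local LS concrete, the sketch below solves a kernel-weighted Lasso at one treatment value by coordinate descent. It is an illustration only. The objective written in the docstring is an assumed normalization, not necessarily the one in (3.4); the Gaussian kernel, the unit penalty loadings, the penalty level `lam=12.0`, and the names `soft_threshold` and `local_lasso` are likewise assumptions.

```python
import math
import random

def soft_threshold(z, c):
    """Soft-thresholding operator: sign(z) * max(|z| - c, 0)."""
    if z > c:
        return z - c
    if z < -c:
        return z + c
    return 0.0

def local_lasso(y, b, t_obs, t, h, lam, loadings=None, n_iter=50):
    """Coordinate descent for the assumed objective
        (1/2n) sum_i K((T_i - t)/h) (Y_i - b_i'beta)^2 + (lam/n) sum_j psi_j |beta_j|.
    """
    n, p = len(y), len(b[0])
    psi = loadings if loadings is not None else [1.0] * p
    c = math.sqrt(2.0 * math.pi)
    k = [math.exp(-0.5 * ((ti - t) / h) ** 2) / c for ti in t_obs]
    z = [sum(k[i] * b[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    resid = list(y)  # residuals at the current beta (beta starts at zero)
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue
            # Weighted correlation with the partial residual that excludes column j.
            rho = sum(k[i] * b[i][j] * resid[i] for i in range(n)) / n + z[j] * beta[j]
            new = soft_threshold(rho, lam * psi[j] / n) / z[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * b[i][j]
                beta[j] = new
    return beta

# Simulated design (hypothetical): the coefficient on the first column varies
# with the treatment, Y = (1 + T) b_1 - b_2 + noise, and 18 columns are irrelevant.
random.seed(2)
n, p, t0, h = 400, 20, 0.5, 0.2
t_obs = [random.random() for _ in range(n)]
b = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [(1.0 + t_obs[i]) * b[i][0] - b[i][1] + 0.5 * random.gauss(0.0, 1.0)
     for i in range(n)]
beta = local_lasso(y, b, t_obs, t0, h, lam=12.0)
print([round(v, 2) for v in beta[:4]])
```

Because the kernel discards observations far from t, the curvature terms `z[j]` are of order h rather than order one, which is one way to see why the penalty level has to be scaled to the effective sample size instead of n.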
In (3.4) and (3.5), and are generic penalty loading matrices. The infeasible loading matrices we would like to use are and in which
[TABLE]
and
[TABLE]
respectively. Since and are not known, we follow Belloni et al. (2017a) and propose an iterative algorithm to obtain the feasible versions of the loading matrices. The statistical properties of the feasible loading matrices are summarized in Lemma A.8 in the Appendix.
Algorithm 3.1
1. Let and , where and . Using and , we can compute and by (3.4) and (3.5). Let and for
2. For for some fixed positive integer , we compute and where
[TABLE]
and
[TABLE]
Using and , we can compute and by (3.4) and (3.5). Let and for The final penalty loading matrices and will be used for and in (3.4) and (3.5).
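The iteration in Algorithm 3.1 can be sketched for the LS part as below. The loading formula, the Gaussian kernel, the normalization of the objective, the multiplier 1.5 in the penalty level, and all function names are assumptions made for illustration, since the exact expressions are not reproduced here. The initial round uses the outcome in place of the unknown error, and later rounds use residuals from the previous fit.

```python
import math
import random

def kernel_weights(t_obs, t, h):
    """Gaussian kernel weights K((T_i - t) / h) (kernel choice is an assumption)."""
    c = math.sqrt(2.0 * math.pi)
    return [math.exp(-0.5 * ((ti - t) / h) ** 2) / c for ti in t_obs]

def loadings(resid, b, k, h):
    """Assumed form: psi_j = sqrt( (1/n) sum_i (resid_i * b_ij * K_i)^2 / h )."""
    n, p = len(b), len(b[0])
    return [math.sqrt(sum((resid[i] * b[i][j] * k[i]) ** 2 for i in range(n)) / (n * h))
            for j in range(p)]

def weighted_lasso(y, b, k, lam, psi, n_iter=50):
    """Coordinate descent for (1/2n) sum_i K_i (y_i - b_i'beta)^2 + (lam/n) sum_j psi_j |beta_j|."""
    n, p = len(y), len(b[0])
    z = [sum(k[i] * b[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta, resid = [0.0] * p, list(y)
    for _ in range(n_iter):
        for j in range(p):
            rho = sum(k[i] * b[i][j] * resid[i] for i in range(n)) / n + z[j] * beta[j]
            cut = lam * psi[j] / n
            new = math.copysign(max(abs(rho) - cut, 0.0), rho) / z[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * b[i][j]
                beta[j] = new
    return beta

def iterate_loadings(y, b, t_obs, t, h, lam, n_rounds=2):
    """Alternate between fitting the local Lasso and refreshing the loadings."""
    k = kernel_weights(t_obs, t, h)
    psi = loadings(y, b, k, h)            # step 1: outcome stands in for the error
    history = [psi]
    beta = weighted_lasso(y, b, k, lam, psi)
    for _ in range(n_rounds):             # step 2: loadings from current residuals
        resid = [yi - sum(c * v for c, v in zip(beta, bi)) for yi, bi in zip(y, b)]
        psi = loadings(resid, b, k, h)
        history.append(psi)
        beta = weighted_lasso(y, b, k, lam, psi)
    return beta, history

# Simulated design (hypothetical), as in a varying-coefficient model.
random.seed(3)
n, p, t0, h = 400, 20, 0.5, 0.2
t_obs = [random.random() for _ in range(n)]
b = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [(1.0 + ti) * bi[0] - bi[1] + 0.5 * random.gauss(0.0, 1.0)
     for ti, bi in zip(t_obs, b)]
lam = 1.5 * math.sqrt(n * h * math.log(max(p, n)))  # hypothetical multiplier 1.5
beta, history = iterate_loadings(y, b, t_obs, t0, h, lam)
print(round(sum(history[0]), 2), round(sum(history[-1]), 2))
```

In this design the refreshed loadings are smaller than the initial ones because residuals have less variation than the outcome, so the effective penalty is relaxed across rounds and the shrinkage of the relevant coefficients is reduced.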
Let and contain the supports of and , respectively, such that , and . For each where and are compact subsets of the supports of and , respectively, the post-Lasso estimator of and based on the set of covariates and are defined as
[TABLE]
and
[TABLE]
The post-Lasso estimators of and are given by and , respectively.
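The post-Lasso step for the LS part amounts to an unpenalized kernel-weighted least squares fit restricted to the selected columns. A minimal sketch follows, with a Gaussian kernel, a hard-coded hypothetical support in place of an actual first-stage selection, and the hypothetical names `solve` and `post_lasso_ls`. The logistic (MLE) counterpart is not shown.

```python
import math
import random

def solve(a, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    m = len(rhs)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, m):
            f = aug[r][c] / aug[c][c]
            for q in range(c, m + 1):
                aug[r][q] -= f * aug[c][q]
    sol = [0.0] * m
    for c in reversed(range(m)):
        s = aug[c][m] - sum(aug[c][q] * sol[q] for q in range(c + 1, m))
        sol[c] = s / aug[c][c]
    return sol

def post_lasso_ls(y, b, k, support, p):
    """Kernel-weighted least squares on the selected columns; zeros elsewhere."""
    s = list(support)
    gram = [[sum(ki * bi[u] * bi[v] for ki, bi in zip(k, b)) for v in s] for u in s]
    rhs = [sum(ki * bi[u] * yi for ki, bi, yi in zip(k, b, y)) for u in s]
    coef = solve(gram, rhs)
    beta = [0.0] * p
    for j, c in zip(s, coef):
        beta[j] = c
    return beta

# Simulated design (hypothetical): Y = (1 + T) b_1 - b_2 + noise.
random.seed(4)
n, p, t0, h = 400, 10, 0.5, 0.2
t_obs = [random.random() for _ in range(n)]
b = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [(1.0 + ti) * bi[0] - bi[1] + 0.5 * random.gauss(0.0, 1.0)
     for ti, bi in zip(t_obs, b)]
k = [math.exp(-0.5 * ((ti - t0) / h) ** 2) for ti in t_obs]  # Gaussian kernel up to a constant
support = [0, 1, 7]  # hypothetical selection: two relevant columns and one spurious one
beta_post = post_lasso_ls(y, b, k, support, p)
print([round(v, 2) for v in beta_post])
```

Refitting without the penalty removes the shrinkage that the Lasso imposes on the selected coefficients, at the cost of depending on which columns were selected.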
3.1.2 Conditional Density Estimation
Following Belloni et al. (2018b), we propose to first estimate , the conditional CDF of given , by the (logistic) distributional lasso regression studied in Belloni et al. (2017a) and then take the numerical derivative. Following Belloni et al. (2017a), we approximate by a logistic CDF and the approximation error is denoted as . We estimate by , which is computed as
[TABLE]
where is the logistic likelihood as defined previously, the penalty
[TABLE]
is slightly modified from but of the same order of magnitude as used in Belloni et al. (2017a) and Belloni et al. (2018b), for some specified in Section 5, and the penalty loading is estimated in Algorithm 3.2 below, which is also due to Belloni et al. (2017a):
Algorithm 3.2
1. Let where . Using , we can compute and by the (logistic) distributional lasso regression.
2. For , we compute where
[TABLE]
Using , we can compute and by the (logistic) distributional lasso regression. The final penalty loading matrix will be used as in (3.6).
Then, , the conditional density of given is computed as
[TABLE]
where is the first-stage bandwidth.
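The density step can be illustrated by fitting an L1-penalized logistic regression to the indicator that the treatment is below a cutoff, at two cutoffs around t, and differencing the fitted conditional CDFs. This is a sketch under assumptions: a symmetric difference quotient over a window of half-width h, a proximal gradient (ISTA) solver with an unpenalized intercept and uniform penalty loadings, a hypothetical penalty level `lam=0.01`, and the hypothetical names `l1_logit` and `cond_density`.

```python
import math
import random

def sigmoid(v):
    """Numerically stable logistic CDF."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def l1_logit(d, x, lam, n_iter=150, step=1.0):
    """Proximal gradient (ISTA) for the L1-penalized mean logistic loss.

    d[i] is 0 or 1 and x[i] is a list of covariates. The intercept is not penalized.
    """
    n, p = len(d), len(x[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for di, xi in zip(d, x):
            r = sigmoid(b0 + sum(c * v for c, v in zip(beta, xi))) - di
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        b0 -= step * g0 / n
        for j in range(p):
            z = beta[j] - step * g[j] / n
            beta[j] = math.copysign(max(abs(z) - step * lam, 0.0), z)
    return b0, beta

def cond_density(t_obs, x, t, h, x_new, lam):
    """Difference quotient [F_hat(t + h | x_new) - F_hat(t - h | x_new)] / (2h)."""
    cdf = []
    for cut in (t + h, t - h):
        d = [1.0 if ti <= cut else 0.0 for ti in t_obs]
        b0, beta = l1_logit(d, x, lam)
        cdf.append(sigmoid(b0 + sum(c * v for c, v in zip(beta, x_new))))
    return (cdf[0] - cdf[1]) / (2.0 * h)

# Simulated design (hypothetical): T = X_1 + logistic noise, so the conditional
# CDF of T is exactly logistic and the true density at (t, x) = (0, 0) is 0.25.
random.seed(5)
n, p = 1000, 4
x = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
t_obs = []
for xi in x:
    u = min(max(random.random(), 1e-12), 1.0 - 1e-12)
    t_obs.append(xi[0] + math.log(u / (1.0 - u)))
f_hat = cond_density(t_obs, x, t=0.0, h=0.5, x_new=[0.0] * p, lam=0.01)
print(round(f_hat, 3))
```

The two logistic fits are computed separately at each cutoff, so the selected covariates may differ between them, in the same way that the selection is allowed to vary with the treatment level.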
3.1.3 Asymptotic Properties of the First Stage Estimators
To study the asymptotic properties of the first stage estimators, we need some assumptions.
Assumption 2
Let be a compact subset of the support of and be the support of .
1. The sample is i.i.d.
2. and
3. for some which possibly depends on the sample size .
4. and
[TABLE]
5. and
[TABLE]
6. is second-order differentiable w.r.t. with bounded derivatives uniformly over , where is a compact subset of the support of and is the support of .
7. ,
Assumption 2.1 is common for cross-sectional observations. Assumption 2.2 is the same as Assumption 6.1(a) in Belloni et al. (2017a). Assumption 2.3 requires that , , and are approximately sparse, i.e., they can be well-approximated by using at most elements of . This approximate sparsity condition is common in the literature on high-dimensional data (see, e.g., Belloni et al. (2017a)). Assumptions 2.4 and 2.5 specify how good the approximations are in terms of and norms. The exact rate for follows Belloni et al. (2017a). The rates for and are different from that for because their approximations are local in . If the models for , , and are correctly specified and exactly sparse, i.e., the coefficients for all but regressors are zero, then there are no approximation errors. This implies , , and equal to zero so that Assumptions 2.4 and 2.5 hold automatically. In the sieve estimation, is finite dimensional and is just a sequence of sieve bases of . Then , , and are the sieve approximation bias. Assumptions 2.3 and 2.4 can be verified under some smoothness conditions (see, e.g., Chen (2007)). Therefore, Assumptions 2.4 and 2.5 are in spirit close to the smoothness condition. Assumption 2.6 is the smoothness of the true density, which is needed for the theoretical analysis of the numerical derivative. Because need not be the whole support of , this condition is plausible. In a simple case, if , is bounded uniformly over , and is independent of and logistically distributed, then this condition holds. Assumption 2.7 imposes conditions on the rates at which , , and grow with sample size . It ensures that the first stage nuisance parameters are estimated with sufficient accuracy. In particular, we require . Compared with the condition that imposed in Belloni et al. (2017a), our condition reflects the local nature of our estimation procedure in the sense that our effective sample size is of order of magnitude .
Assumption 3
1. is a symmetric probability density function (PDF) with
[TABLE]
There exists a positive constant such that for
2. There exists some positive constant such that uniformly over .
3. and are three times differentiable w.r.t. , with all three derivatives being bounded uniformly over
4. For the same as above, uniformly over .
Assumption 3.1 holds for many kernel functions, e.g., uniform and Gaussian kernels. Since was referred to as the generalized propensity by Hirano and Imbens (2004), Assumption 3.2 is analogous to the overlapping support condition commonly assumed in the treatment effect literature; see, e.g., Hirano et al. (2003) and Firpo (2007). Since the conditional density also has the sparsity structure as assumed in Assumption 2, at most members of 's affect the conditional density, which makes Assumption 3.2 more plausible. Assumption 3.3 imposes some smoothness conditions that are widely assumed in the nonparametric kernel literature. Assumption 3.4 holds if is compact.
Assumption 4
There exists a sequence such that, with probability approaching one,
[TABLE]
Assumption 4 is the restricted eigenvalue condition commonly assumed in the high-dimensional data literature. Based on Bickel et al. (2009),
[TABLE]
are the minimal and maximal eigenvalues of Gram submatrices formed by any components of . Because , the matrix is not invertible. However, because , Assumption 4 implies that the Gram submatrices can still be invertible. We refer interested readers to Bickel et al. (2009) for more details and Bühlmann and van de Geer (2011) for a textbook treatment.
Since there is a kernel in the Lasso objective functions in (3.4) and (3.5), the asymptotic properties of and cannot be established by directly applying the results in Belloni et al. (2017a). The key missing piece is the following local version of the compatibility condition. Let be an arbitrary subset of such that and for some independent of .
Lemma 3.1
If Assumptions 1–4 hold, then there exists such that, w.p.a.1,
[TABLE]
Note in Lemma 3.1 is either the support of or the support of . For the latter case, the index is not needed. We refer to Lemma 3.1 as the local compatibility condition because (1) there is a kernel function implementing the localization; and (2) by the Cauchy inequality, Lemma 3.1 implies
[TABLE]
Bickel et al. (2009, Lemma 4.2) show that, under Assumption 4, we have the following compatibility condition:
[TABLE]
which is the key convertibility condition used in high-dimensional analysis. We refer interested readers to Bühlmann and van de Geer (2011, Equation 6.4), the remarks after that, and Bühlmann and van de Geer (2011, Section 6.13) for more detailed discussions and further references. Under Assumption 4 and some regularity conditions assumed in the paper, Lemma 3.1 establishes a local version of (3.7). Based on Lemma 3.1, we can establish the following asymptotic probability bounds for the first stage estimators.
Theorem 3.1
Suppose Assumptions 1–2, 3.1–3.3, and 4 hold. Then
[TABLE]
[TABLE]
[TABLE]
[TABLE]
and . If in addition, Assumption 3.4 holds, then
[TABLE]
[TABLE]
[TABLE]
[TABLE]
and
Several comments are in order. First, due to the nonlinearity of the logistic link function, Assumption 3.4 is needed for deriving the asymptotic properties of the penalized local MLE estimators and . Second, the bounds in Theorem 3.1 are faster than by Assumption 5 below. This implies the estimators are sufficiently accurate so that in the second stage, their second and higher order impacts are asymptotically negligible. Last, the numbers of nonzero coordinates of and determine the complexity of our first stage estimators, which are uniformly controlled with a high probability.
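To make the localized penalization concrete, the sketch below solves a kernel-weighted Lasso problem at a single treatment value by cyclic coordinate descent. It is a simplified stand-in and not the paper's estimator: it uses one common penalty level `lam` with no data-driven penalty loadings, includes only the regressors themselves, and the simulated data, the bandwidth `h`, and all function names are assumptions made for illustration.

```python
import math
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def local_lasso(X, y, T, t0, h, lam, sweeps=200):
    """Minimize (1/(2n)) * sum_i K((T_i - t0)/h) * (y_i - x_i'b)^2 + lam * sum_j |b_j|
    by cyclic coordinate descent, with K an (unnormalized) Gaussian kernel."""
    n, p = len(X), len(X[0])
    w = [math.exp(-0.5 * ((ti - t0) / h) ** 2) for ti in T]
    b = [0.0] * p
    resid = list(y)  # residuals y - Xb at the current b
    for _ in range(sweeps):
        for j in range(p):
            a_j = sum(w[i] * X[i][j] ** 2 for i in range(n)) / n
            if a_j == 0.0:
                continue
            # Kernel-weighted correlation of column j with the partial residual.
            rho = sum(w[i] * X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / a_j
            if new != b[j]:
                for i in range(n):
                    resid[i] -= X[i][j] * (new - b[j])
                b[j] = new
    return b


random.seed(2)
n, p = 100, 5
T = [random.random() for _ in range(n)]
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [2.0 * X[i][0] for i in range(n)]  # only the first regressor matters

b_sparse = local_lasso(X, y, T, t0=0.5, h=0.2, lam=0.01)
b_zero = local_lasso(X, y, T, t0=0.5, h=0.2, lam=1e6)
print("moderate penalty:", b_sparse)
print("very large penalty:", b_zero)
```

Repeating the fit over a grid of `t0` values gives a different selected set at each treatment level, which is the sense in which the selection is localized.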
For the conditional density estimation, we have the following results.
Theorem 3.2
Suppose Assumptions 1–2, 3.1–3.3, and 4 hold. Then
[TABLE]
[TABLE]
[TABLE]
[TABLE]
and
The rates of convergence in Theorem 3.2 are the same as those derived in Belloni et al. (2018b, Section 8).
3.2 The Second Stage Estimation
Let and . For three generic functions , and of , denote
[TABLE]
and
[TABLE]
Then the estimators and can be written as
[TABLE]
where , , and are either the Lasso estimators (i.e., , , and ) or the post-Lasso estimators (i.e., , , and ) as defined in Section 3.1.
Assumption 5
Let for some positive constant .
1. , , and , and .
2. , , , and .
Theorem 3.3
Suppose Assumptions 1–4 and 5.1 hold. Then
[TABLE]
and
[TABLE]
where
[TABLE]
[TABLE]
, and If Assumption 5.1 is replaced by Assumption 5.2, then
[TABLE]
Theorem 3.3 presents the Bahadur representations of the nonparametric estimators and with a uniform control on the remainder terms. For most purposes (e.g., to obtain the asymptotic distributions of these intermediate estimators or to obtain the results below), Assumption 5.1 is sufficient. Occasionally, one needs to impose Assumption 5.2 to have a better control on the remainder terms, say, when one conducts an -type specification test. See the remark after Theorem 3.4 below.
3.3 The Third Stage Estimation
Recall that denotes the -th quantile of , which is the inverse of w.r.t. . We propose to estimate by where and is the rearrangement of .
We rearrange to make it monotonically increasing in . Following Chernozhukov et al. (2010), for a generic function , we define where can be any increasing bijective mapping: and is the inverse of . Then the rearrangement of is defined as
[TABLE]
where . Then the rearrangement for is
The rearrangement and inverse are two functionals operating on the process
[TABLE]
and are shown to be Hadamard differentiable by Chernozhukov et al. (2010) and van der Vaart and Wellner (1996), respectively. However, by Theorem 3.3,
[TABLE]
which is not asymptotically tight. Therefore, the standard functional delta method used in Chernozhukov et al. (2010) and van der Vaart and Wellner (1996) is not directly applicable. The next theorem overcomes this difficulty and establishes the linear expansion of the quantile estimator. Denote , , , and as , the -enlarged set of , the closure of , and the projection of on , respectively.
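The two operations discussed in this subsection, rearrangement and inversion, have a simple discrete analogue. The sketch below is an illustration under assumptions: the estimated distribution function is evaluated on an equally spaced grid, so the monotone rearrangement reduces to sorting the values, and the quantile is read off as a generalized inverse on that grid. The grid, the values in `cdf_hat`, and the function names are made up for the example.

```python
def rearrange(values):
    """Monotone rearrangement on an equally spaced grid: sort the function values."""
    return sorted(values)


def quantile_from_cdf(grid, cdf_values, tau):
    """Generalized inverse: the smallest grid point u with F(u) >= tau."""
    for u, f in zip(grid, cdf_values):
        if f >= tau:
            return u
    return grid[-1]  # tau exceeds every value on the grid


grid = [0.0, 1.0, 2.0, 3.0, 4.0]
cdf_hat = [0.10, 0.35, 0.30, 0.70, 0.90]  # not monotone between u = 1 and u = 2

cdf_rearranged = rearrange(cdf_hat)
q_median = quantile_from_cdf(grid, cdf_rearranged, 0.5)
print(cdf_rearranged)  # [0.1, 0.3, 0.35, 0.7, 0.9]
print(q_median)        # 3.0
```

Sorting leaves an already monotone estimate unchanged, so the rearrangement only acts where the plug-in estimate violates monotonicity.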
Theorem 3.4
Suppose that Assumptions 1–4 and 5.1 hold. If for any , then
[TABLE]
where is the density of , , and If Assumption 5.1 is replaced by Assumption 5.2, then
[TABLE]
Under Assumption 5.2, the remainder term is uniformly in . This result is needed if one wants to establish an -type specification test of . For example, one may be interested in testing the null hypothesis that the quantile partial derivative is homogeneous across treatment. In this case, the null hypothesis can be written as
[TABLE]
and the alternative hypothesis is the negation of . One way to conduct a consistent test for the above hypothesis is to employ the residuals of the linear regression of on to construct the test statistic , i.e.,
[TABLE]
where are the linear coefficient estimators. This type of specification test has been previously studied by Su and Chen (2013), Lewbel et al. (2015), Su et al. (2015), Hoderlein et al. (2016), and Su and Hoshino (2016) in various contexts. One can follow them and apply the results in Theorem 3.4 to study the asymptotic distribution of for each . In addition, one can also consider either an integrated or a sup-version of and then study its asymptotic properties. For brevity we do not study such a specification test in this paper.
Given the estimators and , we can run local linear regressions of and on and obtain estimators and of and , respectively, as estimators of the linear coefficients in the local linear regression. (Footnote 6: Alternatively, one can consider the local quadratic or cubic regression.) Specifically, we define
[TABLE]
and
[TABLE]
where is the second-stage bandwidth. It is possible to use a third bandwidth in this step. Results similar to Theorem 3.5 below still hold if . Note that the usual optimal bandwidth for the kernel estimator of the derivative is . However, because , the requirement that implies the optimal bandwidth is not achievable. The key reason is that, unlike the usual local linear regression, we need to plug in the estimates of and . For simplicity, we just take
The following theorem shows the asymptotic properties of and .
Theorem 3.5
Suppose Assumptions 1–4 and 5.1 hold. If for any , then
[TABLE]
and
[TABLE]
where and .
Theorem 3.5 presents the Bahadur representations for and . Since they are estimators for the first order derivatives and , respectively, we can show that they converge to the true values at the -rate. Such a rate is common for kernel estimations of the first-order derivative of the conditional expectation; see, e.g., Li and Racine (2007, Theorem 2.10).
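The local linear step used in the third stage can be sketched as a kernel-weighted least squares fit whose slope coefficient estimates the derivative. The code below is a generic illustration, not the paper's implementation: the response `y_obs` stands in for whatever second-stage estimate is being differentiated, and the Gaussian kernel, the bandwidth, and the test data are assumptions.

```python
import math


def gaussian_kernel(z):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def local_linear(t0, t_obs, y_obs, h):
    """Return (level, slope) from kernel-weighted least squares of y on (1, T - t0).

    The slope is the local linear estimate of the derivative at t0.
    """
    s0 = s1 = s2 = r0 = r1 = 0.0
    for t, y in zip(t_obs, y_obs):
        d = t - t0
        w = gaussian_kernel(d / h)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        r0 += w * y
        r1 += w * d * y
    det = s0 * s2 - s1 * s1
    level = (s2 * r0 - s1 * r1) / det
    slope = (s0 * r1 - s1 * r0) / det
    return level, slope


# A linear function is reproduced exactly by a local linear fit.
t_obs = [i / 20 for i in range(21)]
y_obs = [2.0 + 3.0 * t for t in t_obs]
level, slope = local_linear(0.5, t_obs, y_obs, h=0.2)
print(level, slope)
```

Because the fit is exact for linear functions, any bias comes from curvature, which is why the bandwidth governs the bias of the derivative estimate.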
4 Inference
In this section, we study the inference for and . We follow the lead of Belloni et al. (2017a) and consider the weighted-bootstrap inference. Let be a sequence of i.i.d. random variables generated from the distribution of such that it has sub-exponential tails and unit mean and variance. (Footnote 7: A random variable has sub-exponential tails if for every and some constants and .) For example, can be a standard exponential random variable or a normal random variable with unit mean and standard deviation. We conduct the bootstrap inference based on the following procedure.
1. Obtain , , , , and from the first stage.
2. For the -th bootstrap sample:
- Generate from the distribution of .
- Compute
[TABLE]
and
[TABLE]
where are either or .
- Rearrange and obtain .
- Invert w.r.t. and obtain .
- Compute and as the slope coefficients of local linear regressions of on and on , respectively.
3. We repeat the above step for and obtain a bootstrap sample of
[TABLE]
4. Obtain , , , and as the -th quantile of the sequences , , , and , respectively.
The standard percentile bootstrap confidence interval for is
[TABLE]
However, in our simulation study, we find that it slightly undercovers. Instead, we use the fact that the normal CDF is symmetric and propose to use the modified percentile bootstrap confidence interval as follows:
[TABLE]
where . We define , , and in the same manner. The following theorem summarizes the main results in this section.
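The mechanics of the weighted bootstrap can be sketched on a toy estimator. The example below is only an illustration under assumptions: the statistic is a self-normalized weighted sample mean rather than the paper's multi-stage estimators, the weights are standard exponential (unit mean and variance, one of the choices mentioned above), and the interval shown is the standard percentile interval, not the modified one. The data and all names are hypothetical.

```python
import math
import random


def weighted_mean(y, w):
    """Self-normalized weighted mean; w = 1 for all i gives the original estimator."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def empirical_quantile(draws, q):
    """q-th empirical quantile of a list of bootstrap draws."""
    s = sorted(draws)
    k = max(0, min(len(s) - 1, math.ceil(q * len(s)) - 1))
    return s[k]


random.seed(1)
y = [random.gauss(1.0, 2.0) for _ in range(200)]  # toy data
estimate = weighted_mean(y, [1.0] * len(y))

B = 500  # number of bootstrap replications
boot = []
for _ in range(B):
    eta = [random.expovariate(1.0) for _ in y]  # i.i.d. weights, mean 1 and variance 1
    boot.append(weighted_mean(y, eta))

alpha = 0.10
ci = (empirical_quantile(boot, alpha / 2), empirical_quantile(boot, 1 - alpha / 2))
print("estimate:", estimate)
print("90% percentile interval:", ci)
```

In the paper's procedure the same weights are carried through the rearrangement, inversion, and local linear steps within each replication, with the first-stage estimates held fixed.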
Theorem 4.1
Suppose that Assumptions 1–4 and 5.1 hold and . Then
[TABLE]
[TABLE]
[TABLE]
and
[TABLE]
Theorem 4.1 implies that, via under-smoothing, the bootstrap confidence intervals for , and have the correct asymptotic coverage probability . We need to under-smooth because, regardless of under-smoothing, the bootstrap estimator is always centered around the original estimator without the asymptotic bias. With more complicated notations and the arguments of strong approximation in Chernozhukov et al. (2014b) and Chernozhukov et al. (2014a), one can show that the validity of bootstrap inference holds uniformly over . One of the key ingredients to verify Chernozhukov et al. (2014a, Condition H1) is the linear expansions of the estimators with a uniform control of the remainder terms, which has already been established in Theorems 3.4 and 3.5.
5 Monte Carlo Simulations
This section presents the results of Monte Carlo simulations, which demonstrate the finite sample performance of the estimation and inference procedure. Let be generated as
[TABLE]
while be generated as
[TABLE]
where and are two standard logistic random variables such that and , and are the logistic and normal CDFs, respectively, , is a -dimensional random vector whose distribution is the Gaussian copula with covariance parameter , and is a vector of basis functions constructed from . Note that ranges from [math] to . The parameters of interest are and , where and . We consider the following three designs:
1. (Exact sparse) for , , and , ;
2. (Approximate sparsity) for and , ;
3. (Sieve basis) and , . We construct as the cubic spline basis functions of :
[TABLE]
where denotes the -th empirical quantile of , . This results in 169 basis functions. We further remove the basis functions with variance less than . We end up with about 128 basis functions on average. (Footnote 8: The number of basis functions varies slightly across simulations.)
Note that the sum of the coefficients is (approximately) for all three designs. We normalize the basis functions by their sample means and standard errors.
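The screening and normalization of the basis functions can be sketched as follows. This is an illustration under assumptions: the variance cutoff `min_variance` is a hypothetical parameter (the value used in the paper is not reproduced here), and "normalizing by sample means and standard errors" is read as subtracting the sample mean and dividing by the sample standard deviation.

```python
import math
import statistics


def standardize_columns(columns, min_variance):
    """Drop columns whose sample variance is below min_variance, then rescale
    each remaining column to sample mean 0 and sample standard deviation 1."""
    kept = []
    for col in columns:
        var = statistics.variance(col)  # sample variance (n - 1 denominator)
        if var < min_variance:
            continue  # nearly constant basis function: discard
        mean = statistics.mean(col)
        sd = math.sqrt(var)
        kept.append([(v - mean) / sd for v in col])
    return kept


# Toy basis matrix stored column-wise; the second column is constant.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.0, 5.0, 5.0],
    [2.0, 4.0, 4.0, 6.0],
]
basis = standardize_columns(columns, min_variance=1e-4)  # hypothetical cutoff
print(len(basis), "columns kept")
```

Standardizing the columns puts all basis functions on the same scale, so a common penalty level treats them symmetrically.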
We use the Gaussian kernel function in all three stages. We have four tuning parameters: , , , and . As we discussed in Section 3.1, we use
[TABLE]
where and . We use the rule-of-thumb bandwidth for , i.e., . Last, we build based on the rule-of-thumb bandwidth for the local quantile regression suggested by Yu and Jones (1998). In particular, Yu and Jones (1998) propose the bandwidths , where is a constant dependent only on , and is the bandwidth for the kernel estimation of . (Footnote 9: We refer interested readers to Yu and Jones (1998, Table 1) for more details on .) In our simulation studies, as is nearly constant over , we just choose for all the quantile index . We use the leave-one-out cross-validation to search for the optimal bandwidth of over a grid in . The resulting bandwidth is denoted as . In order to achieve under-smoothing, we define , where our choice of the factor follows Cai and Xiao (2012, p.418).
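The leave-one-out search and the subsequent under-smoothing can be sketched as below. This is a simplified illustration under assumptions: it uses a Nadaraya-Watson (local constant) mean regression with a Gaussian kernel, a small made-up bandwidth grid, noiseless sine data, and a placeholder shrinkage factor `shrink`; the grid and the under-smoothing factor used in the paper are not reproduced here.

```python
import math


def nw_loo_prediction(i, x, y, h):
    """Nadaraya-Watson prediction at x[i] using all observations except i."""
    num = den = 0.0
    for j, (xj, yj) in enumerate(zip(x, y)):
        if j == i:
            continue
        w = math.exp(-0.5 * ((xj - x[i]) / h) ** 2)
        num += w * yj
        den += w
    return num / den


def loo_cv_bandwidth(x, y, grid):
    """Return the bandwidth in grid with the smallest leave-one-out squared error."""
    def cv(h):
        errors = [(y[i] - nw_loo_prediction(i, x, y, h)) ** 2 for i in range(len(x))]
        return sum(errors) / len(errors)
    return min(grid, key=cv)


n = 40
x = [i / (n - 1) for i in range(n)]
y = [math.sin(2.0 * math.pi * xi) for xi in x]
grid = [0.05, 0.1, 0.2, 0.4]  # illustrative grid

h_cv = loo_cv_bandwidth(x, y, grid)
shrink = n ** (-0.1)  # hypothetical under-smoothing factor, strictly below one
h_under = h_cv * shrink
print("cross-validated bandwidth:", h_cv)
print("under-smoothed bandwidth:", h_under)
```

Multiplying the cross-validated bandwidth by a factor that vanishes slowly with the sample size makes the bias negligible relative to the standard error, which is what the bootstrap intervals require.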
We repeat the bootstrap inference 500 times and all the results are based on 500 Monte Carlo simulations. The sample size is . Although the sample size is large compared to , in this DGP, the first-stage bandwidth is as small as . The effective sample size for the first-stage estimation is of order of magnitude of . In fact, we obtained warning signs of potential multi-collinearity and were unable to estimate the model when implementing the traditional estimation procedures without variable selection (i.e., without penalization).
The upper-left subplots in Figures 1, 4, 7 and 2, 5, 8 report the true functions of and for , and DGP 1, 2, and 3, respectively. Both and are heterogeneous across and , which imposes difficulties for estimation and inference. The rest of the subplots in the above figures show the estimation biases and standard errors. We observe that all the biases of our estimators are of smaller order of magnitude than the standard error (std) and the root mean squared error (rMSE), which indicates the doubly robust moments effectively remove the selection bias induced by the Lasso method. The estimators of the quantile functions are very accurate. The estimators of the quantile partial derivatives are less so because they have slower convergence rates. Figures 3, 6, and 9 show that the 90% point-wise modified percentile bootstrap confidence intervals have reasonable performance for both the quantile functions and their derivatives, across all and values considered, with slight over-coverage for the quantile derivative functions. The results of variable selections depend on the values of and for conditional density estimation and penalized local MLE, respectively, which are tedious to report. Thus, they are omitted for brevity. Overall, 2 to 4 covariates are selected.
In Section D in the Appendix, we report the performance of oracle estimators for the three designs, in which oracle estimators are computed using the true conditional CDF and density functions. We also report the finite-sample performance of our mean potential outcome (i.e., ) estimators, which is similar to that of the quantile effect estimates reported here. Last, we consider an extra design in which the approximate sparsity condition may be violated and show that our method breaks down. We use this design to illustrate the limitation of our method.
6 Empirical Illustration
To investigate our proposed estimation and inference procedures, we use the 1979 National Longitudinal Survey of Youth (NLSY79) and consider the effect of father's income on son's income in the presence of many control variables. Our analysis is based on Bhattacharya and Mazumder (2011). The data consist of a nationally representative sample of individuals aged 14-22 as of 1979. We use only white and black males and discard the individuals with missing values in the covariates we use. The resulting sample size is 1,795, out of which 1,302 individuals are white and 493 individuals are black.
The treatment variable of interest is the logarithm of father's income, in which father's income is computed as the average family income for 1978, 1979, and 1980. The outcome variable is the logarithm of son's income, in which son's income is computed as the average family income for 1997, 1999, 2001 and 2003. We create control variables by interacting a list of demographic variables with the cubic splines of the AFQT score and the years of education. (Footnote 10: The cubic splines for the AFQT score are constructed based on the normalized value by scaling the raw AFQT score into [0,1], where the knots are taken at the quantiles of the normalized AFQT score at . The cubic splines for the years of education are constructed in the same way. In this exercise, we do not interact the cubic splines for the AFQT score and the years of education.) The list includes the age, the mother's education level, the father's education level, the indicators of (i) living in urban areas at age 14, (ii) living in the south, (iii) speaking a foreign language at childhood, and (iv) being born outside the U.S. We drop the variables whose variance is less than . The resulting numbers of control variables are 120 for whites and 145 for blacks.
We apply the proposed estimation and inference procedures for black and white individuals separately. We use the same tuning parameter choices as in the previous section. (Footnote 11: In Section E in the Appendix, we investigate the sensitivity of our estimation method with respect to the tuning parameters.) As a result, our effective sample sizes are of orders of magnitude and for whites and blacks, respectively. Figures 10 and 11 show the estimated unconditional quantile functions and the estimated derivative, as well as the point-wise 90% confidence bands for and taking values at the , , and quantiles of the empirical distribution of . In the context of intergenerational income mobility, the unconditional quantile and its derivative represent the quantile of son's potential log income indexed by father's log income and the intergenerational elasticity, respectively. The unconditional quantile functions have a slight upward trend and the estimated derivative is positive in most parts of father's log income. The confidence bands for the unconditional quantile functions are quite narrow for both black and white individuals. For white individuals with the values of father's log income at the or quantile, we can reject the (locally) zero intergenerational elasticity for most of the values of . For the other cases, we cannot reject the (locally) zero intergenerational elasticity for almost all 's. This is considered as the cost of our fully nonparametric specification.
It is worthwhile to mention the variable selection in this application. The years of education, the AFQT score, the age, the father's education level, and the mother's education level are the leading control variables selected. (Footnote 12: More precisely, for whites, and are the two most selected control variables for the density estimations. and are the two most selected control variables for the penalized local MLE. For blacks, and are the two most selected control variables for the density estimations. and are the two most selected control variables for the penalized local MLE.)
7 Conclusion
This paper studies non-separable models with a continuous treatment and high-dimensional control variables. It extends the existing results on the causal inference in non-separable models to the case with both continuous treatment and high-dimensional covariates. It develops a method based on localized L1-penalization to select covariates at each value of the continuous treatment. It then proposes a multi-stage estimation and inference procedure for average, quantile, and marginal treatment effects. The simulation and empirical exercises support the theoretical findings in finite samples.
Appendix
Appendix A Proof of the Main Results in the Paper
Before proving the theorem, we first introduce some additional notation and Assumption 6, which is a restatement of Sasaki (2015, Assumptions 1 and 2) in our framework. Denote by (resp. ) the dimensionality of (resp. ). We define and can be parametrized as a mapping from a -dimensional rectangle, denoted by , to . is the -dimensional Hausdorff measure restricted from to , where is the set of the interactions between and a Borel set in . (resp. ) is the velocity of at with respect to (resp. ).
Assumption 6
1. is continuously differentiable.
2. on .
3. The conditional distribution of given is absolutely continuous with respect to the Lebesgue measure, and is a continuously differentiable function of to .
4. .
5. is a continuously differentiable function of to for every and is a continuously differentiable function of to for every .
6. The mapping is a continuously differentiable function of to and is a continuously differentiable function of to .
7. There is with such that the mapping is bounded in and that the mapping is bounded in .
Assumption 6 is a combination of Assumptions 1 and 2 in Sasaki (2015). We refer the readers to the paper for detailed explanation.
Proof of Theorem 2.1. For the marginal distribution of , we note that, by Assumption 1, The first result follows as is identified.
For the second result, consider a random variable which has the same marginal distribution as and is independent of . Define
[TABLE]
Note that (i) and are independent, and (ii) the -th quantile of given is for all , because . Assumption 6 implies Assumptions 1 and 2 in Sasaki (2015) for with , and then his Theorem 1 implies that the derivative of the -th quantile of given is equal to . Therefore, . Note that Theorem 1 in Sasaki (2015) does not apply directly to , because our assumptions do not imply that and are independent.
Lemma 3.1 is the local version of the compatibility condition, which is one of the key building blocks for Lemma A.1. Then, Lemma A.1 is used to prove Theorem 3.1.
Proof of Lemma 3.1. By Assumption 4, we can work on the set
[TABLE]
We use the same partition as in Bickel et al. (2009). Let and be an integer which will be specified later. Partition , the complement of , as such that for , , where , for , contains the indexes corresponding to largest coordinates (in absolute value) of outside , and collects the remaining indexes. Further denote and . Then
[TABLE]
For the first term on the right hand side (r.h.s.) of (A.2), we have
[TABLE]
where the second inequality holds because
[TABLE]
We next bound the last term on the r.h.s. of (A.2). The second term can be bounded in the same manner. Let . Then we have
[TABLE]
Let be a sequence of Rademacher random variables which is independent of the data and with envelope . Denote as with . Then,
[TABLE]
where the first inequality is by van der Vaart and Wellner (1996, Lemma 2.3.1), the second inequality is by Ledoux and Talagrand (2013, Theorem 4.12) and the remark thereafter, and the third one is by applying Corollary 5.1 of Chernozhukov et al. (2014b) with and, for some ,
[TABLE]
By Assumption 2, Then we have, w.p.a.1.,
[TABLE]
By the same token we can show that
[TABLE]
Therefore, we have, w.p.a.1.,
[TABLE]
Combining (A.3), (A.4), and (A.5) yields that w.p.a.1.,
[TABLE]
Analogously, we can show that, w.p.a.1,
[TABLE]
Following (A.2), we have, w.p.a.1,
[TABLE]
where the second inequality holds because, by construction, Since , , and thus, for large enough, the constant inside the brackets is greater than which is independent of . Therefore, we can conclude that, for large enough,
[TABLE]
This completes the proof of the lemma.
We aim to prove the results with regard to and in Theorem 3.1. The derivations for the results regarding and are exactly the same. We do not need to deal with the nonlinear logistic link function when deriving the results regarding , , , and . Therefore, the corresponding results can be shown by following the same proof strategy as below and treating defined below as . The proofs for results regarding , , , and are omitted for brevity.
Let , , , , and be the support of . We need the following four lemmas, whose proofs are relegated to the online supplement.
Lemma A.1
If Assumptions 1–4 hold, then
[TABLE]
and
[TABLE]
Lemma A.2
Suppose Assumptions 1–4 hold. Let . Then
[TABLE]
Lemma A.3
If the assumptions in Theorem 3.1 hold, then there exists a constant such that w.p.a.1,
[TABLE]
For any and defined in Algorithm 2, there exists a constant such that, w.p.a.1,
[TABLE]
In addition, for any and defined in Algorithm 2, there exist constants independent of , , and such that, element-wise and w.p.a.1,
[TABLE]
Lemma A.4
If the assumptions in Theorem 3.1 hold, then w.p.a.1,
[TABLE]
Proof of Theorem 3.1. By the mean value theorem, there exist and such that
[TABLE]
where By the proof of Lemma A.1, we have, w.p.a.1,
[TABLE]
Therefore, by Lemma A.1 and Assumptions 4 and 5, we have
[TABLE]
where the last equality is because by Lemma A.1 and by Assumption 5. In addition, under Assumption 3.4 we have
[TABLE]
Hence, there exist some positive constants and only depending on such that, w.p.a.1,
[TABLE]
and uniformly over ,
[TABLE]
By Assumptions 3.3, 3.4, Lemma A.1, and the fact that is bounded and bounded away from zero uniformly over , we have, w.p.a.1,
[TABLE]
and
[TABLE]
Next, recall that . By the first order conditions (FOC), for any , we have
[TABLE]
Denote . By Lemmas A.1, A.2 and A.8, for any , with probability greater than , there exist positive constants and , which only depend on and are independent of , such that
[TABLE]
where and . This implies that there exists a constant only depending on , such that, with probability greater than ,
[TABLE]
Let . We claim that, for any , . Suppose not and there exists such that . Then,
[TABLE]
where the second inequality holds because of Belloni and Chernozhukov (2011, Lemma 23), the third inequality holds because for any , and the last inequality holds because . Therefore we reach a contradiction. In addition, by Lemma A.4, we can choose , which is independent of , such that
[TABLE]
This implies and thus with probability greater than , . This result holds uniformly over .
Last, we show that
[TABLE]
Let , , and
[TABLE]
By (A), (A), and (A.13), for any , there exists a constant such that, with probability greater than , uniformly in . Therefore, with probability greater than ,
[TABLE]
where $\mathcal{F}=\left\{(J(X)-\phi_{t,u}(X))^{2}\biggl[K(\frac{T-t}{h_{1}})-\mathbb{E}(K(\frac{T_{i}-t}{h_{1}})|X)\biggr]:J\in\mathcal{J}_{t,u},(t,u)\in\mathcal{TU}\right\}$ with bounded envelope. Note that,
[TABLE]
In addition, we note that is nested by
[TABLE]
where
[TABLE]
Therefore, by Chernozhukov et al. (2014b, Corollary 5.1), we have
[TABLE]
Therefore,
[TABLE]
where the last equality holds due to (A) and (A.14). Canceling the 's on both sides, we obtain the desired result.
Proof of Theorem 3.2. By Belloni et al. (2017a, Theorem 6.2), we have
[TABLE]
and
[TABLE]
Then, we have
[TABLE]
and similarly,
[TABLE]
Proof of Theorem 3.3. Let where either or is a random variable that has sub-exponential tails with unit mean and variance. When , , which is our original estimator. When is random, for ,
[TABLE]
is the bootstrap estimator. In the following, we establish the linear expansion of .
Recall and . By Theorems 3.1 and 3.2, for any , there exists a constant such that, with probability greater than , uniformly in and uniformly in . Here, we denote
[TABLE]
and
[TABLE]
We focus on the case in which . Then
[TABLE]
where
Below we fix First,
[TABLE]
where the term holds uniformly in . For term , uniformly over , we have
[TABLE]
The second equality of (A.15) follows because there exists a constant independent of such that
[TABLE]
and then
[TABLE]
The third equality of (A.15) holds because . The fourth equality of (A.15) holds by the fact that , is assumed to be bounded away from zero uniformly over and the Cauchy inequality. The fifth inequality of (A.15) holds because
[TABLE]
and for some constant independent of ,
[TABLE]
For the term , we have
[TABLE]
where
[TABLE]
Note has envelope ,
[TABLE]
The second last inequality in the above display holds because is bounded away from zero uniformly in , where belongs to some compact enlargement of . Furthermore, is nested by
[TABLE]
where
[TABLE]
In addition, we claim . When , the above claim holds trivially. When has sub-exponential tails, and the claim holds by van der Vaart and Wellner (1996, Lemma 2.2.2). Therefore, by Chernozhukov et al. (2014b, Corollary 5.1), we have
[TABLE]
Combining the bounds for , , and , we have
[TABLE]
and
[TABLE]
Then, when ,
[TABLE]
Then, Assumption 5 implies that . For the bootstrap estimator, we have
[TABLE]
where This is because of the fact that
[TABLE]
[TABLE]
and the collection of functions
[TABLE]
satisfies
[TABLE]
Therefore,
[TABLE]
where
[TABLE]
Proof of Theorem 3.4. Let be either the original or the bootstrap estimator of . We first derive the linear expansion of the rearrangement of defined in the proof of Theorem 3.3. For , let
[TABLE]
where is defined in Section 3.3. Then, by Lemma B.2 in the online supplement, we have
[TABLE]
and
[TABLE]
where , is the density of , is the -th quantile of , and equals either or , depending on whether Assumption 5.1 or 5.2 is in place.
Combining (A.17) and (A.18), we have
[TABLE]
uniformly over .
We can apply Lemma B.2 on again with , , , and . Then, for equal to or under Assumption 5.1 or 5.2, respectively, we have,
[TABLE]
uniformly over
When , combining (A.19), (A.20), and Theorem 3.3, we have
[TABLE]
By taking and under Assumptions 5.1 and 5.2, respectively, we have established the desired results. For the bootstrap estimator, by (A), we have
[TABLE]
Then,
[TABLE]
By taking and under Assumptions 5.1 and 5.2, respectively, we have established the linear expansion of the bootstrap estimator too. Last, note that the bootstrap estimator cannot preserve the asymptotic bias term. For the validity of bootstrap inference, we need to under-smooth and require . This condition is assumed in Theorem 4.1.
Proof of Theorem 3.5. We consider the general case in which the observations are weighted by as above. For brevity, denote and For any variable and some deterministic sequence , we write (resp. ) if (resp. ). Then where
[TABLE]
and
[TABLE]
Let and . Then we have
[TABLE]
In addition, note
[TABLE]
and
[TABLE]
Therefore,
[TABLE]
Let . By Theorem 3.4, we have
[TABLE]
Let . Then, by plugging (A.22) in (A.21) and noticing that
[TABLE]
we have
[TABLE]
where , and
[TABLE]
for Let . Because , we have
[TABLE]
where and is a U-process indexed by . By Lemma B.3 in the online supplement,
[TABLE]
Combining (A.23) and (A.24), we have
[TABLE]
Proof of Theorem 4.1. By the proofs of Theorems 3.4 and 3.5, we have
[TABLE]
and
[TABLE]
Then, it is straightforward to show that and converge weakly to the limiting distribution of and , respectively, conditional on data in the sense of van der Vaart and Wellner (1996, Section 2.9). The desired results then follow.
Appendix B Proofs of the Technical Lemmas
Lemma A.1 and Lemma B.1 below are closely related to Lemmas J.6 and O.2 in Belloni et al. (2017a) with one major difference: we have an additional kernel function which affects the rate of convergence. We follow the proof strategies in Belloni et al. (2017a) in general, but use the local compatibility condition established in Lemma 3.1 when needed. We include these proofs mainly for completeness. Lemma A.2 is proved without referring to the theory of moderate deviations for self-normalized sums, in contrast to the proof of Lemma J.1 in Belloni et al. (2017a). Consequently, we have the additional term but avoid one constraint on the rates of , , and , as well.
Proof of Lemma A.1. We define the following three events:
[TABLE]
[TABLE]
and
[TABLE]
where , , and are defined in the statement of Lemma A.8 and the generic penalty loading matrix is for .
By Assumption 2.4, for an arbitrary , we can choose and sufficiently large so that By Lemma A.2 below and the fact that , for any and any , for sufficiently large, we have In particular, we choose such that . Last, by Lemma A.8 below, for some deterministic sequence .
From now on we assume , , and hold with constants , , , and , which occurs with probability greater than . Let and . Let
[TABLE]
and
[TABLE]
Then, under ,
[TABLE]
Let . By the fact that solves the minimization problem in (3.5), we have
[TABLE]
Because the kernel function is nonnegative, is convex in . It follows that
Let and . Then,
[TABLE]
where . Combining (B.1) and (B.2), we have
[TABLE]
Then
[TABLE]
We will consider two cases: and
First, if , i.e., , then
[TABLE]
Noting that , we have
[TABLE]
Now, we consider the case where . By Lemma 3.1, we have
[TABLE]
In addition, . If , then
[TABLE]
In this case, .
In sum, we have
[TABLE]
and
Recall and denote
[TABLE]
Then, w.p.a.1., for some between [math] and ,
[TABLE]
where the second line holds because . In addition, by Lemma B.1 below and equations (B.1)–(B.3), we have
[TABLE]
where the last inequality holds because . If
[TABLE]
then
[TABLE]
and
[TABLE]
Since holds,
[TABLE]
Further note that . Hence, if (B.4) holds, then (B.5) and (B.6) imply that
[TABLE]
with and
[TABLE]
with , which are the desired results.
Last, we verify (B.4). By Lemma B.1, since ,
[TABLE]
This concludes the proof.
Proof of Lemma A.2. By Lemma A.8 below, is bounded away from zero w.p.a.1, uniformly over . Therefore, we can just focus on bounding
[TABLE]
For -th element, ,
[TABLE]
where is a universal constant independent of . In addition,
[TABLE]
Therefore,
[TABLE]
Next, we turn to the centered term: where with envelope . Note that and $\sup_{Q}N(\mathcal{G},e_{Q},\varepsilon\|G\|)\leq p\bigl(\frac{A}{\varepsilon}\bigr)^{v}$ for some and . So by Corollary 5.1 of Chernozhukov et al. (2014b), we have
[TABLE]
because .
Proof of Lemma A.8. For the first result, we have
[TABLE]
Let . Then,
[TABLE]
Similarly,
[TABLE]
In addition, denote with envelope . The entropy of is bounded by . In addition, . Therefore,
[TABLE]
Therefore, w.p.a.1,
[TABLE]
For , we let with envelope . By the same argument as above, we can show that, w.p.a.1,
[TABLE]
For , we have, w.p.a.1,
[TABLE]
Similarly, we can show that w.p.a.1.
[TABLE]
This concludes the second result with for . The last result holds with and
Proof of Lemma A.4. Following the same arguments as used in the proof of Lemma 3.1 and by Assumption 5, we have, w.p.a.1,
[TABLE]
where the second inequality holds because
[TABLE]
Lemma B.1
Recall that . Let , and . Let events , , and defined in the proof of Lemma A.1 hold. Then, for any and , we have
[TABLE]
and w.p.a.1,
[TABLE]
Proof. The proof closely follows that of Lemma O.2 in Belloni et al. (2017a). Note that
[TABLE]
where . Let . Then
[TABLE]
[TABLE]
and
[TABLE]
By Lemmas O.3 and O.4 in Belloni et al. (2017a),
[TABLE]
Let Then
[TABLE]
It follows that
[TABLE]
and
[TABLE]
We consider two cases: and
First, if , we have
[TABLE]
and
[TABLE]
When , we let . Then by the convexity of and the fact that , we have
[TABLE]
Consequently, we have
For the second result, note that
[TABLE]
If , then by Lemma 3.1
[TABLE]
If , where is defined in the proof of Lemma A.1, then
[TABLE]
Combining the above two results, we obtain that
[TABLE]
Lemma B.2
Let be the -th quantile of , the unconditional density of ,
[TABLE]
, and Then, for being either or , depending on whether Assumption 5.1 or 5.2 is in place,
[TABLE]
and
[TABLE]
uniformly over
Proof. Let for . Then, we have
[TABLE]
We prove the lemma by applying Propositions C.1 and C.2 in Appendix C.
First, we verify Assumption 7 with and under Assumptions 5.1 and 5.2, respectively, in order to apply Proposition C.1 to prove (B.7). We only consider the case in which as the case can be studied similarly. Note that , uniformly over , and can be chosen such that uniformly over . This verifies Assumption 7.1.
For Assumption 7.2, by Theorem 3.3, . So we can take . In addition, because . So we only need to show
[TABLE]
Let
[TABLE]
with envelope . By Theorem 3.3, we have
[TABLE]
. So we only have to show that
[TABLE]
We know that is VC-type with fixed VC index and that In addition, as shown in the proof of Theorem 3.4, . Therefore, by Corollary 5.1 of Chernozhukov et al. (2014b), we have
[TABLE]
Given , because for some . This establishes (B.9). Then (B.7) follows by Proposition C.1.
To prove (B.8), we apply Proposition C.2 by verifying Assumption 8. We note that and . Furthermore, notice that , ,
[TABLE]
and
[TABLE]
Because is bounded and bounded away from zero uniformly over , so is . In addition,
[TABLE]
which is bounded because is bounded. This verifies Assumption 8.2.
For Assumption 8.3, we note that
[TABLE]
where the is uniform over . In addition, by definition, , is bounded away from zero, and we can choose such that is bounded. Therefore, by Theorem 3.3,
[TABLE]
We can choose . In addition, because . So we only need to show that
[TABLE]
Note that, for and
[TABLE]
In addition, is Lipschitz uniformly over . Thus,
[TABLE]
given that for some . This completes the verification of Assumption 8.3.
Last, it is essentially the same as above to verify Assumption 8 for . The proof is omitted.
Lemma B.3
Suppose the conditions in Theorem 3.5 hold. Then
[TABLE]
Proof. Note that
[TABLE]
where assigns probability to each pair of observations and
[TABLE]
Let Note that is nested by a VC-class with fixed VC-index and has envelope for some large constant . Then, by Chen and Kato (2017, Corollary 5.6), there exist some constants and such that
[TABLE]
which implies that
[TABLE]
Now we compute , whose first and second elements are
[TABLE]
and
[TABLE]
respectively. By the usual maximal inequality,
[TABLE]
For the second element in , we first note that
[TABLE]
and
[TABLE]
Therefore, by the usual maximal inequality,
[TABLE]
Next, we turn to
[TABLE]
which has zero mean. Note that
[TABLE]
Therefore, by Chernozhukov et al. (2014b, Corollary 5.1), we have
[TABLE]
and
[TABLE]
In addition, note that
[TABLE]
Therefore, by Chernozhukov et al. (2014b, Corollary 5.1),
[TABLE]
and
[TABLE]
Combining the above results and denoting , we have
[TABLE]
and
[TABLE]
Combining (B.10), (B.11), and (B.12), we have the desired results.
Appendix C Rearrangement Operator on A Local Process
The rearrangement operator was previously studied by Chernozhukov et al. (2010), who required the underlying process to be tight in order to apply the continuous mapping theorem. However, the local processes encountered in our paper are not tight due to the presence of the kernel function. Therefore, the original results on the rearrangement operator cannot be directly applied to our case. Instead, in this section, we extend the results in Chernozhukov et al. (2010) to the case in which the underlying process is not tight.
Let be a generic monotonic function in . The functional maps to as follows:
[TABLE]
We want to derive a linear expansion of where as the sample size and is some perturbation function.
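For intuition, the rearrangement operator can be sketched on a grid. The displayed definition of the functional is not legible in this copy, so the sketch assumes the standard form from Chernozhukov et al. (2010): a curve Q on [0, 1] is mapped to F(y) = ∫ 1{Q(u) ≤ y} du, whose inverse is the monotone rearrangement of Q. On a uniform grid this amounts to sorting the values of Q. The curve, the perturbation, and the function names below are hypothetical.

```python
import numpy as np

def distribution_from_curve(q_vals, y):
    # Grid approximation of F(y) = integral over [0, 1] of 1{Q(u) <= y} du,
    # with q_vals holding Q evaluated on a uniform grid of u.
    y = np.atleast_1d(y)
    return np.mean(q_vals[:, None] <= y[None, :], axis=0)

def rearrange(q_vals):
    # Monotone rearrangement on a uniform grid: sort the values of Q.
    return np.sort(q_vals)

m = 1000
u = (np.arange(m) + 0.5) / m
q_true = u ** 2                              # a monotone target curve (hypothetical)
q_noisy = q_true + 0.05 * np.sin(40.0 * u)   # non-monotone perturbed estimate
q_rearr = rearrange(q_noisy)

err_noisy = np.sqrt(np.mean((q_noisy - q_true) ** 2))
err_rearr = np.sqrt(np.mean((q_rearr - q_true) ** 2))
F_vals = distribution_from_curve(q_noisy, np.linspace(-0.1, 1.1, 25))

print("L2 error before rearrangement:", err_noisy)
print("L2 error after rearrangement: ", err_rearr)
```

When the target is monotone, sorting cannot increase the L2 error on the grid, which is the finite-grid counterpart of the improvement property in Chernozhukov et al. (2010). The propositions in this appendix concern the first-order behavior of this map under a vanishing perturbation, which the sketch does not attempt to verify.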
Assumption 7
1. is twice differentiable w.r.t. with both derivatives bounded. In addition, for some positive constant , uniformly over .
2. There exist two vanishing sequences and such that
[TABLE]
[TABLE]
The following proposition extends the first part of Proposition 2 in Chernozhukov et al. (2010).
Proposition C.1
Let , , and . If Assumption 7 holds, then
[TABLE]
uniformly over .
Proof. Consider and denote as . Note that
[TABLE]
Let . For fixed , if , by Assumption 7,
[TABLE]
Then for any , there exists such that if , and
[TABLE]
If , then there exists such that for ,
[TABLE]
Furthermore, by Assumption 7,
[TABLE]
Therefore,
[TABLE]
and
[TABLE]
where the equality follows by the change of variables: , , and is the image of . By (C.1) and Assumption 7.2, is nested by for sufficiently large. In addition, since uniformly over , for ,
[TABLE]
Then the r.h.s. of (C.2) is bounded from above by
[TABLE]
where . Since is arbitrary, by letting , we obtain that
[TABLE]
Similarly, we can show that
[TABLE]
Therefore, we have proved that
[TABLE]
Since the above result holds for any sequence of , by Lemma 1 of Chernozhukov et al. (2010), we have that uniformly over ,
[TABLE]
This completes the proof of the proposition.
Let and be a monotonic function and its inverse w.r.t. , respectively. Next, we consider the linear expansion of the inverse functional:
[TABLE]
where as the sample size and is some perturbation function.
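The first-order behavior of the inverse functional can be illustrated numerically. The sketch below relies on the standard implicit-function expansion, under which the inverse of F + s·h at level tau is approximately F^{-1}(tau) − s·h(F^{-1}(tau)) / f(F^{-1}(tau)), with f the derivative of F. Whether this matches the exact statement of Proposition C.2 cannot be confirmed from this copy, and the choices of `F`, `h`, `tau`, and the step sizes are hypothetical.

```python
import math

def F(y):
    # Hypothetical increasing function on [0, 2].
    return 1.0 - math.exp(-y)

def f(y):
    # Derivative of F, bounded away from zero on [0, 2].
    return math.exp(-y)

def h(y):
    # Hypothetical smooth perturbation direction.
    return math.sin(y)

def invert(g, target, lo, hi, tol=1e-12):
    # Bisection for an increasing function g with g(lo) < target < g(hi).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

tau = 0.5
q0 = invert(F, tau, 0.0, 2.0)

results = []
for s in (1e-2, 1e-3, 1e-4):
    q_s = invert(lambda y: F(y) + s * h(y), tau, 0.0, 2.0)
    linear = q0 - s * h(q0) / f(q0)  # first-order expansion of the inverse
    results.append((s, abs(q_s - linear)))

for s, err in results:
    print(f"s = {s:.0e}, expansion error = {err:.3e}")
```

The expansion error shrinks at roughly the rate s squared, which is what a valid linear expansion with a second-order remainder predicts for smooth F and h.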
Assumption 8
1. has a compact support . Denote , , , and as a compact subset of , , the projection of on , and the lower bound of , respectively. Then for any , and .
2. is monotonic and twice continuously differentiable w.r.t. . The first and second derivatives are denoted as and respectively. Then both and are bounded and is also bounded away from zero, uniformly over .
3. Let . Then, there exist two vanishing sequences and such that
[TABLE]
[TABLE]
Proposition C.2
If Assumption 8 holds, then
[TABLE]
uniformly over .
Proof. Without loss of generality, we assume is monotonically increasing in . Let and Since for sufficiently large, and by the definition of , we can choose and . In addition, since is differentiable, we have . Denote . Then, the definition of the inverse function implies that
[TABLE]
Since is bounded uniformly in , we have
[TABLE]
and
[TABLE]
Therefore, (C.3) implies that
[TABLE]
Since is bounded and bounded away from zero, we have
[TABLE]
Then,
[TABLE]
where the supremum in the second line is taken over , , and the third line is because is bounded uniformly in .
On the other hand, by (C.3),
[TABLE]
Therefore, we have
[TABLE]
Similarly, we can show that
[TABLE]
The r.h.s. of (C.3) implies that
[TABLE]
Therefore,
[TABLE]
[TABLE]
uniformly over .
Appendix D Additional Simulation Results
This section investigates the sensitivity of the bootstrap confidence intervals to the tuning parameters , , and , reports the finite sample performance of the oracle estimator and of the estimator for the mean potential outcomes, and illustrates a limitation of our method.
D.1 Sensitivity Analysis
We check the sensitivity of our estimation method with respect to three tuning parameters: , , and . We focus on the first design in Section 5. Figures 12 and 13 show the coverage probabilities of and with and , respectively. Figures 14 and 15 show the coverage probabilities of and with and , respectively, where is the penalty used to estimate the conditional density . Last, Figures 16 and 17 show the coverage probabilities of and with and , respectively, where is the penalty used to estimate the conditional CDF . We observe that the coverage probabilities are in general not sensitive to the choice of tuning parameters.
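As a reminder of how a coverage probability of this kind is computed, the following is a minimal Monte Carlo sketch. It is a toy illustration only: it uses a percentile bootstrap interval for a sample mean under a normal design, not the paper's estimators, simulation designs, or bootstrap procedure, and all constants (`n`, `reps`, `B`, `true_mean`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(12345)
n, reps, B = 100, 300, 200      # sample size, Monte Carlo replications, bootstrap draws
level = 0.90                    # nominal coverage level
alpha = 1.0 - level
true_mean = 1.0                 # parameter the interval should cover

covered = 0
for _ in range(reps):
    x = rng.normal(loc=true_mean, scale=2.0, size=n)
    # Percentile bootstrap: resample with replacement and recompute the mean.
    idx = rng.integers(0, n, size=(B, n))
    boot_means = x[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2.0, 1.0 - alpha / 2.0])
    covered += int(lo <= true_mean <= hi)

coverage = covered / reps
print("empirical coverage:", coverage)
```

A sensitivity check in this spirit repeats the loop for each tuning-parameter value and compares the resulting coverage rates with the nominal level.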
D.2 Oracle Estimators
Next, we show the coverage probabilities for the oracle estimators in which the infinite-dimensional nuisance parameters are assumed to be known.
We see that the coverage rates for the oracle estimators are conservative, which is due to the way we construct the confidence intervals. However, we can also see that for some values of , the coverage rates are still very close to the nominal level 90% and most coverage rates do not exceed 95%.
D.3 The Mean of the Potential Outcome
We report the finite sample performance for the estimators for the mean of the potential outcome for .
We observe that the estimators are quite accurate in terms of bias and variance. The coverage rates are reasonable for in general. However, they are below the nominal rate when is close to and . Comparing with the oracle results reported below, we see that the drop of coverage rates is mainly due to the variable selection, which has a larger effect for that is closer to the boundary.
D.4 An Additional Design
Last, we consider a design that violates the approximate sparsity condition. The outcome and treatment equations are the same as (5.1) and (5.2), respectively. We let for , , , and . In this case, . Recall that we have . However, our theory requires that . Such a condition is violated in this design.
We see that the coverage rates when are satisfactory. For and , the coverage rates are below the nominal 90%. On the other hand, the oracle estimators reported below perform quite well in terms of coverage. This implies that the drop in coverage rates for our estimators is mainly due to the variable selection, which may have a larger effect when is away from the center. (Again, the cross-fitting technique promoted in Chernozhukov et al. (2018) may be helpful for eliminating the variable selection bias.)
Appendix E Additional Empirical Illustration Results
This section investigates the sensitivity of our empirical application results with respect to three tuning parameters: , , and . We use the same model and dataset as in Section 6. Figures 33-38 are about the white individuals, and Figures 39-44 are about the black individuals. The captions for these figures are the same as in Figures 10 and 11. Figures 33 and 34 show the estimation results for and with and , respectively. Figures 35 and 36 show the estimation results for and with and , respectively, where is the penalty used to estimate the conditional density . Last, Figures 37 and 38 show the estimation results for and with and , respectively, where is the penalty used to estimate the conditional CDF .
E.1 Sensitivity results for the white individuals
E.2 Sensitivity results for the black individuals
References
- Altonji, J. G., Matzkin, R. L., 2005. Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica 73 (4), 1053–1102.
- Athey, S., Imbens, G., 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113 (27), 7353–7360.
- Begun, J. M., Hall, W., Huang, W.-M., Wellner, J. A., 1983. Information and asymptotic efficiency in parametric-nonparametric models. The Annals of Statistics 11 (2), 432–452.
- Belloni, A., Chen, D., Chernozhukov, V., Hansen, C., 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 (6), 2369–2429.
- Belloni, A., Chen, M., Chernozhukov, V., 2016. Quantile graphical models: prediction and conditional independence with applications to financial risk management. arXiv:1607.00286.
- Belloni, A., Chernozhukov, V., 2011. $\ell_1$-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics 39 (1), 82–130.
- Belloni, A., Chernozhukov, V., Chetverikov, D., Wei, Y., 2018a. Uniformly valid post-regularization confidence regions for many functional parameters in Z-estimation framework. The Annals of Statistics 46 (6B), 3643–3675.
- Belloni, A., Chernozhukov, V., Fernández-Val, I., Hansen, C., 2017a. Program evaluation with high-dimensional data. Econometrica 85 (1), 233–298.
