Targeted Undersmoothing
Christian Hansen, Damian Kozbur, Sanjog Misra

TL;DR
This paper introduces targeted undersmoothing, a post-model selection inference method that constructs valid confidence sets for complex functionals in high-dimensional models, demonstrated through empirical examples and simulations.
Contribution
It presents a novel inference procedure for high-dimensional models that accounts for model selection uncertainty, especially for dense functionals.
Findings
Effective in estimating heterogeneous treatment effects.
Provides valid confidence sets in high-dimensional settings.
Shows good finite sample performance in simulations.
Abstract
This paper proposes a post-model selection inference procedure, called targeted undersmoothing, designed to construct uniformly valid confidence sets for a broad class of functionals of sparse high-dimensional statistical models. These include dense functionals, which may potentially depend on all elements of an unknown high-dimensional parameter. The proposed confidence sets are based on an initially selected model and two additionally selected models, an upper model and a lower model, which enlarge the initially selected model. We illustrate application of the procedure in two empirical examples. The first example considers estimation of heterogeneous treatment effects using data from the Job Training Partnership Act of 1982, and the second example looks at estimating profitability from a mailing strategy based on estimated heterogeneous treatment effects in a direct mail marketing…
| Estimator | W-statistic | df | p-value |
|---|---|---|---|
| OLS | 679.14 | 313 | 0.0000 |
| PL | 17.1444 | 7 | 0.0088 |
| TU(1) | 16.4910 | 7 | 0.0210 |
| TU(2) | 15.9709 | 7 | 0.0254 |
| TU(3) | 15.5022 | 7 | 0.0301 |
| TU(4) | 15.0803 | 7 | 0.0350 |
| TU(5) | 14.7097 | 7 | 0.0399 |
| TU(6) | 14.4253 | 7 | 0.0441 |
| TU(7) | 14.1517 | 7 | 0.0485 |
| TU(8) | 13.9339 | 7 | 0.0524 |
| TU(9) | 13.5463 | 7 | 0.0599 |
| TU(10) | 13.3584 | 7 | 0.0638 |
| Estimator | W-statistic | df | p-value |
|---|---|---|---|
| PL | 20.6884 | 9 | 0.0141 |
| TU(1) | 19.4059 | 10 | 0.0354 |
| TU(2) | 18.1018 | 10 | 0.0533 |
| TU(3) | 17.5105 | 10 | 0.0638 |
| TU(4) | 16.8746 | 10 | 0.0772 |
| TU(5) | 16.3060 | 10 | 0.0912 |
| TU(6) | 15.7466 | 10 | 0.1071 |
| TU(7) | 15.2801 | 10 | 0.1222 |
| TU(8) | 14.8188 | 10 | 0.1388 |
| TU(9) | 14.3024 | 10 | 0.1596 |
| TU(10) | 13.9031 | 10 | 0.1775 |
| Estimator | W-statistic | df | p-value |
|---|---|---|---|
| OLS | 1865.7525 | 1069 | 0.000 |
| PL | 692.4930 | 45 | 0.000 |
| TU(1) | 685.5655 | 45 | 0.000 |
| TU(2) | 680.9011 | 45 | 0.000 |
| TU(3) | 678.0659 | 45 | 0.000 |
| TU(4) | 675.3192 | 45 | 0.000 |
| TU(5) | 672.9171 | 45 | 0.000 |
| TU(6) | 671.3020 | 45 | 0.000 |
| TU(7) | 669.6907 | 45 | 0.000 |
| TU(8) | 668.4609 | 45 | 0.000 |
| TU(9) | 667.4802 | 45 | 0.000 |
| TU(10) | 666.4816 | 45 | 0.000 |
| True | All | Double | Lasso | PL | LCV | ZB | TU(1) | TU(10) | |
| A. RegCoef | |||||||||
| Bias | 0.04 | 0.05 | 0.09 | -0.13 | -0.19 | -0.33 | |||
| Std. Dev. | 0.68 | 0.79 | 0.62 | 0.11 | 0.37 | 0.58 | |||
| RMSE | 0.68 | 0.79 | 0.63 | 0.16 | 0.41 | 0.66 | |||
| Coverage | 0.91 | 0.91 | 0.93 | 0.14 | 0.10 | 0.54 | 0.76 | 0.93 | 0.97 |
| Int. Length | 2.46 | 2.70 | 2.26 | 0.28 | 0.33 | 1.42 | 0.98 | 1.97 | 3.86 |
| B. TE | |||||||||
| Bias | 0.01 | -0.00 | 0.27 | -0.01 | -0.00 | ||||
| Std. Dev. | 0.24 | 1.57 | 0.15 | 0.30 | 0.35 | ||||
| RMSE | 0.25 | 1.57 | 0.31 | 0.30 | 0.35 | ||||
| Coverage | 0.91 | 0.94 | 0.56 | 0.76 | 0.94 | 0.95 | 0.98 | 1.00 | |
| Int. Length | 0.88 | 5.74 | 0.67 | 0.65 | 1.49 | 3.41 | 1.76 | 5.44 | |
| C. PI | |||||||||
| Bias | 0.01 | 0.32 | -0.14 | -0.01 | -0.05 | ||||
| Std. Dev. | 0.06 | 0.07 | 0.01 | 0.08 | 0.06 | ||||
| RMSE | 0.06 | 0.33 | 0.14 | 0.08 | 0.08 | ||||
| Coverage | 0.95 | 0.00 | 0.06 | 0.81 | 0.82 | 0.94 | 1.00 | ||
| Int. Length | 0.26 | 0.27 | 0.02 | 0.22 | 0.22 | 0.30 | 0.45 | ||
| True | All | Double | Lasso | PL | LCV | ZB | TU(1) | TU(10) | |
| A. RegCoef | |||||||||
| Bias | 0.04 | 0.01 | 0.84 | -0.09 | -0.08 | -0.14 | |||
| Std. Dev. | 0.63 | 0.74 | 0.67 | 0.02 | 0.18 | 0.55 | |||
| RMSE | 0.63 | 0.74 | 1.07 | 0.09 | 0.19 | 0.57 | |||
| Coverage | 0.94 | 0.92 | 0.67 | 0.02 | 0.01 | 0.64 | 0.79 | 0.99 | 1.00 |
| Int. Length | 2.25 | 2.61 | 2.33 | 0.04 | 0.03 | 1.51 | 1.22 | 2.12 | 4.25 |
| B. TE | |||||||||
| Bias | 0.02 | 0.01 | 0.12 | 0.13 | 0.13 | ||||
| Std. Dev. | 0.21 | 1.57 | 0.12 | 0.27 | 0.45 | ||||
| RMSE | 0.21 | 1.57 | 0.17 | 0.30 | 0.47 | ||||
| Coverage | 0.94 | 0.92 | 0.87 | 0.76 | 0.97 | 0.91 | 0.99 | 1.00 | |
| Int. Length | 0.78 | 5.79 | 0.56 | 0.58 | 1.88 | 19.57 | 2.16 | 6.72 | |
| C. PI | |||||||||
| Bias | 0.02 | 0.31 | -0.09 | -0.07 | -0.02 | ||||
| Std. Dev. | 0.10 | 0.10 | 0.11 | 0.11 | 0.11 | ||||
| RMSE | 0.10 | 0.33 | 0.14 | 0.13 | 0.11 | ||||
| Coverage | 0.95 | 0.06 | 0.86 | 0.87 | 0.90 | 0.95 | 1.00 | ||
| Int. Length | 0.40 | 0.36 | 0.44 | 0.43 | 0.39 | 0.50 | 0.74 | ||
| True | All | Double | Lasso | PL | LCV | ZB | TU(1) | TU(10) | |
| A. RegCoef | |||||||||
| Bias | 0.05 | 0.04 | 0.46 | -0.07 | -0.07 | -0.12 | |||
| Std. Dev. | 0.57 | 0.71 | 0.56 | 0.00 | 0.04 | 0.37 | |||
| RMSE | 0.58 | 0.71 | 0.73 | 0.07 | 0.08 | 0.39 | |||
| Coverage | 0.92 | 0.92 | 0.82 | 0.00 | 0.00 | 0.47 | 0.83 | 0.99 | 1.00 |
| Int. Length | 2.04 | 2.44 | 2.05 | 0.00 | 0.01 | 0.85 | 1.50 | 1.36 | 3.86 |
| B. TE | |||||||||
| Bias | 0.03 | -0.04 | -0.38 | -0.51 | -0.52 | ||||
| Std. Dev. | 0.41 | 1.60 | 0.15 | 0.32 | 0.45 | ||||
| RMSE | 0.41 | 1.60 | 0.41 | 0.60 | 0.69 | ||||
| Coverage | 0.91 | 0.92 | 0.26 | 0.14 | 0.73 | 0.92 | 0.91 | 1.00 | |
| Int. Length | 1.41 | 5.73 | 0.62 | 0.60 | 1.83 | 56.53 | 2.28 | 6.84 | |
| C. PI | |||||||||
| Bias | 0.04 | 0.34 | -0.12 | -0.08 | -0.03 | ||||
| Std. Dev. | 0.06 | 0.08 | 0.01 | 0.06 | 0.07 | ||||
| RMSE | 0.07 | 0.35 | 0.12 | 0.10 | 0.07 | ||||
| Coverage | 0.94 | 0.00 | 0.07 | 0.44 | 0.74 | 0.94 | 1.00 | ||
| Int. Length | 0.25 | 0.29 | 0.02 | 0.12 | 0.19 | 0.32 | 0.54 | ||
| True | All | Double | Lasso | PL | LCV | ZB | TU(1) | TU(10) | |
| A. RegCoef | |||||||||
| Bias | -0.04 | 0.04 | -0.12 | -0.19 | -0.44 | ||||
| Std. Dev. | 0.69 | 0.64 | 0.12 | 0.38 | 0.47 | ||||
| RMSE | 0.69 | 0.64 | 0.17 | 0.42 | 0.65 | ||||
| Coverage | 0.92 | 0.92 | 0.12 | 0.08 | 0.42 | 0.76 | 0.91 | 0.96 | |
| Int. Length | 2.43 | 2.22 | 0.23 | 0.28 | 1.09 | 1.01 | 1.86 | 4.21 | |
| B. TE | |||||||||
| Bias | -0.01 | 0.26 | -0.03 | -0.07 | |||||
| Std. Dev. | 0.24 | 0.15 | 0.29 | 0.34 | |||||
| RMSE | 0.24 | 0.30 | 0.29 | 0.35 | |||||
| Coverage | 0.94 | 0.60 | 0.76 | 0.98 | 0.91 | 0.99 | 1.00 | ||
| Int. Length | 0.87 | 0.67 | 0.63 | 1.86 | 2.21 | 2.12 | 7.92 | ||
| C. PI | |||||||||
| Bias | 0.00 | -0.14 | -0.02 | -0.07 | |||||
| Std. Dev. | 0.07 | 0.01 | 0.08 | 0.06 | |||||
| RMSE | 0.07 | 0.14 | 0.09 | 0.09 | |||||
| Coverage | 0.94 | 0.04 | 0.77 | 0.72 | 0.92 | 1.00 | |||
| Int. Length | 0.26 | 0.02 | 0.21 | 0.21 | 0.30 | 0.52 | |||
| True | All | Double | Lasso | PL | LCV | ZB | TU(1) | TU(10) | |
| A. RegCoef | |||||||||
| Bias | -0.03 | 0.78 | -0.09 | -0.08 | -0.15 | ||||
| Std. Dev. | 0.65 | 0.68 | 0.01 | 0.12 | 0.43 | ||||
| RMSE | 0.65 | 1.03 | 0.09 | 0.15 | 0.45 | ||||
| Coverage | 0.93 | 0.72 | 0.02 | 0.01 | 0.55 | 0.77 | 1.00 | 1.00 | |
| Int. Length | 2.25 | 2.34 | 0.02 | 0.03 | 1.13 | 1.26 | 2.04 | 4.68 | |
| B. TE | |||||||||
| Bias | -0.01 | 0.11 | 0.13 | 0.15 | |||||
| Std. Dev. | 0.22 | 0.12 | 0.25 | 0.44 | |||||
| RMSE | 0.22 | 0.17 | 0.28 | 0.46 | |||||
| Coverage | 0.93 | 0.87 | 0.74 | 0.99 | 0.98 | 1.00 | 1.00 | ||
| Int. Length | 0.77 | 0.56 | 0.58 | 2.27 | 24.82 | 2.66 | 9.98 | ||
| C. PI | |||||||||
| Bias | 0.01 | -0.09 | -0.08 | -0.04 | |||||
| Std. Dev. | 0.10 | 0.11 | 0.11 | 0.11 | |||||
| RMSE | 0.10 | 0.15 | 0.13 | 0.12 | |||||
| Coverage | 0.95 | 0.85 | 0.88 | 0.89 | 0.95 | 1.00 | |||
| Int. Length | 0.40 | 0.44 | 0.43 | 0.40 | 0.51 | 0.85 | |||
| True | All | Double | Lasso | PL | LCV | ZB | TU(1) | TU(10) | |
| A. RegCoef | |||||||||
| Bias | -0.02 | 0.40 | -0.06 | -0.06 | -0.10 | ||||
| Std. Dev. | 0.55 | 0.54 | 0.00 | 0.00 | 0.22 | ||||
| RMSE | 0.55 | 0.67 | 0.06 | 0.06 | 0.24 | ||||
| Coverage | 0.94 | 0.88 | 0.00 | 0.00 | 0.35 | 0.73 | 0.98 | 0.99 | |
| Int. Length | 2.01 | 2.04 | 0.00 | 0.00 | 0.50 | 1.60 | 1.33 | 4.15 | |
| B. TE | |||||||||
| Bias | 0.00 | -0.38 | -0.51 | -0.70 | |||||
| Std. Dev. | 0.39 | 0.14 | 0.30 | 0.41 | |||||
| RMSE | 0.39 | 0.41 | 0.60 | 0.81 | |||||
| Coverage | 0.93 | 0.26 | 0.17 | 0.77 | 0.97 | 0.94 | 1.00 | ||
| Int. Length | 1.36 | 0.63 | 0.60 | 2.19 | 71.83 | 2.76 | 9.97 | ||
| C. PI | |||||||||
| Bias | 0.04 | -0.12 | -0.08 | -0.06 | |||||
| Std. Dev. | 0.06 | 0.01 | 0.05 | 0.06 | |||||
| RMSE | 0.07 | 0.12 | 0.10 | 0.08 | |||||
| Coverage | 0.93 | 0.05 | 0.40 | 0.63 | 0.93 | 1.00 | |||
| Int. Length | 0.24 | 0.02 | 0.12 | 0.19 | 0.33 | 0.61 | |||
Abstract
This paper proposes a post-model selection inference procedure, called targeted undersmoothing, designed to construct uniformly valid confidence sets for functionals of sparse high-dimensional models, including dense functionals that may depend on many or all elements of the high-dimensional parameter vector. The confidence sets are based on an initially selected model and two additional models which enlarge the initial model. By varying the enlargements of the initial model, one can also conduct sensitivity analysis of the strength of empirical conclusions to model selection mistakes in the initial model. We apply the procedure in two empirical examples: estimating heterogeneous treatment effects in a job training program and estimating profitability from an estimated mailing strategy in a marketing campaign. We also illustrate the procedure’s performance through simulation experiments.
JEL Codes: C12, C51, C55
Keywords: model selection, sparsity, dense functionals, hypothesis testing, sensitivity analysis
The University of Chicago, Booth School of Business, 5807 S. Woodlawn, Chicago, IL 60637
University of Zürich, Department of Economics, Schönberggasse 1, 8001 Zürich
The University of Chicago, Booth School of Business, 5807 S. Woodlawn, Chicago, IL 60637
1 Introduction
Large, complex data sets, often described by the moniker "big," have opened new avenues for empirical work in economics and the social sciences. These data can be extremely rich in the sense that they contain information on a large number of variables for each observation. Such high-dimensional settings offer many opportunities for empirical researchers to analyze complex phenomena but pose practical and theoretical problems because of the presence of a large number of explanatory variables. (Footnote 1: Formally, a high-dimensional setting is an asymptotic frame for a sequence of statistical models where the number of unknown parameters grows at least as quickly as the sample size.)
One of the challenges created by data with many available covariates is the specification of the statistical model. With many available predictors, it is easy to specify a highly complex model with many parameters to be estimated. Unfortunately, a statistical model with too many parameters is likely to overfit, resulting in both poor out-of-sample predictive performance and poor statistical inference about functionals that depend on the true parameters of the model. For example, informative inference about parameters in a linear regression model is impossible when the number of explanatory variables exceeds the sample size unless one is willing to impose additional model structure.
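The overfitting problem described above can be seen in a few lines. The following is a minimal sketch under purely illustrative assumptions (Gaussian covariates, an outcome that is pure noise, and the sizes `n = 20` and `p = 50`, none of which come from the paper): with more regressors than observations, least squares reproduces the sample exactly even though there is nothing to learn, and the fit does not carry over to new data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # illustrative sizes: more regressors than observations
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)         # pure noise: the outcome is unrelated to X

# Minimum-norm least-squares solution; with p > n and full row rank it
# interpolates the sample, so the in-sample residuals are numerically zero.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
in_sample_mse = float(np.mean((y - X @ theta_hat) ** 2))

# A fresh draw from the same design shows that the perfect fit is spurious.
X_new = rng.standard_normal((n, p))
y_new = rng.standard_normal(n)
out_sample_mse = float(np.mean((y_new - X_new @ theta_hat) ** 2))

print(f"in-sample MSE:     {in_sample_mse:.2e}")
print(f"out-of-sample MSE: {out_sample_mse:.2f}")
```

Because the in-sample residuals vanish, the usual residual-based variance estimate is uninformative here, which is one way to see why additional structure is needed before inference is possible.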
Regularization, meaning constraining the estimated model to avoid perfectly fitting the sample data, is therefore required for building a useful high-dimensional model. Ad hoc regularization by specifying a low-dimensional parametric model is commonly employed in empirical applications. There are also a variety of formal regularization devices that provably control overfitting and produce high-quality forecasts under sensible conditions. However, regularization may also lead to regularization bias and underfitting (fitting a model which misses important features of the phenomenon under study), which also results in poor predictive performance and invalid inference about population objects of interest. For a systematic overview of high-dimensional methods and related issues, see **[34]**.
A popular regularizing structure in the statistics and econometrics literature is sparsity; see, for a general reference, **[16]**. Sparsity is a general term for an assumption which states that the true model depends only on a small subset of the unknown parameters. An example is the sparse linear regression model which is characterized by having many covariates, most of which have zero coefficients. A sparse estimator is an estimator which returns a model in which only a small number of estimated parameters are nonzero. There are a variety of sensible sparse estimators in the literature. Leading examples are $\ell_1$-penalized methods such as the lasso estimator of **[32]** and **[47]**. (Footnote 2: Alternatives to the lasso estimator with similar properties include the Dantzig selector (see [19]), forward stepwise regression (see [54], [57], [48], [28], [37]), SCAD (see [30]), and many others.) Many $\ell_1$-penalized methods and related methods have been shown to have good estimation properties with i.i.d. data even when perfect variable selection is not feasible; see, e.g., **[19]**, **[42]**, **[14]**, **[35]**, and the references therein. Results for methods beyond simple i.i.d. data structures also suggest that this type of regularization has fairly general applicability. (Footnote 3: See, for instance, [9] and [11].) Lasso is also useful as an input into a post-model selection estimator where statistical estimation is performed using a model selected through some statistical device; examples include [10], [9], [13], [11], [12]. This paper studies constructing inferential quantities, such as confidence intervals, for functionals of unknown model parameters in high-dimensional settings under sparsity assumptions.
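To make the notions of a sparse estimator and a post-model selection estimator concrete, the following is a minimal sketch rather than the implementation used in any of the cited papers. It computes the lasso by cyclic coordinate descent (a standard algorithm, chosen here only because it needs nothing beyond NumPy) and then refits least squares on the selected variables. The function names, the simulated design, and the penalty level `lam` are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n))||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n     # per-column mean squares
    r = y.astype(float).copy()            # running residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]           # take coordinate j out of the fit
            rho = X[:, j] @ r / n         # partial correlation with the residual
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]           # put the updated coordinate back
    return b

# Illustrative sparse design: p > n, only three nonzero coefficients.
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:3] = [2.0, -1.5, 1.0]
y = X @ theta0 + rng.standard_normal(n)

# Penalty of the usual sqrt(log(p)/n) order, assuming unit noise variance.
lam = 1.1 * np.sqrt(2 * np.log(p) / n)
b_lasso = lasso_cd(X, y, lam)
S_hat = np.flatnonzero(b_lasso)           # selected model

# Post-lasso: unpenalized least squares on the selected columns only.
b_post = np.zeros(p)
b_post[S_hat] = np.linalg.lstsq(X[:, S_hat], y, rcond=None)[0]
print("selected variables:", S_hat.tolist())
```

The refit removes the shrinkage that the penalty imposes on the selected coefficients, but it inherits any selection mistakes made in the first step, which is the issue the rest of the paper addresses.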
The use of regularization is problematic for statistical inference and construction of confidence sets. Confidence intervals for parameters in models which are estimated with a regularized estimator can have extremely distorted coverage probabilities if the regularization is not explicitly taken into account. Heuristically, the problem with inference arises because the regularized model may not be the true model, e.g. there may be model selection mistakes, which introduces an additional source of uncertainty. The difficulty of performing inference following regularized estimation has been documented formally by **[39]** and **[44]** among others. As a result, development of valid post-regularization inferential procedures is an important area of current research.
A leading case for which positive results regarding construction of uniformly valid inferential statements after regularization are available is for inference about low-dimensional sets of pre-specified coefficients in sparse linear regression models. Methods available in this setting include post-double selection, as in **[13]**, or debiasing, as in **[49]** and **[56]**. In each of these cases, the model of interest is given by
$$y_i = x_i'\theta_0 + \varepsilon_i, \qquad i = 1, \dots, n,$$
where $i$ indexes observations, $n$ denotes sample size, $y_i$ is an outcome, $x_i$ are covariates, $\varepsilon_i$ are idiosyncratic disturbance terms, and $\theta_0$ is an unknown parameter to be estimated with $\dim(\theta_0) = p$ and $p \gg n$. The goal in these papers is then to construct a confidence interval for the simple linear functional
$$a(\theta_0) = [\theta_0]_1,$$
where $[\,\cdot\,]_1$ denotes the first component of a vector. (Footnote 4: Approaches in this setting can easily be extended to accommodate the case where the object of interest is a known, small finite-dimensional subset of the full parameter vector.) Such inferential results have been extended to various settings, including panel data (see **[11]**), various nonparametric settings (see **[12]**, **[36]**), settings with generalized linear models (see **[31]**, **[12]**), and quantile regression (see **[12]**). The ideas in **[13]** can also be generalized to estimation of parameters defined by moment conditions whenever appropriate sparsity conditions hold and Neyman orthogonalizations of the moment conditions are available; see for example **[26]**, **[22]**, and references therein. In addition, **[25]**, **[24]** and **[29]** describe how bootstrapping can be used in conjunction with some of the previously cited techniques. These bootstrapping techniques also allow control of family-wise error rates for a large number of hypothesis tests. It is worth noting that in all of these procedures, in addition to sparsity in the equation of interest, additional assumptions regarding sparsity of the relationships between the covariates are required.
The purpose of this paper is to propose and analyze a simple post-model-selection inferential procedure, targeted undersmoothing, which is applicable for inference about defined by a general class of functionals
[TABLE]
under a single sparsity condition on the model of interest with data generating process . Importantly, the class of functionals we consider may be dense, in the sense that they depend non-trivially on the entire high-dimensional parameter vector, may depend on the process generating the observations , and may correspond to objects that are not estimable. Examples of such functionals are (i) the conditional mean of at a particular point , , in a linear model and (ii) a heterogeneous treatment effect for an individual given a high-dimensional vector of characteristics of that individual. More generally, the approach we propose provides a procedure that may be used to obtain inferential statements about a large class of functionals that are of interest to economists such as marginal effects, elasticities, and counterfactual quantities of interest such as profits and welfare.
Our proposal is to form confidence sets for by starting with a typical confidence interval obtained from an initially selected model and then systematically enlarging the interval by perturbing the model to account for possible model selection mistakes. More formally, our proposed confidence set is constructed as the union of standard statistical confidence sets based on the convex hull of , where denotes a confidence region for based on a model under the assumption that is the correct model. and are in turn models selected from the data based on
1. An initially selected model chosen via a standard method targeting model fit to the data.
2. Two additionally selected models: an upper model and a lower model, chosen by respectively targeting worst-case upper and lower bounds on the functional of interest that can be achieved by small augmentations to the model .
In practice, the initial model selection is performed with a standard high-dimensional estimator like lasso. The subsequent model selection steps depend on the functional of interest and target the behavior of that functional accommodating model selection mistakes made in the first step. The subsequent steps are important since mistakes are inherent to all model selection procedures unless unrealistic conditions are imposed on the formal setting. (Footnote 5: Such conditions include $\beta$-min conditions, which assert that nonzero unknown parameters must be bounded uniformly away from zero in absolute value.) In this paper, when discussing model selection mistakes, we mean variables such that . Note that model selection mistakes are captured by the set . We let denote and denote . We make the strong but important assumption that the researcher has a known upper bound, , on the number of possible model selection mistakes, .
We note that the properties of for a given model selection procedure like lasso may be difficult to calculate. A second option for choosing exists when a researcher is willing to assume a value for but is unwilling to make assumptions about . In this case, a simple and valid choice for is . Note that by construction, which immediately gives . (Footnote 6: In practice, a situation could easily arise where if is taken to be a bound on . This situation can occur because typical bounds on the behavior of lasso imply that and not necessarily that ; see [14] and other references on lasso cited above. In light of this possibility, bounds on may be more desirable in practice even though such bounds depend on random quantities.)
When constructing and as above, the two conditions and are enforced. Enforcing these conditions ensures that all involved selected sets are relatively sparse, which is important for good performance in practice, and that, in theory, the second round of selection is sufficient to capture any selection mistakes made in the first step and capture the true model.
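The construction described above can be sketched for the simplest case of a linear functional and a bound of one selection mistake. This is a simplified illustration under stated assumptions, not the authors' algorithm: the functional is taken to be `x0 @ theta`, intervals use homoskedastic OLS standard errors with a normal critical value, the upper and lower models are found by trying every single-variable augmentation of the initial model, and all names (`ols_interval`, `tu_interval_one_step`, `S0`, `x0`) are hypothetical.

```python
import numpy as np

def ols_interval(X, y, cols, x0, z=1.96):
    """Homoskedastic OLS interval for x0' theta, fitted on the columns in `cols`."""
    cols = sorted(cols)
    XS, x0S = X[:, cols], x0[cols]
    n, k = XS.shape
    gram_inv = np.linalg.inv(XS.T @ XS)
    coef = gram_inv @ (XS.T @ y)
    resid = y - XS @ coef
    sigma2 = (resid @ resid) / (n - k)
    est = float(x0S @ coef)
    se = float(np.sqrt(sigma2 * (x0S @ gram_inv @ x0S)))
    return est - z * se, est + z * se

def tu_interval_one_step(X, y, S0, x0):
    """Allow one selection mistake: try every single-variable augmentation
    of S0 and keep the most extreme lower and upper endpoints."""
    lo, hi = ols_interval(X, y, S0, x0)
    S_low, S_up = set(S0), set(S0)
    for j in range(X.shape[1]):
        if j in S0:
            continue
        lo_j, hi_j = ols_interval(X, y, set(S0) | {j}, x0)
        if hi_j > hi:
            hi, S_up = hi_j, set(S0) | {j}     # candidate upper model
        if lo_j < lo:
            lo, S_low = lo_j, set(S0) | {j}    # candidate lower model
    return lo, hi, S_low, S_up

# Illustrative data: three relevant variables, of which the initial model has two.
rng = np.random.default_rng(1)
n, p = 120, 40
X = rng.standard_normal((n, p))
theta = np.zeros(p)
theta[:3] = [1.0, 0.5, 0.25]
y = X @ theta + rng.standard_normal(n)
x0 = np.ones(p)      # dense functional: loads on every coefficient
S0 = {0, 1}          # pretend the initial selector missed variable 2

naive = ols_interval(X, y, S0, x0)
lo, hi, S_low, S_up = tu_interval_one_step(X, y, S0, x0)
print("interval from initial model:", naive)
print("enlarged interval:          ", (lo, hi))
```

Reporting the interval from `lo` to `hi` amounts to taking the convex hull of the intervals from the initial, lower, and upper models, so by construction it is never narrower than the interval from the initial model alone.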
The name ‘targeted undersmoothing’ is motivated by a useful, though informal, heuristic analogy between high-dimensional estimation and nonparametric estimation. A key problem in nonparametric regression estimation is to choose a bandwidth (for kernel-based estimates) or a set of approximating functions (in series- or sieve-based methods). Sufficiently small bandwidths and more flexible sets of approximating functions each lead to undersmoothing in estimating the target function in the sense that bias may be taken to be small relative to sampling variation. Undersmoothing can thus be used to justify inference based on correctly-centered Gaussian approximations. For a review, see **[40]**. Choosing a bandwidth or set of approximating functions is not unlike choosing a penalty parameter in $\ell_1$-penalized regression where smaller values of the penalty parameter result in more complex models.
Unfortunately, simply decreasing the penalty parameter in penalized estimation of a sparse high-dimensional model does not alleviate bias in the same way as decreasing a bandwidth in a traditional kernel problem due to the complexity of the model space inherent in high-dimensional problems. Heuristically, moderate strength signals whose exclusion leads to bias are hard to pick out from among the many irrelevant variables; and as the penalty parameter is lowered beyond theoretically justified levels, it is likely that the first variables to enter the model will be irrelevant variables that happen to be moderately correlated with the outcome in the sample at hand. In this case, decreasing the penalty parameter fails to alleviate bias, because it does not bring in the variables with moderate but nonzero coefficients that were previously missed, and it simultaneously introduces a type of endogeneity bias, because the irrelevant variables that enter are precisely those with the highest correlation with the noise in the current sample. Intuitively, the targeted undersmoothing approach addresses this problem by undersmoothing in those directions that seem to be most likely to account for bias by directly focusing on the functional of interest rather than model fit.
Our paper complements several interesting papers that look at similar problems. The work in **[22]** develops general theory for a procedure for inference about a relatively low-dimensional set of prespecified target parameters when machine learning is used to estimate some features of the model under weak conditions. **[53]** study asymptotically Gaussian inference for heterogeneous treatment effects using random forests, and the ideas of **[53]** are extended to other objects of interest in **[5]**. Relative to the present work, the formal results in **[53]** and **[5]** are developed in settings with low-dimensional controls. **[6]** study estimation of heterogeneous treatment effects in conjunction with machine learning; see also **[7]**. Inference in **[6]** relies on tree-based methods and sample-splitting where part of the sample is used to learn the splitting rule for the tree and the other part of the sample is used to do inference for heterogeneous treatment effects conditional on the tree learned in the first subsample. **[4]** perform residual rebalancing to estimate average treatment effects with high dimensional control variables when regression equations are given by sparse linear models under very weak restrictions on the propensity score model that include cases where the propensity score does not have a natural sparse representation. **[18]** consider construction of confidence sets for dense functionals given by for various . Perhaps the most closely related current papers are **[58]** and **[59]**. Both **[58]** and **[59]** construct hypothesis tests for objects similar to those considered in our paper via -projections of coefficient estimates to the set of coefficients consistent with the null. **[58]** only considers linear functionals while **[59]** considers general nonlinear functionals but imposes stronger sparsity conditions than those employed below. 
We compare the performance of the tests in **[58]** to inference based on our proposed targeted undersmoothing procedure in the simulation section of this paper.
This paper also complements recent work in selective inference. Selective inference refers to inferential techniques for parameters which depend on a model . The goal is to approximate the sampling distribution of an estimated ; i.e., to approximate the distribution of an estimator conditional on the selected model . See, for instance, **[38]**, which carries out selective inference in the high-dimensional linear model in the case that is chosen using lasso. Selective inference is a sensible analytic tool for assessing uncertainty about model parameters when the selected model will be fixed and utilized for subsequent applications. Targeted undersmoothing and selective inference are designed for different objectives. Targeted undersmoothing aims to deliver inferential statements about objects of interest as defined in the population model rather than the values of these objects after conditioning on a selected model.
The need to specify is a limitation of our proposed method. However, this limitation is not unique to this paper. Approaches to undersmoothing in the traditional nonparametric literature also rely on ad hoc decisions about exactly what one means by sufficiently small bandwidth or sufficiently flexible set of approximating functions, for example. With few exceptions, high-dimensional estimators perform well under sparsity assumptions, and perform poorly when sparsity fails. (Footnote 7: See, for instance, [33], which allows more instruments than observations but does not impose sparsity in the first stage.) Furthermore, to the best of the authors’ knowledge, there are currently no reliable tests for the violation of sparsity in the statistics or econometrics literature.
Given the dependence of the proposed procedure on the ad hoc choice of , we feel that the proposed approach will be most helpful when viewed through the lens of sensitivity analysis. Specifically, one may look at how confidence regions for objects of interest change as one varies over sensible values, for example, . Because the exercise starts with a model selected through a high-quality model selection procedure, setting corresponds to this procedure producing no model selection mistakes which happens in scenarios where oracle model selection is possible; see **[30]**, **[60]**, **[17]**. As one then considers increasing , one is considering scenarios where the initial selector is allowed to have made increasingly many selection mistakes. By looking at several values for , one thus gains insight into how sensitive conclusions are to the number of model selection mistakes made by the initial selector. This approach is similar to applications of sensitivity analysis in treatment effects estimation where a variety of approaches to sensitivity analysis exist for gauging sensitivity of causal estimators to violations of underlying identifying assumptions; see, for example, **[45]** and **[41]** for textbook reviews of classic approaches.
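The sensitivity analysis described above can be sketched as a path of intervals indexed by the assumed number of selection mistakes. The code below is an illustration under assumptions, not the authors' procedure: it uses a linear functional `x0 @ theta`, homoskedastic OLS intervals, and a greedy search that adds one variable at a time to an upper model and a lower model. The names `interval` and `sensitivity_path`, and the simulated data, are hypothetical.

```python
import numpy as np

def interval(X, y, cols, x0, z=1.96):
    """Homoskedastic OLS interval for x0' theta using the columns in `cols`."""
    XS, x0S = X[:, cols], x0[cols]
    n, k = XS.shape
    gram_inv = np.linalg.inv(XS.T @ XS)
    coef = gram_inv @ (XS.T @ y)
    resid = y - XS @ coef
    se = np.sqrt((resid @ resid) / (n - k) * (x0S @ gram_inv @ x0S))
    est = x0S @ coef
    return float(est - z * se), float(est + z * se)

def sensitivity_path(X, y, S0, x0, s_max):
    """Intervals for 0, 1, ..., s_max allowed mistakes (greedy augmentation)."""
    p = X.shape[1]
    up, low = list(S0), list(S0)
    lo, hi = interval(X, y, up, x0)
    path = [(0, lo, hi)]                       # zero mistakes: initial model only
    for s in range(1, s_max + 1):
        # Add the variable that pushes the upper endpoint highest ...
        j_up = max((j for j in range(p) if j not in up),
                   key=lambda j: interval(X, y, up + [j], x0)[1])
        # ... and the one that pushes the lower endpoint lowest.
        j_low = min((j for j in range(p) if j not in low),
                    key=lambda j: interval(X, y, low + [j], x0)[0])
        up.append(j_up)
        low.append(j_low)
        hi = max(hi, interval(X, y, up, x0)[1])
        lo = min(lo, interval(X, y, low, x0)[0])
        path.append((s, lo, hi))
    return path

rng = np.random.default_rng(3)
n, p = 150, 30
X = rng.standard_normal((n, p))
theta = np.zeros(p)
theta[:4] = [1.0, 0.8, 0.3, 0.3]
y = X @ theta + rng.standard_normal(n)
x0 = np.ones(p)

path = sensitivity_path(X, y, S0=[0, 1], x0=x0, s_max=5)
for s, lo, hi in path:
    print(f"allowed mistakes = {s}: [{lo:.2f}, {hi:.2f}]")
```

Reading down the printed path shows how quickly a conclusion (for example, that the functional is positive) is lost as more selection mistakes are allowed; the intervals are nested by construction, so a conclusion that survives at a given bound also holds at every smaller bound.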
Two examples give an illustration of the targeted undersmoothing procedure. The first example studies heterogeneous treatment effects in the Job Training Partnership Act of 1982. The second example studies expected profit from individually-targeted advertising strategies derived from estimates of heterogeneous treatment effects. In the first example we find that under mild assumptions on the sparsity level, it is not possible to reject the null hypothesis that the individual-specific heterogeneous treatment effect is zero for most individuals. However, we reject the null hypothesis of no heterogeneity fairly robustly, even though we cannot pin down individual effects reliably. By contrast, in the advertising example, we see that the confidence intervals for the parameters we estimate are relatively robust to different assumptions about the true underlying sparsity level. We find strong evidence suggesting heterogeneous responses of individuals to direct mail advertising. We also find strong evidence that strategic mailing to individuals based on their characteristics yields substantially higher profits than either of two simple fixed mailing strategies we consider.
Finally, the paper presents a simulation study. The simulation design is motivated by the direct mailing marketing campaign example. An interesting feature of the simulation study is that using is sufficient for producing correct coverage probabilities in almost all designs, even when and as large as 16. We find that procedures which make use of model selection but rely on perfect model recovery may have seriously distorted coverage, confirming previous results in the literature.
2 Preliminaries: Rates of Convergence for Estimated Functionals of High-Dimensional Sparse Models
This section serves as a preliminary to the main proposed inferential procedure by formally deriving some simple convergence rates for estimators of various classes of functionals based on a model chosen with a formal model selection procedure. These results verify that estimators of even dense functionals based on sparse, post-model-selection estimators may have favorable statistical properties, though they do not deliver a formal inferential procedure. In Section 3, we give a procedure for constructing confidence regions around the estimates described in the present section.
2.1 Framework
Throughout, we simply write , etc, excluding from the notation. Operations throughout the analysis are performed for each . In the asymptotic analysis, all objects should be understood to belong to sequences - , etc - each indexed by .
For a sample size , consider a dataset
[TABLE]
which is a random sample jointly distributed according to a distribution supported on some subset . The random variables are the observations and are indexed by for sample size . Recall the classical definition of a statistical model is a set of distributions on F. The statistical model is well-specified if .
Often it is convenient to associate a parameter to the set . Here, we consider an association , where and . Therefore, each value of associates to a subset of the statistical model. We assume that and that .
When the context is clear and there is no chance for confusion, we abuse notation slightly. In discussing probabilities of events , we write to mean . That is, probabilities, unless otherwise noted, are always taken with respect to the measure of the data generating process. This reduces clutter in the presentation.
We are primarily interested in high-dimensional applications where is large compared to and thus assume sparsity: we maintain that only a small subset of the components of are nonzero. We set and we define , the number of nonzero components of the vector . (Footnote 8: The setting and results in this paper can be extended to the case that can be decomposed into a sparse component and a small component, so that , , .) In this setting, it is natural to consider estimators of which are based on model selection.
Definition 1.
A model selection procedure is defined by a map In addition, a model-based estimator is a map such that for and , . The composition , where is the identity, , defines a post-model selection estimator .
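To make the composition in Definition 1 concrete, the following is a minimal sketch rather than the procedure used in this paper. It assumes a linear model, uses a hypothetical selection map (keep the `k` covariates most correlated with the outcome) in place of a lasso-type selector, and uses least squares on the selected columns as the model-based estimator. All function names are illustrative.

```python
import numpy as np

def select_model(X, y, k):
    """Hypothetical selection map: indices of the k covariates most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0))
    return sorted(np.argsort(score)[-k:].tolist())

def fit_on_model(X, y, S):
    """Model-based estimator: least squares on the columns in S, zeros elsewhere."""
    theta = np.zeros(X.shape[1])
    if S:
        coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        theta[S] = coef
    return theta

def post_selection_estimator(X, y, k):
    """Composition of the two maps: select a model, then estimate within it."""
    S = select_model(X, y, k)
    return S, fit_on_model(X, y, S)

# Simulated sparse design (an assumption for illustration only).
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[[0, 3]] = [2.0, -1.5]
y = X @ theta0 + 0.5 * rng.standard_normal(n)

S_hat, theta_hat = post_selection_estimator(X, y, k=2)
```

The estimate `theta_hat` is sparse by construction: its support is contained in `S_hat`, which is the structure the rest of this section relies on.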
It is convenient to define a notion of high dimensional convergence, which depends on , , and . Let be any measurable estimator . Let denote the support of , , and let denote the number of nonzero elements of . We define to be the Euclidean norm, and to be the -normalized Euclidean norm on . The following definition is not standard, but useful in our discussion.
Definition 2.
The sequence , or more generally , is high-dimensionally consistent over a class of sequences if with probability and , uniformly over . We abbreviate this by writing or .
Existence of estimators will be taken as a given high level condition. Many such estimators have been proposed and analyzed in the literature; see, for example, the textbook **[16]** and references contained there. Since our interest in this paper is in inference for functionals, for brevity we do not restate sets of low-level conditions for specific estimators. Rather, we focus on understanding the extent to which sparse estimators that satisfy Definition 2 can be used to reliably estimate large classes of functionals of the unknown parameter and the observed data.
The choice to consider only estimators featuring the rate comes at a slight loss of generality, in favor of being concrete. Most standard high dimensional estimators will achieve the above rates. In other cases, the arguments can be easily adapted.
The next two subsections discuss estimation and statistical inference for general post-model-selection estimation techniques. Researchers are often interested in a functional of a statistical model. In economics, common examples of functionals of interest are average treatment effects, heterogeneous treatment effects, demand elasticities, etc. An advantage of post-model-selection estimators is that the same selected model can be used to estimate a wide range of functionals.
2.2 Explicitly defined functionals
In this first example, we consider functionals which may depend on and . We consider the entire collection , and we will be interested in understanding how well approximates in the norm.
We define the following notion of linearizability, which will be useful in establishing the next theorem.
Definition 3.
Linearization of . For each , there is , linear, and such that for every we have
[TABLE]
In addition, define by the matrix
[TABLE]
and, for a set , set to be the largest eigenvalue of the principal submatrix of corresponding to the index set .
Theorem 1.
Suppose . Suppose further that are in a sequence of functionals which satisfy Definition 3. Then
[TABLE]
Proof.
. The first term is bounded by . The second term is bounded by . Noting that completes the proof. ∎
When is uniformly linearizable in the sense that , and does not blow up over subsets in the sense that for sufficiently large, then the convergence rates simplify to
When , then and . The quantities are known as maximal sparse eigenvalues. Under mild conditions on , (see **[14]**, **[9]**), the relevant sparse eigenvalues can be bounded by . In this case, the convergence rate is attained from Theorem 1. Another application relevant to the empirical examples below is of estimating heterogeneous treatment effects. Suppose can be partitioned into with the two components giving individual characteristic effects and characteristic-by-treatment interaction effects. Both empirical illustrations below have such structure. Then if , a consequence of the above theorem is .
2.3 Implicitly defined functionals
The next theorem considers a different class of functionals of the parameter . We express the target in the context of m-estimators, following e.g. **[43]** and **[26]**. We focus on estimation of . We assume that is defined as a solution to moment conditions given by a function , which takes values in . (Footnote 9: Extension to the setting where and are finite dimensional vectors with is trivial, but requires additional notation.) Explicitly, we assume our parameters are defined as a solution to
[TABLE]
One sensible estimator is obtained by using a plug-in , calculated in a previous estimation step. Then is defined via the sample moment:
[TABLE]
for some compact set which does not depend on and which contains . In the development below, we simplify notation and write
[TABLE]
We impose regularity conditions on the functions and below before giving the rates of convergence for estimated according to the above method.
Definition 4.
Define the following sets centered around relative to sequences :
[TABLE]
[TABLE]
Definition 5.
Linearization of . For each there is and , linear, such that for every , we have
[TABLE]
and uniformly over
Definition 6.
Uniform Stochastic Equicontinuity. We have the following bound uniformly over
[TABLE]
Definition 7.
Identifiability. Let The parameter is identifiable if exists and
[TABLE]
for all for some sequence .
High level conditions like those captured in Definitions 4-7 are routinely used in m-estimation problems and can be established under a variety of primitive conditions. Definition 4 simply defines appropriate local neighborhoods of the true parameters and for use in Definitions 5 and 6. Definition 5 defines a linearization of the “population” objective function . This is a relatively weak condition which importantly does not require that is smooth. Definition 6 provides a uniform law of large numbers. This condition can also be shown under weaker stochastic equicontinuity conditions like those in **[43]** with additional assumptions on the data generating process (like independent observations). For example, if is smooth with probability 1, then can be defined analogously to above. In this case, the statement in the definition of stochastic equicontinuity given in Definition 6 follows under the following three conditions: (1) a classical stochastic equicontinuity assumption, ; (2) a condition on the quality of linearization where and for defined analogously to ; and (3) a uniform law of large numbers over where . Definition 7 ensures that, given knowledge of the data generating process, is uniquely defined.
Finally, let denote the component of a vector . For a set , let denote a vector with components .
Theorem 2.
Consider . Suppose the conditions on the sets given in Definition 4 are met with . Suppose that satisfies Definitions 5-7. Then for defined above,
[TABLE]
Proof.
Note, for sufficiently large, since . Let be the rate given in the statement of the theorem. By the identifiability assumption, we have that for any ,
[TABLE]
It therefore suffices to show that . By the triangle inequality, we have that
[TABLE]
where we define , , and . is by linearity (applying the Cauchy-Schwarz inequality to the term in the linearization.) Second, is by the uniform law of large numbers that follows from the imposed conditions. Finally, by construction of the estimator, we have . Application of the uniform law of large numbers gives . Application of linearization gives .
∎
When the functional of interest is linear in , then for some . In this case, we can set . This gives , and . Furthermore, . In this sense, the size of the vector is directly related to the calculated rate of convergence. Note that a point forecast in a linear model is an example of this case.
Specializing further to the case that , note has only a single nonzero component. In this case, for every containing the element 1. The corresponding rate of convergence is . This rate is slower than the parametric rate of . Note that under certain regularity conditions, like those described in **[13]**, can be estimated at the parametric rate.
Despite the slower rates of convergence in some situations, the estimates described above do have the desirable property of simplicity. The simplicity becomes more desirable when is more complicated than a linear functional. In the simulation section of this paper, we compare estimators of using both the plug-in estimate described above, as well as a procedure based on **[13]** to quantify any potential loss in estimation quality in certain finite sample settings.
3 Targeted Undersmoothing as an Inferential Procedure
The previous sections show that many functionals of interest can be calculated accurately from a single estimated high dimensional model. In this section, we consider inference for functionals . (Footnote 10: In Section 2.2, we also considered an entire profile . We note here that we will be able to construct pointwise confidence regions for . Uniform confidence regions would require additional adjustment.)
We make the strong but important assumption that the researcher has a known upper bound, , on the number of model selection mistakes, defined by . If the researcher has a prior assumption on , but is unwilling to make assumptions on , one may also take . Formally, we assume with probability . As earlier, we assume that for , there is an estimator which depends on and the data . In addition, assume we can construct for each with cardinality less than , an observable random interval which will cover with a desired pre-specified frequency if . In other words, we maintain that the true model is relatively low-dimensional and that, if told the exact form of the true model, we could construct valid inferential statements for the object of interest conditional on estimating the true model.
Given these assumptions, we can define the following inferential procedure:
Algorithm 1. Targeted Undersmoothing.
Step 1. Select a model by a fixed model selection procedure M.
Step 2. For each , let be an associated random interval. Select
[TABLE]
[TABLE]
Step 3. Set
Algorithm 1 takes an initially selected model and then searches for deviations that include that model and add no more than extra variables. To choose how to add variables, we do not look at model fit but rather at which deviation leads to the largest change in inferential statements about the parameter of interest. In the case of a confidence interval, we do this separately for the upper and lower bound of the interval. Intuitively, this formulation conservatively captures the worst-case impact of up to model selection mistakes on inference for the target quantity. Figure 1 gives a schematic representation of the model selection timeline corresponding to Algorithm 1.
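The search described in Algorithm 1 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the results in this paper: the target is assumed to be a linear functional `a @ theta` in a linear model, the per-model interval is a homoskedastic OLS interval, and the names `ols_interval`, `targeted_undersmoothing` and `s_bar` (the bound on model selection mistakes) are hypothetical. The search is exhaustive, so it is only practical for small problems.

```python
import itertools
import numpy as np

def ols_interval(X, y, S, a, z=1.96):
    """Nominal 95% interval for a[S] @ theta_S from homoskedastic OLS on columns S."""
    XS = X[:, S]
    n, k = XS.shape
    G_inv = np.linalg.inv(XS.T @ XS)
    coef = G_inv @ (XS.T @ y)
    resid = y - XS @ coef
    sigma2 = resid @ resid / (n - k)
    est = a[S] @ coef
    se = np.sqrt(sigma2 * (a[S] @ G_inv @ a[S]))
    return est - z * se, est + z * se

def targeted_undersmoothing(X, y, S_hat, a, s_bar):
    """Search all models that contain S_hat and add at most s_bar columns."""
    S_hat = list(S_hat)
    others = [j for j in range(X.shape[1]) if j not in S_hat]
    lo, hi = ols_interval(X, y, S_hat, a)
    S_low, S_up = S_hat, S_hat
    for m in range(1, s_bar + 1):
        for extra in itertools.combinations(others, m):
            S = S_hat + list(extra)
            l, u = ols_interval(X, y, S, a)
            if l < lo:          # lower model: smallest lower endpoint
                lo, S_low = l, S
            if u > hi:          # upper model: largest upper endpoint
                hi, S_up = u, S
    return (lo, hi), S_low, S_up

# Simulated example (assumed design, for illustration only).
rng = np.random.default_rng(1)
n, p = 150, 8
X = rng.standard_normal((n, p))
theta0 = np.array([1.0, -1.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ theta0 + rng.standard_normal(n)
a = np.ones(p) / p        # a dense functional: the average coefficient
S_hat = [0, 1]            # pretend the selector missed column 2

base = ols_interval(X, y, S_hat, a)
ci1, _, _ = targeted_undersmoothing(X, y, S_hat, a, s_bar=1)
ci2, _, _ = targeted_undersmoothing(X, y, S_hat, a, s_bar=2)
```

By construction the interval can only widen as `s_bar` grows, which is why `ci2` contains `ci1` and `ci1` contains the interval from the initially selected model.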
In order to give a formal result describing the properties of the targeted undersmoothing procedure, define the following simple condition:
Definition 8.
The intervals , , have uniform coverage probability over if
[TABLE]
Theorem 3.
Consider Algorithm 1. Suppose that the intervals have uniform coverage probability over . In addition, the sparsity bound satisfies . Then
[TABLE]
Proof.
The theorem follows from . The right-hand side has bounded by by assumption. ∎
Note that when is given by M for some , then can be taken as deterministic, using , where the term corresponds to the implied bound in the definition of .
The high-level assumption that the intervals have uniform coverage probability over is stronger than the lone assumption that covers with probability . Sufficient conditions guaranteeing uniform coverage probability over are easily stated for special cases like the high-dimensional linear model. Such conditions are commonly employed in the econometrics literature (see for example **[9]**) and are characterized by (1) probabilistic lower and upper bounds on minimal and maximal sparse eigenvalues of the matrix , (2) moment conditions on the covariates and residual terms, and (3) rate conditions on and . Nevertheless, a result which uses a weaker notion than uniform coverage probability over could also be desirable.
The main problem in deriving such a result under weaker conditions stems from the fact that
[TABLE]
If is selected and contains some , then for each . One way in which this issue can be addressed is if M has the further property that there exists a fixed set such that and is bounded by asymptotically. If in addition, the sparsity bound satisfies , then the statement of the theorem, , is recovered. Informally, this condition states that the set of variables which are liable to be falsely selected into can be controlled by . (Footnote 11: We conjecture that in linear regression models under irrepresentability conditions on the design matrix, we may take . However, since is a user-specified tuning parameter in the first place, we do not follow this line of reasoning in this paper.)
Another procedure avoiding the assumption of uniform coverage probability over could be constructed by foregoing the initial model selection procedure, and taking . This would eliminate the problem. However, we note that taking will consider models which are in no sense local to the true model. This implies that such a procedure could fail to have power against many fixed alternatives.
An alternative to the above assumption is to adopt a sample splitting strategy. We partition the set into a disjoint union of sets of equal (or approximately equal) size, uniformly at random. We perform initial model selection on Sample A. We calculate using only Sample B. Formally, we outline the procedure here:
Algorithm 2. Targeted Undersmoothing with Sample Split.
Step 0. Partition the sample into disjoint sets .
Step 1. Select a model by the model selection procedure where is the data restricted to the subsample .
Step 2. For each , let be the associated random interval calculated using sample . Select
[TABLE]
[TABLE]
Step 3. Set
Using this procedure allows the uniform coverage probability assumption discussed above in Definition 8 to be dropped. Instead, we adopt the following:
Definition 9.
The intervals , , have pointwise coverage probability over if for sequences such that ,
[TABLE]
Theorem 4.
Consider Algorithm 2. Suppose that the intervals have pointwise coverage probability over . In addition, the sparsity bound satisfies . Then
[TABLE]
Proof.
The theorem follows from . The right-hand side has bounded by , using the fact that sample is independent of sample . ∎
Algorithm 2 will in general produce wider confidence intervals, since it is constrained to use only sample for inference. In our simulation study, we find that Algorithm 1 gives good coverage probabilities in all of the designs we tried.
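A minimal sketch of the sample-splitting variant in Algorithm 2 follows. It is illustrative only: the selector `select_top_k` is a hypothetical stand-in for the model selection procedure, the interval rule is a homoskedastic OLS interval for an assumed linear functional `a @ theta`, and the search allows at most one added variable to keep the example short.

```python
import numpy as np

def split_indices(n, rng):
    """Randomly partition {0, ..., n-1} into two halves A and B."""
    perm = rng.permutation(n)
    return np.sort(perm[: n // 2]), np.sort(perm[n // 2:])

def select_top_k(X, y, k):
    """Hypothetical selector: the k columns with the largest |X_j' (y - mean(y))|."""
    score = np.abs(X.T @ (y - y.mean()))
    return sorted(np.argsort(score)[-k:].tolist())

def ols_interval(X, y, S, a, z=1.96):
    """Nominal 95% interval for a[S] @ theta_S from homoskedastic OLS on columns S."""
    XS = X[:, S]
    n, k = XS.shape
    G_inv = np.linalg.inv(XS.T @ XS)
    coef = G_inv @ (XS.T @ y)
    resid = y - XS @ coef
    sigma2 = resid @ resid / (n - k)
    est = a[S] @ coef
    se = np.sqrt(sigma2 * (a[S] @ G_inv @ a[S]))
    return est - z * se, est + z * se

def tu_one_step(X, y, S_hat, a):
    """Targeted undersmoothing allowing at most one added column."""
    lo, hi = ols_interval(X, y, S_hat, a)
    for j in range(X.shape[1]):
        if j not in S_hat:
            l, u = ols_interval(X, y, S_hat + [j], a)
            lo, hi = min(lo, l), max(hi, u)
    return lo, hi

# Simulated example (assumed design, for illustration only).
rng = np.random.default_rng(2)
n, p = 300, 10
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[[0, 1]] = [2.0, -2.0]
y = X @ theta0 + rng.standard_normal(n)
a = np.ones(p) / p

A, B = split_indices(n, rng)                  # Step 0
S_hat = select_top_k(X[A], y[A], k=2)         # Step 1: selection uses sample A only
lo, hi = tu_one_step(X[B], y[B], S_hat, a)    # Steps 2-3: intervals use sample B only
```

Because the intervals are computed on observations that played no role in selection, the selected model can be treated as fixed when reasoning about coverage on sample B, at the cost of using only half of the data for inference.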
Comment 3.1.
Beyond giving a procedure for constructing confidence sets, targeted undersmoothing can also be used for sensitivity analysis. Theoretical properties of targeted undersmoothing depend on unknown - and to the best of our knowledge unlearnable - . Rather than assuming is known, trying several values allows the researcher to see how sensitive confidence intervals and inference are to different values . We use this practice in the empirical examples and the simulation exercises below.
Comment 3.2.
The above proposed algorithm is potentially computationally infeasible with even a moderate number of explanatory variables. Therefore, in order to implement the procedure in practice, it may be necessary to approximate the quantities .
Depending on the exact nature of the problem, different approximations or bounds might be obtained with different methods. For all of our simulation results and data applications in this paper, we add covariates indexed by into and according to a simple greedy rule. To be explicit, we perform the following algorithm:
Algorithm 3. Greedy Approximation for .
Initialize:
While
Set
Set
Set
Set
End
Set
Set
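The greedy rule can be sketched in code as below. This is one plausible reading of the forward-selection idea, written under assumptions: the target is a linear functional `a @ theta`, the per-model interval is a homoskedastic OLS interval, and the running minimum and maximum along the greedy path are reported (an illustrative choice). The names are hypothetical.

```python
import numpy as np

def ols_interval(X, y, S, a, z=1.96):
    """Nominal 95% interval for a[S] @ theta_S from homoskedastic OLS on columns S."""
    XS = X[:, S]
    n, k = XS.shape
    G_inv = np.linalg.inv(XS.T @ XS)
    coef = G_inv @ (XS.T @ y)
    resid = y - XS @ coef
    sigma2 = resid @ resid / (n - k)
    est = a[S] @ coef
    se = np.sqrt(sigma2 * (a[S] @ G_inv @ a[S]))
    return est - z * se, est + z * se

def greedy_targeted_undersmoothing(X, y, S_hat, a, s_bar):
    """Grow the lower and upper models one variable at a time, s_bar times."""
    p = X.shape[1]
    S_low, S_up = list(S_hat), list(S_hat)
    lo, hi = ols_interval(X, y, S_low, a)
    for _ in range(s_bar):
        # Lower model: add the variable giving the smallest lower endpoint.
        cand = [j for j in range(p) if j not in S_low]
        j_low = min(cand, key=lambda j: ols_interval(X, y, S_low + [j], a)[0])
        S_low = S_low + [j_low]
        lo = min(lo, ols_interval(X, y, S_low, a)[0])
        # Upper model: add the variable giving the largest upper endpoint.
        cand = [j for j in range(p) if j not in S_up]
        j_up = max(cand, key=lambda j: ols_interval(X, y, S_up + [j], a)[1])
        S_up = S_up + [j_up]
        hi = max(hi, ols_interval(X, y, S_up, a)[1])
    return (lo, hi), S_low, S_up

# Simulated example (assumed design, for illustration only).
rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:3] = [1.0, -1.0, 0.3]
y = X @ theta0 + rng.standard_normal(n)
a = np.ones(p) / p
S_hat = [0, 1]
s_bar = 3

base = ols_interval(X, y, S_hat, a)
ci, S_low, S_up = greedy_targeted_undersmoothing(X, y, S_hat, a, s_bar)
```

Each greedy step fits on the order of `p` models, so the total cost grows linearly in `s_bar` rather than combinatorially as in an exhaustive search.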
We note that other approximations to are also possible. For example, semidefinite relaxations can give relatively quickly computable, valid lower bounds on and upper bounds on in some cases. One could also adopt other solution techniques for obtaining approximate solutions to nonlinear integer programming problems. Further exploration of these options may be useful, though we found the simple greedy algorithm presented above to perform well relative to other options in initial simulations.
Comment 3.3.
It is worth noting that targeted undersmoothing can also be used to carry out hypothesis testing. This follows directly from the fact that confidence intervals can be constructed from inverted test statistics and vice versa. Suppose the hypothesis of interest is for a prespecified value . Suppose, given a model , that is an observable test statistic and that corresponds to a p-value . Then targeted undersmoothing can be used by choosing and by taking the set which makes the test most conservative (equivalently, maximizing ).
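As a hedged illustration of this testing idea, the sketch below searches greedily for the enlargement of the selected model that makes a test most conservative, that is, that maximizes the p-value. It assumes a scalar linear hypothesis `a @ theta = c` in a linear model with a homoskedastic Wald statistic (one degree of freedom), which is simpler than the multi-parameter Wald tests used in the empirical examples. Function names are hypothetical.

```python
import math
import numpy as np

def wald_pvalue(X, y, S, a, c):
    """p-value for H0: a[S] @ theta_S = c from homoskedastic OLS on columns S."""
    XS = X[:, S]
    n, k = XS.shape
    G_inv = np.linalg.inv(XS.T @ XS)
    coef = G_inv @ (XS.T @ y)
    resid = y - XS @ coef
    sigma2 = resid @ resid / (n - k)
    var = sigma2 * (a[S] @ G_inv @ a[S])
    W = (a[S] @ coef - c) ** 2 / var
    # Upper tail of a chi-square(1): P(|Z| > sqrt(W)) = erfc(sqrt(W / 2)).
    return math.erfc(math.sqrt(W / 2.0))

def most_conservative_pvalue(X, y, S_hat, a, c, s_bar):
    """Greedily add up to s_bar columns, each time maximizing the p-value."""
    S = list(S_hat)
    p_best, S_best = wald_pvalue(X, y, S, a, c), list(S)
    for _ in range(s_bar):
        cand = [j for j in range(X.shape[1]) if j not in S]
        j_star = max(cand, key=lambda j: wald_pvalue(X, y, S + [j], a, c))
        S = S + [j_star]
        p_new = wald_pvalue(X, y, S, a, c)
        if p_new > p_best:
            p_best, S_best = p_new, list(S)
    return p_best, S_best

# Simulated example (assumed design, for illustration only).
rng = np.random.default_rng(4)
n, p = 200, 20
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:2] = [0.2, 1.0]
y = X @ theta0 + rng.standard_normal(n)
a = np.zeros(p)
a[0] = 1.0                      # test a single coefficient
S_hat = [0, 1]

p_init = wald_pvalue(X, y, S_hat, a, 0.0)
p_tu, S_used = most_conservative_pvalue(X, y, S_hat, a, 0.0, s_bar=3)
```

The reported p-value is never smaller than the one from the initially selected model, so a rejection that survives the search is robust to the allowed number of model selection mistakes.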
4 Empirical Examples
In this section, we illustrate the use of targeted undersmoothing in two examples. First, we study effects of job training programs on wages. We are interested in estimating heterogeneous treatment effects in a setting where several individual characteristics are observed. In the second example, we are interested in making individual-specific mailing strategies and estimating the profit gain from such a strategy.
4.1 Application I: Heterogeneous Treatment Effects from JTPA
The impact of job training programs on the earnings of trainees, especially those with low income, is of interest to both policy makers and academic economists. Evaluating the heterogeneous causal effects of training programs on earnings is difficult because individual characteristics vary across the sample; it is unlikely that many individuals share exactly the same values of observed covariates. The problem is made worse the higher the dimension of the collected covariates.
We consider data available from a randomized training experiment conducted under the Job Training Partnership Act (JTPA). In the experiment, people were randomly assigned the offer of JTPA training services. Given the random assignment of the offer of treatment, we focus this exercise on estimating the average treatment effect of the offer of treatment, or the intention to treat effect, conditional on individual characteristics. In this example, we limit the analysis to the sample of adult males.
To capture the effects of training on earnings, we estimate a model of the form
[TABLE]
where indicates whether training was offered, the outcomes are earnings, is a vector of covariates which includes a constant, is an unobservable, and are parameters. Earnings are measured as total earnings over the 30 month period following the assignment into the treatment or control group, and average earnings in the sample are $19,147. Observed control variables are dummies for black and Hispanic persons, a dummy indicating high-school graduates and GED holders, five age-group dummies, a marital status dummy, a dummy indicating whether the applicant worked 12 or more weeks in the 12 months prior to the assignment, a dummy signifying that earnings data are from a second follow-up survey, and dummies for the recommended service strategy. See **[3]** for detailed information regarding data collection procedures, sample selection criteria, and institutional details of the JTPA along with additional facts and discussion about the JTPA training experiment. In all, the dataset has 5102 observations.
In this example, we are interested in estimating confidence intervals for individual specific treatment effects. We form estimates by first calculating the post-lasso estimator of the coefficients
[TABLE]
using the procedure described in the Implementation Appendix. Then for each individual , we calculate the individual-specific intent to treat effect given by
[TABLE]
There are many ways to construct regressors from the set of dummy variables available. In this example, we consider two methods to generate regressors. The first method is based on the common practice in econometrics of generating interactions. The second method is based on the Hadamard-Walsh expansion of the indicator variables described below, which generates a far larger set of regressors. (Footnote 12: Details about this expansion as well as some of its advantages are described in [46].) A larger set of regressors has advantages in that it can make any sparsity assumptions more plausible, though the resulting analysis may suffer in terms of statistical precision due to the increased complexity of the underlying model space.
To obtain the first construction we use for , we consider all products of the discrete variables available. That is, we adopt the common convention of including the dummy variables themselves, all first order interactions between the main dummy variables, all second order interactions, and all further higher order interactions. Excluding empty and small cells, the dimension of the covariate space is 313. (Footnote 13: Specifically, we start by eliminating all variables with nonzero entries in either the control or treated subsample. After these deletions, we then remove any variables if the corresponding diagonal R term in QR decomposition of the design matrix was over either the control or treated subsample.) Therefore, with the treatment variable and constant, the total number of unknown parameters is 628. Though the number of observations is larger than the number of parameters, the number of parameters is large enough that regularized estimation would be extremely helpful in terms of obtaining informative inference about model parameters.
Figure 3 presents pointwise confidence intervals for the individual specific effects for all individuals. (Footnote 14: In principle, other descriptions of the treatment effect distribution can also be reported. For instance, uniform bands for the sorted effects function could be obtained by combining the results in [23] with targeted undersmoothing. We choose to present pointwise confidence intervals for simplicity.) The intervals are calculated using four methods. The first panel presents estimates which use the entire set of control variables. The second panel presents oracle-style confidence intervals based on post-lasso which ignore first stage model selection. The third panel presents targeted undersmoothing estimates using . The fourth panel presents targeted undersmoothing estimates using . The targeted undersmoothing intervals are calculated with the forward selection greedy approximation described in Section 3. In each case, we use Algorithm 1.
The figure shows that the resulting confidence interval lengths using OLS estimates are quite large. The interval lengths using the oracle-style confidence intervals are comparatively very tight, though the oracle-style intervals are expected to have poor performance in finite samples. Using we see that many of the interval lengths increase by nearly an order of magnitude. Interestingly, though, there is wide variation across individuals in terms of how much the corresponding confidence interval grows. With , we see that the intervals are in some cases nearly as large as with the OLS-based intervals. For most individuals, the corresponding intervals contain zero.
Another testable hypothesis of interest is whether there is evidence of any effect heterogeneity. Within the model, testing the null hypothesis of no treatment heterogeneity is equivalent to testing . As described in the previous section, a test can be implemented using the targeted undersmoothing procedure. We implement this procedure using the standard Wald test. The results are reported in Table 1 for targeted undersmoothing with the bound on the number of model selection mistakes ranging from 1 to 10 (rows TU(1) through TU(10)). We also report the corresponding Wald test using the entire vector of covariates (labeled OLS in the table), and an oracle-style Wald test (labeled PL in the table). We note that the OLS-based result is likely unreliable due to relying on a heteroskedasticity-consistent estimate of a large, full covariance matrix. We reject the null hypothesis at the 5% level for TU(1) through TU(7) but fail to reject for TU(8) through TU(10). An interesting feature of these results is that the degrees of freedom stay constant at 7. This means that the additional covariates entering the model correspond to components , and not the interaction terms .
The existence of a sparse representation of the regression function in the basis given by the interaction expansion is an important modeling assumption in the above analysis. It is possible to perform a further robustness analysis by considering more expansive models. In order to illustrate this point, we perform the analysis with an expanded set of transformations of the original dummy variables. We consider the Hadamard-Walsh basis defined as follows. Let denote the original set of indicator variables. Let each subset index a transformation of given by . In the expanded model, we include regressors of the form . In order to nest the previous analysis, we also include all of the interaction variables from the first specification. (Footnote 15: We choose to only include terms as potential covariates for . Note that for , the resulting transformations are perfectly correlated to the original indicator variables.) The result is that , including the constant term. After interacting with the indicator , the total dimensionality of the model parameters is 5854, which exceeds the sample size .
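For readers unfamiliar with this basis, the sketch below generates Walsh-type transformations of a vector of 0/1 indicators. The exact formula used in the paper is not reproduced here; the code assumes the standard Walsh function, which maps a subset `A` of indicators to `(-1)` raised to the number of ones among them. The default `min_order=2` reflects the remark that single-variable terms are perfectly correlated with the original indicators, and is likewise an assumption. The function name is hypothetical.

```python
from itertools import combinations

def walsh_features(x, min_order=2):
    """Map a tuple of 0/1 indicators to {subset: (-1) ** (number of ones in subset)}."""
    k = len(x)
    feats = {}
    for order in range(min_order, k + 1):
        for A in combinations(range(k), order):
            feats[A] = (-1) ** sum(x[j] for j in A)
    return feats

x = (1, 0, 1)
feats = walsh_features(x)                    # subsets of size 2 and 3
singles = walsh_features(x, min_order=1)     # also includes single-variable terms
```

With `k` indicators this produces on the order of `2 ** k` regressors, which is why the expanded specification is far larger than the interaction expansion. For a single indicator the transformation equals `1 - 2 * x[j]`, an affine function of the indicator itself.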
Figure 4 presents pointwise confidence intervals for the individual specific effects for all individuals using the new, expanded set of transformations of the original variables. In this analysis, OLS is no longer feasible because the dimensionality of the model exceeds the sample size. The first panel presents oracle-style confidence intervals, which ignore first stage model selection. The estimated distribution of heterogeneous effects is much smoother than that obtained in Figure 3. Interestingly, the initial model selection selects terms from both the interaction expansion and the Hadamard-Walsh expansions. The second panel presents targeted undersmoothing estimates using , and the third panel presents targeted undersmoothing estimates using . The targeted undersmoothing intervals are calculated with the forward selection greedy approximation described in Section 3. As before, in each case, we use the single sample option described in Algorithm 1.
The figure shows that the resulting oracle-style confidence intervals are similar to those in Figure 3. Both sets of interval lengths are comparatively very tight, though, as discussed above, the oracle-style intervals are expected to have poor performance in finite samples. Using we see that many of the interval lengths increase as before. There still remains a set of individuals for whom the corresponding confidence interval excludes zero. With , for all individuals, the corresponding intervals contain zero. Though not pictured in Figure 4, we note that all intervals for individual-specific treatment effects include 0 as soon as .
Finally, we again report results for testing the null hypothesis of no treatment heterogeneity, , using the expanded model in Table 2. The procedure is implemented as before, using the standard Wald test, and the results are reported for targeted undersmoothing using . We see that we reject the null hypothesis for at the 5% level but fail to reject for .
Taken together, the results in this section suggest there is mild evidence for treatment effect heterogeneity in this example. We would reject the hypothesis of no heterogeneity and also obtain some evidence for individual specific treatment effects that differ from zero when using oracle model selection results. However, we cannot rule out the possibility of no treatment effect heterogeneity after allowing for a modest number of model selection mistakes within either of the bases considered. Thus, to draw strong conclusions about treatment effect heterogeneity, one must believe that the initial model selection procedure is very close to perfect in this example.
4.2 Application II: Heterogeneous Treatment Effects in Direct Mail
The targeting of individuals with appropriate interventions that induce preferred outcomes is a relevant problem in various application areas including business, political science and economics. In the field of marketing, such targeting has been the key instrument of retailers that use direct mail as the focal intervention to inform and persuade their customers to purchase from their catalogs. These catalogs are often relatively expensive to produce and firms spend significant amounts in this endeavor. (Footnote 16: In 2009, the estimated spending on catalogs was $15.1B; and over 10B catalogs were mailed in 2015 ([1], [2]).)
Our data for this example come from a large multi-product retailer that sells directly to consumers online but also via mail, phone and retail channels. The firm’s budget for direct-mailed catalogs is over 1.5B. The firm routinely runs experiments to evaluate the effectiveness of its catalog mailing strategy. Typically, these experiments have two conditions (mail, no-mail) that are randomized across customers. Our data focus on one such experiment that involved over 290,000 customers. The data also include a list of 486 descriptors of the individual customers. These descriptors include demographic characteristics (age, income, gender, state), details of past promotional activity they may have received, and their past consumption behavior, including purchases, the timing of such purchases, the number of orders in the past year, and the extent of their expenditures with the firm. This last set of variables is commonly referred to as RFM (Recency, Frequency and Monetary value) metrics in the direct mail industry and is widely used in analyzing and predicting customer behavior. We note that the design matrix in our analysis contains 2139 columns once categorical variables are expanded.
In our analysis, we estimate the following simple specification of a model with heterogeneous treatment effects:
[TABLE]
In the above, is an indicator that a consumer has been randomly assigned to receive a direct mail marketing instrument (a catalog), and the are customer characteristics. are dollar expenditures by the customer over a 3-month horizon following the mailing of the marketing instrument. For notational convenience, we assume that are i.i.d. draws, having the same distribution as the generic pair of random variables .
In this exercise, we assume that the firm is interested in evaluating a marketing strategy formed from targeting individuals based on their individual-specific treatment effects versus one of two simple baseline strategies - either mailing to no one or mailing to everyone. To this end, we note that a mailing strategy assigns customers with characteristics to either receive the mailing or not. We then adopt targeted undersmoothing to provide a simple mechanism that allows the firm to statistically evaluate the difference between any two competing mailing strategies on the basis of average expected profits. The average expected profit from implementing a strategy is given by
[TABLE]
A few points about the above quantity are worth noting. First, the firm has a known margin that applies to sales generated by its customers. For simplicity, we assume that the cost to the firm of targeting each consumer, , is constant and known ex ante. (Footnote 17: A more general approach would be to write costs as functions of . Implementing this approach would require specific data about individual mailing costs which we currently do not have. We could also assume that costs are drawn from some known distribution where the exact realization is unknown by the firm until after the mailings have been sent out and calculate expected profits integrating over this cost distribution.) Within the model, there is just one remaining source of uncertainty - the unanticipated demand shocks which are only observed via outcomes - which are assumed to have conditional mean zero.
We begin by examining two extremal mailing strategies where either no customers receive a catalog (‘no-mailings’) by setting uniformly or a ‘blanket-mailing’ strategy wherein all customers receive a catalog (i.e. for all ). For the no-mailings strategy expected profits are
[TABLE]
Similarly, the expected profit for the blanket mailing strategy can be written as
[TABLE]
A sophisticated firm might be interested in optimizing the mailing strategy based on expected consumer response. (Footnote 18: See [8] and [50] for interesting approaches to estimating and performing inference for optimal treatment strategies.) One simple, sensible mailing strategy would be to mail to a consumer with characteristics whenever the expected increment in profits for that customer exceeds costs. The rule can be described by
[TABLE]
Using this strategy, we then have expected per-consumer profit of
[TABLE]
Now suppose we wish to compare the targeted strategy to the ‘blanket’ or ‘no-mailing’ strategies. We can describe the difference in profit between the targeted and no-mailing strategies as
[TABLE]
Similarly, the difference between the targeted and blanket strategies would be
[TABLE]
We note that both of the expected per-person profit differentials capture the benefits due to cost savings and lost revenues of targeting based on expected treatment effects. Relative to targeting no one, targeting based on anticipated treatment effect has the potential to increase revenue at the cost of paying the treatment cost for the targeted individuals. Relative to treating everyone, targeting based on anticipated revenues has the potential to decrease costs by not targeting individuals for whom the treatment is anticipated to be ineffective.
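The rule and the two profit differentials described above can be sketched in a few lines. This is only an illustration under a stated assumption: per-customer profit is taken to be the margin times expenditure minus the mailing cost for mailed customers, so the expected gain from mailing a customer is the margin times that customer's expected treatment effect minus the cost. The names `tau_hat`, `margin`, `cost`, `targeting_rule`, and `profit_differentials` are hypothetical and do not come from the paper.

```python
import numpy as np

def targeting_rule(tau_hat, margin, cost):
    """Mail (1) when margin * expected treatment effect exceeds cost, else 0."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    return (margin * tau_hat > cost).astype(int)

def profit_differentials(tau_hat, margin, cost):
    """Sample-average profit gain of the targeted rule over each baseline.

    Assumes profit per customer = margin * expenditure - cost * mailed.
    """
    tau_hat = np.asarray(tau_hat, dtype=float)
    mail = targeting_rule(tau_hat, margin, cost)
    gain = margin * tau_hat - cost           # gain from mailing one customer
    vs_none = np.mean(mail * gain)           # targeted minus no-mailing
    vs_blanket = np.mean((mail - 1) * gain)  # targeted minus blanket mailing
    return vs_none, vs_blanket

# Four hypothetical customers with estimated treatment effects in dollars.
vs_none, vs_blanket = profit_differentials([0.0, 5.0, 10.0, 20.0],
                                           margin=0.5, cost=3.0)
```

Under this assumption both differentials are nonnegative by construction, since the rule mails exactly those customers whose expected gain is positive.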
Simple natural estimators exist for both and The natural estimator for is
[TABLE]
for some estimator . Similarly, a natural estimator of is
[TABLE]
for an estimator . Under the sparsity assumptions on the true model maintained in this paper and conventional regularity conditions, and will be asymptotically normal with standard errors that can be estimated via the delta method when is estimated from the true model. Based on this observation, we can apply the targeted undersmoothing approach to conduct inference on potential profit improvements from targeting based on the rule relative to the two simple baseline strategies.
We present estimates and targeted undersmoothing confidence intervals for and in Tables 3 and 4, respectively. (Footnote 19: As with the JTPA example, before any estimation is done, variables with a very small number of nonzero observations are excluded. In the first pass, variables with nonzero entries in the entire sample were eliminated. In the second pass, variables were eliminated if the corresponding diagonal R term in the design matrix QR decomposition was over either control or treated subsample.) In all calculations, the margin parameter is set to and the cost parameter is set at based on input from the firm. We first report OLS-based estimates, which use all covariates. In addition, we report oracle-style post-lasso estimates as well as targeted undersmoothing estimates for . We implement the first stage model selection using the procedure in Appendix 1. We use heteroskedasticity consistent standard errors and calculate confidence intervals using the delta method.
We see that the confidence intervals for the parameters and are very robust to different assumptions about the true underlying sparsity level . Interestingly, the OLS-based intervals are completely different from the targeted undersmoothing intervals for every value of reported. This difference is likely due to a failure of OLS in this example. In the setting of the simulation study below, we find that OLS-based intervals achieve poor coverage probabilities, with coverage as low as 0.00% in some settings. The poor performance of OLS in the simulation study is due to biases arising from taking a nonlinear transformation of the estimated coefficient vector and a failure of the standard delta method with a large number of covariates. (Footnote 20: Bias corrections for the delta method in settings with many covariates are described in [21]. For simplicity, we report the estimates and intervals which correspond to common practice.) In this example, the OLS-based estimates seem to overstate both and .
Finally, we test the hypothesis in Table 5. As in the previous example, this hypothesis corresponds to the hypothesis of no treatment effect heterogeneity. From a policy standpoint, understanding whether there is evidence for treatment effect heterogeneity may be interesting as there is clearly no gain from any targeting strategy based on observables if the treatment effect is constant across these observables. We note that the OLS-based result is likely unreliable due to relying on a heteroskedasticity-consistent estimate of a large, full covariance matrix, but we report the result for completeness. In this example, we see that the p-values are very near zero for all considered values of , suggesting that there is strong evidence against the hypothesis of no treatment effect heterogeneity that is robust to fairly large deviations from the initially selected model. As in the previous example, we also see that the degrees of freedom of the test are constant across the different values of , indicating that the additional variables being added all enter the model via the term. Adding variables to this part of the model that are correlated with the estimated treatment effect reduces the signal available to learn about treatment effect heterogeneity and thus intuitively provides “worst-case” deviations from the standpoint of drawing conclusions about the existence of this heterogeneity.
5 Simulation Study
In this section, we present a simulation study designed to demonstrate the properties of the proposed procedure in finite samples. We consider six simulation designs based on the example in Section 4.2. We generate data for each simulation replication as i.i.d. draws for from the model
[TABLE]
[TABLE]
where is a constant that is chosen so that the population of the regression of onto is , is an vector of ones, is an vector of zeros, is a vector with element given by , and denotes the Hadamard product. The six considered simulation designs are based on varying and . In all simulations, we take . We note that the process for the is meant to approximate what we see in the observables in the example in Section 4.2, which are all nonnegative with large fractions of observations exactly at 0. For each simulation design, we estimate and construct confidence sets for three functionals: (1) the value of a single coefficient (specifically ), (2) an individual treatment effect for a fixed hypothetical subject (with ), and (3) the average per-person profit differential from a targeting rule based on estimated individual-specific treatment effects and a rule which treats no one ( defined in Section 4.2).
For each set of model parameters, we simulate 500 replications and present the properties of several estimators:
1. True. An infeasible estimator based on ordinary least squares on the correct support of the underlying model.
2. All. An estimator based on ordinary least squares using all covariates.
3. Double. The post-double estimator as described in [13].
4. Lasso. An estimator based on lasso. Standard errors are computed using lasso residuals.
5. PL. An estimator based on the post-lasso estimator of [10]. Standard errors are computed using post-lasso residuals.
6. LCV. An estimator based on lasso with penalty level chosen by 10-fold cross validation. Standard errors are computed using lasso residuals.
7. ZB. Confidence intervals based on inverting the hypothesis test proposed in [58].
8. TU(1). Targeted undersmoothing with using Algorithms 1 and 3. Initial model description in Implementation Appendix.
9. TU(10). Targeted undersmoothing with using Algorithms 1 and 3. Initial model description in Implementation Appendix.
All standard errors are computed using conventional heteroskedasticity consistent standard errors (e.g. **[55]**) using the estimated residuals indicated above. We give details on implementation specifics in the following paragraphs. (Footnote 21: There are many choices about how to implement the different procedures, e.g. whether to split into treatment and control observations and which penalty parameters to use. The choices below were based on initial simulations where they seemed to produce the most favorable performance for the non-targeted undersmoothing approaches.)
For True, All, and Double, we directly estimate the model above. For Double, we apply **[13]** with a minor modification. We implement the relevant lasso regressions from **[13]** using the modified heteroskedastic lasso of Appendix 1.
To implement Lasso and PL, we use the implementation given in Appendix 1 to select a model. The PL estimates re-estimate coefficients by applying OLS with only the variables selected by lasso. For LCV, we use a modification of the procedure in Appendix 1, where 10-fold cross-validation within each subset is used to choose the tuning parameter to use in that subset. We then apply the conventional lasso within each subset based on these estimated tuning parameters. For these methods, we can then obtain estimates and standard errors for the functionals of interest in the obvious manner. ZB implements the proposed method of inference for dense linear functionals of a parameter vector from **[58]**. Finally, the PL model serves as our initial model when applying targeted undersmoothing. We apply targeted undersmoothing for .
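The post-lasso refit with heteroskedasticity consistent standard errors can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the selected column indices are already available, adds an intercept, and uses the basic HC0 sandwich form, whereas the paper only states that conventional heteroskedasticity consistent standard errors are used. The function name `post_selection_ols` and its arguments are hypothetical.

```python
import numpy as np

def post_selection_ols(y, X, selected):
    """OLS of y on an intercept and the selected columns of X.

    Returns coefficients and HC0 (sandwich) standard errors.
    """
    Xs = np.column_stack([np.ones(len(y)), X[:, selected]])
    XtX_inv = np.linalg.inv(Xs.T @ Xs)
    beta = XtX_inv @ Xs.T @ y
    resid = y - Xs @ beta
    # "Meat" of the sandwich: sum_i e_i^2 * x_i x_i'
    meat = Xs.T @ (Xs * resid[:, None] ** 2)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Hypothetical usage: refit on a model containing only column 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = 1.0 + 2.0 * X[:, 0] + 0.1 * rng.normal(size=500)
beta, se = post_selection_ols(y, X, [0])
```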
To measure the performance of the nine procedures, we report estimates of bias, standard deviation, root mean-square error (RMSE), coverage probability for a 95% confidence interval, and corresponding confidence interval length from the simulation in Tables Sim1-Sim6 and Figures Sim1-Sim6. In the figures, we provide average confidence interval lengths and coverage probabilities along the 10 steps of the forward selection path produced in the simulation. As a benchmark, we superimpose on the targeted undersmoothing path plots the coverage probabilities and interval lengths for the infeasible ‘True’ estimator, which knows the correct model.
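The reported performance measures can be computed from the simulation replications along the following lines. The sketch assumes that each replication supplies a point estimate and the two endpoints of its confidence interval; the function name `summarize_replications` and its arguments are hypothetical.

```python
import numpy as np

def summarize_replications(estimates, lower, upper, truth):
    """Bias, SD, RMSE, coverage, and mean interval length across replications."""
    est = np.asarray(estimates, dtype=float)
    lo = np.asarray(lower, dtype=float)
    hi = np.asarray(upper, dtype=float)
    return {
        "bias": est.mean() - truth,
        "sd": est.std(ddof=1),
        "rmse": np.sqrt(np.mean((est - truth) ** 2)),
        "coverage": np.mean((lo <= truth) & (truth <= hi)),
        "length": np.mean(hi - lo),
    }

# Three hypothetical replications with true value 2.
stats = summarize_replications([1.0, 2.0, 3.0],
                               lower=[0.0, 1.5, 2.5],
                               upper=[2.0, 2.5, 3.5],
                               truth=2.0)
```

Taking interval endpoints as inputs, rather than standard errors, lets the same summary be applied to intervals that are not symmetric around the point estimate.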
The ‘True’ estimator provides an infeasible benchmark which serves as a basis for comparison. In most simulations, the ‘True’ estimator achieves the target 95% coverage probability. In general, the ‘True’ estimator also achieves the smallest bias, RMSE, and shortest confidence intervals. All other estimators provide feasible alternatives that ideally would approximate the behavior of this infeasible benchmark.
When the number of parameters to be estimated is smaller than the sample size, a simple feasible option is to estimate the full model without any model selection. In terms of our simulation, this approach clearly results in small bias for the individual regression parameter and for the individual-specific treatment effect as both of these objects are linear combinations of the regression coefficients and the variables in the design are mean-independent of the error term. The cost of estimating the full model is decreased estimation precision as evidenced by relatively large standard deviation and RMSE relative to the other point estimators. We also see that the confidence intervals produced after estimating the full model are relatively long, often longer than the intervals resulting from targeted undersmoothing with small or moderate . The most interesting feature of the results based on the full model concerns estimation of the profit differential. For this object, the estimator is dominated by bias due to the profit differential depending nonlinearly on the model parameters and the imprecision in estimating these parameters. This bias then results in very poor coverage properties for the true profit differential. This behavior can be viewed as a failure of the delta method in moderate or high-dimensional models; see **[20]**. We suspect this behavior will carry over to many nonlinear settings.
We next examine the performance of ‘Lasso’ and ‘PL’. We note that the lasso penalty parameter in this case is set in a manner that theoretically provides lasso with an optimal rate of convergence and guarantees that the . We then conduct inference in these cases by relying on oracle-type results (see for example **[60]**, **[17]**) that ignore the first step model selection. These estimators behave roughly as expected by theory. In general, the estimators are competitive in terms of RMSE for all objects considered across all different designs. However, their bias also tends to be comparable to their standard deviation due to regularization and model selection mistakes. Oracle-style approximations do not account explicitly for this remaining bias due to regularization and as a result do not achieve correct coverage rates. We note that these distortions can be severe. Coverage for these procedures is generally far from the nominal 95%; and in some cases, the estimators have 0% coverage. We note that targeted undersmoothing is expressly designed to offer a generic approach to address the presence of this bias.
The ‘LCV’ estimator is similar to ‘Lasso’ and ‘PL’ in that it applies oracle-style inference after selecting a model from the data. The difference is that cross-validation tends to produce penalty parameters that are much smaller than the theoretically motivated values used in ‘Lasso’ and ‘PL’. This reduction in the penalty parameter allows extra variables to enter the model relative to the case where the larger penalty parameters are used. In this sense, such a procedure can also be thought of as an undersmoothing procedure, though the “undersmoothing” is targeted toward model fit. (Footnote 22: [27] demonstrates that cross-validation may produce estimates with slower than optimal convergence rates with models that are much too complex in the sense that .) In these simulations, we see that LCV tends to produce estimates of the regression coefficient and individual-specific treatment effect with bias similar to that obtained with lasso and PL, though LCV also tends to have a larger standard deviation than these estimators. The similar bias and larger standard deviation result in LCV tending to be outperformed in terms of RMSE for these objects but also result in better coverage properties of the LCV intervals than the lasso or PL intervals - though LCV coverage still tends to be far from the nominal level. (Footnote 23: Exceptions are coverage of the individual-specific treatment effect in Tables Sim1, Sim2, Sim4, Sim5.) For the profit differential, LCV is less biased than lasso in all cases and less biased than PL in four of six cases while generally having similar standard deviation. Thus, LCV is competitive in terms of RMSE for this object. However, sufficient bias remains for confidence intervals to remain substantively distorted, producing coverage probabilities for the profit differential that range between 0.63 and 0.90.
In many studies, the object of interest is an inherently low-dimensional parameter, such as a single regression coefficient or an average treatment effect, and semi-parametric estimation can be designed that specifically targets this low-dimensional parameter of interest. (Footnote 24: See, for example, [15], [52], [43], [51] for classic examples. [22] provide a recent treatment in a high-dimensional setting.) This approach is adopted in the high-dimensional linear model setting in **[13]**, **[49]** and **[56]** for estimating a single regression coefficient of interest. For regression coefficients, these procedures are root-n consistent and semi-parametrically efficient within the model considered in the simulation. They also theoretically deliver uniformly valid inference over large classes of models which include cases where perfect model selection is theoretically impossible. In terms of our simulations, this approach does relatively well in the case, delivering performance which is comparable to the infeasible oracle. However, in the and cases, the point estimator has a large bias which translates into relatively poor coverage properties. (Footnote 25: The behavior may be improved by considering double machine learning as defined in [22], which relies on weaker sparsity conditions than [13].) We note that targeted undersmoothing offers an approach to gauging the sensitivity of conclusions to model selection mistakes and could be applied directly to semiparametric targets using orthogonal estimating equations as in [13] or [22]. We do not pursue this direction further in this paper for brevity.
The ‘ZB’ method does not achieve 95% coverage for the regression coefficient in any of the simulation designs considered here (with coverage ranging from 73% to 83%). The ZB method gives better coverage probabilities for the individual treatment effect, with near or above 95% coverage in all simulation designs. The lengths of the ZB confidence intervals grow considerably with the underlying value of . For instance, in the case, the mean ZB interval length is 3.41 while the ‘True’ mean interval length is 0.88; in the case, the mean ZB interval length is 56.53 while the ‘True’ mean interval length is 1.41.
We now look at intervals constructed using the targeted undersmoothing approach. Note that we take the initial model to be that underlying PL in these simulations, and, for point estimation, one could use these PL point estimates. The point of targeted undersmoothing is to provide valid inferential statements allowing for model selection mistakes in producing this initial model and corresponding point estimates. An interesting feature of the presented simulations is that TU(1) achieves nearly correct coverage uniformly across the simulation designs - achieving higher than 90% coverage in every design. While not reported in the table, we also have that TU(2) achieves higher than 95% coverage in all cases. We do see the inherent conservativeness in sensitivity analysis considering a large class of models in that TU(10) uniformly has coverage greater than 95%, with coverage of 100% in most cases. Importantly, the good coverage properties are uniform across all designs and all parameters considered. Unsurprisingly, this robustness comes with a cost. As must be the case, the intervals produced by the targeted undersmoothing approach are relatively wide and become wider as one allows for more selection mistakes. However, the losses relative to the infeasible optimum are modest for small , and the intervals are still potentially informative even in the most extreme case we consider.
Overall, we believe these results are favorable to the targeted undersmoothing approach. Of the considered feasible alternatives, it is the only procedure that produces uniformly good coverage properties, at the cost of increased imprecision about what conclusions can be drawn from the data. This increase in imprecision seems honest as it reflects the potential for substantive biases resulting from model selection mistakes. The procedure is also anchored on initial point estimates that have relatively good properties for estimating the parameters of interest.
6 Conclusion
In this paper, we have considered post model selection inference for a large class of functionals of the underlying model. Our procedure provides valid confidence sets while handling the possibility that a misspecified model was selected. We show that these methods perform well in a simulation study. We illustrate their use in estimating the profit differential for a fixed coupon-mailing strategy and in estimating heterogeneous treatment effects in data from a job training experiment.
Appendix 1. Implementation Details
This appendix describes the model selection procedure implemented in several sections of the paper. Recall that the general model estimated is given by
[TABLE]
The procedure for selecting is as follows.
Algorithm A1. Initial model selection in heterogeneous effects linear model.
Step 1. Divide the sample into two sets: and .
Step 2. Within each sample, demean the observations.
Step 3. Using the demeaned observations, run the modified heteroskedastic lasso regression (described below in Algorithm A2) of on over subset and let be the set of covariates selected. Again using the demeaned observations, run the modified heteroskedastic lasso regression of on over subset and let be the set of covariates selected.
Step 4. The final model consists of the constant term, the main effect of , the components corresponding to covariate indexes in , and the interaction terms ( terms) corresponding to covariate indexes in .
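Steps 1 to 3 of Algorithm A1 can be sketched as below, under two loudly stated assumptions. First, the two sets are taken to be the control and treated observations, which the text suggests but this section does not state explicitly. Second, a plain lasso solved by coordinate descent is swapped in for the modified heteroskedastic lasso, with a single hypothetical penalty level `lam`. The names `lasso_cd` and `select_by_arm` are illustrative only, and Step 4 is omitted.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain lasso by coordinate descent: min 0.5*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # column is constant after demeaning; never selected
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def select_by_arm(y, d, X, lam):
    """Split by treatment indicator d, demean within arm, run lasso per arm."""
    selected = {}
    for arm in (0, 1):
        mask = d == arm
        Xa = X[mask] - X[mask].mean(axis=0)
        ya = y[mask] - y[mask].mean()
        beta = lasso_cd(Xa, ya, lam)
        selected[arm] = set(np.flatnonzero(beta).tolist())
    return selected

# Hypothetical data: covariate 0 matters for controls, covariate 1 for treated.
rng = np.random.default_rng(0)
n, p = 400, 20
X = rng.normal(size=(n, p))
d = np.repeat([0, 1], n // 2)
y = np.where(d == 0, 2.0 * X[:, 0], 2.0 * X[:, 1]) + rng.normal(size=n)
sel = select_by_arm(y, d, X, lam=60.0)
```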
Algorithm A2. Modified Heteroskedastic Lasso: Marginal Correlation-Based Initial Penalty Loadings. The modified heteroskedastic lasso is identical to **[9]** with a small modification. **[9]** relies on ‘initial penalty loadings,’ which require initial estimates of individual specific residuals. To obtain initial estimates of residuals, , we regress on the 5 covariates with the highest marginal correlation with and use the resulting residuals. This approach can be shown to be formally valid when the number of covariates with high marginal correlations to used is bounded by a constant which does not depend on . In contrast, note that **[9]** suggest . Finally, the penalty loadings are updated with one iteration as described in **[9]**.
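The initial-residual step of Algorithm A2 can be sketched as follows: pick the 5 covariates with the largest absolute marginal correlation with the outcome, regress the outcome on them, and keep the residuals. The penalty-loading formula in `initial_loadings` (square root of the sample mean of squared covariate times squared residual) is an assumption about the form used in [9] and should be checked against that reference. All function names are hypothetical.

```python
import numpy as np

def initial_residuals(y, X, n_top=5):
    """Residuals from OLS of y on the n_top covariates most correlated with y."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.zeros(X.shape[1])
    ok = denom > 0                       # skip constant columns
    corr[ok] = np.abs(Xc.T @ yc)[ok] / denom[ok]
    top = np.argsort(corr)[::-1][:n_top]
    Z = np.column_stack([np.ones(len(y)), X[:, top]])
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    return y - Z @ coef, top

def initial_loadings(X, resid):
    """Assumed loading form: sqrt of the mean of x_ij^2 * e_i^2 for each j."""
    return np.sqrt(np.mean((X ** 2) * (resid[:, None] ** 2), axis=0))

# Hypothetical data in which covariate 2 drives the outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 30))
y = 3.0 * X[:, 2] + rng.normal(size=300)
resid, top = initial_residuals(y, X)
loadings = initial_loadings(X, resid)
```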
References
- [1] Catalogs, after years of decline, are revamped for changing times. http://www.nytimes.com/2015/01/26/business/media/catalogs-after-years-of-decline-are-revamped-for-changing-times.html
- [2] The high costs of catalog retailing. http://www.adweek.com/news/advertising-branding/high-costs-catalog-retailing-101958
- [3] Alberto Abadie, Joshua Angrist, and Guido Imbens. Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70(1):91–117, 2002.
- [4] S. Athey, G. W. Imbens, and S. Wager. Approximate Residual Balancing: De-Biased Inference of Average Treatment Effects in High Dimensions. arXiv e-prints, April 2016.
- [5] S. Athey, J. Tibshirani, and S. Wager. Generalized Random Forests. arXiv e-prints, October 2016.
- [6] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
- [7] Susan Athey, Guido Imbens, Thai Pham, and Stefan Wager. Estimating average treatment effects: Supplementary analyses and remaining challenges. American Economic Review, 107(5):278–81, May 2017.
- [8] Susan Athey and Stefan Wager. Efficient policy learning. arXiv e-prints, February 2017.
