Scalable simultaneous inference in high-dimensional linear regression models
Tom Boot, Didier Nibbering

TL;DR
This paper introduces a computationally efficient method for simultaneous inference in high-dimensional linear regression models using the Moore-Penrose pseudoinverse, enabling scalable analysis with reliable coverage.
Contribution
It proposes a novel, fast inference method based on the Moore-Penrose pseudoinverse that is free of tuning parameters and effective under certain symmetry assumptions.
Findings
Achieves near-nominal coverage in simulations
Computational complexity is significantly reduced
Regularization can improve efficiency
Abstract
The computational complexity of simultaneous inference methods in high-dimensional linear regression models quickly increases with the number variables. This paper proposes a computationally efficient method based on the Moore-Penrose pseudoinverse. Under a symmetry assumption on the available regressors, the estimators are normally distributed and accompanied by a closed-form expression for the standard errors that is free of tuning parameters. We study the numerical performance in Monte Carlo experiments that mimic the size of modern applications for which existing methods are computationally infeasible. We find close to nominal coverage, even in settings where the imposed symmetry assumption does not hold. Regularization of the pseudoinverse via a ridge adjustment is shown to yield possible efficiency gains.
| 1: Independent predictors | ||
| 2-4: Equicorrelated predictors | ||
| 5-7: Autoregressive correlation | ||
| 8-10: Factor models | ||
| 11-13: Group structure | ||
| 14: Extreme correlation | ||
| MAE | CR | power | MAE | CR | power | MAE | CR | power | ||
|---|---|---|---|---|---|---|---|---|---|---|
| method | Equicorrelated with | |||||||||
| MPI | 3.66 | 0.94 | 0.19 | 2.11 | 0.94 | 0.43 | 1.39 | 0.94 | 0.79 | |
| 3.51 | 0.95 | 1.90 | 0.96 | 1.23 | 0.96 | |||||
| RID | 2.86 | 0.94 | 0.26 | 2.00 | 0.94 | 0.47 | 1.38 | 0.94 | 0.80 | |
| 2.60 | 0.96 | 1.76 | 0.96 | 1.21 | 0.96 | |||||
| GBRD | 2.68 | 0.93 | 0.28 | 1.96 | 0.93 | 0.48 | ||||
| 2.25 | 0.96 | 1.61 | 0.96 | |||||||
| JM | 2.99 | 0.31 | 0.39 | 2.80 | 0.25 | 0.51 | ||||
| 0.80 | 0.95 | 0.61 | 0.98 | |||||||
| Factor model with | ||||||||||
| MPI | 3.87 | 0.97 | 0.13 | 2.26 | 0.96 | 0.37 | 1.47 | 0.96 | 0.68 | |
| 3.76 | 0.98 | 2.10 | 0.97 | 1.37 | 0.97 | |||||
| RID | 2.57 | 0.94 | 0.27 | 1.98 | 0.94 | 0.44 | 1.44 | 0.95 | 0.70 | |
| 2.06 | 0.98 | 1.66 | 0.97 | 1.28 | 0.97 | |||||
| GBRD | 2.66 | 0.92 | 0.27 | 2.09 | 0.88 | 0.44 | ||||
| 1.99 | 0.98 | 1.45 | 0.97 | |||||||
| JM | 3.40 | 0.15 | 0.43 | 3.35 | 0.10 | 0.50 | ||||
| 0.36 | 0.96 | 0.33 | 0.98 | |||||||
| Autoregressive with | ||||||||||
| MPI | 0.64 | 0.97 | 0.65 | 0.34 | 0.95 | 0.98 | 0.21 | 0.91 | 1.00 | |
| 0.61 | 0.98 | 0.29 | 0.97 | 0.17 | 0.97 | |||||
| RID | 0.49 | 0.95 | 0.83 | 0.31 | 0.95 | 0.99 | 0.21 | 0.91 | 1.00 | |
| 0.43 | 0.98 | 0.27 | 0.97 | 0.17 | 0.97 | |||||
| GBRD | 0.47 | 0.95 | 0.86 | 0.31 | 0.95 | 0.98 | ||||
| 0.39 | 0.98 | 0.28 | 0.97 | |||||||
| JM | 0.48 | 0.67 | 0.96 | 0.33 | 0.67 | 1.00 | ||||
| 0.20 | 0.97 | 0.15 | 0.97 | |||||||
| (p,n)=(200,100) | (p,n)=(1000,200) | (p,n)=(10000,400) | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| MAE | CR | power | MAE | CR | power | MAE | CR | power | ||
| method | Independent | |||||||||
| MPI | 0.70 | 0.98 | 0.63 | 0.38 | 0.96 | 0.97 | 0.22 | 0.95 | 1.00 | |
| 0.64 | 0.99 | 0.30 | 0.99 | 0.18 | 0.98 | |||||
| RID | 0.62 | 0.97 | 0.77 | 0.37 | 0.95 | 0.97 | 0.22 | 0.95 | 1.00 | |
| 0.52 | 0.99 | 0.29 | 0.99 | 0.17 | 0.98 | |||||
| GBRD | 0.58 | 0.96 | 0.85 | 0.36 | 0.94 | 0.98 | ||||
| 0.43 | 0.99 | 0.26 | 0.99 | |||||||
| JM | 0.66 | 0.84 | 0.89 | 0.41 | 0.83 | 0.99 | ||||
| 0.33 | 0.98 | 0.21 | 0.99 | |||||||
| Equicorrelated with | ||||||||||
| MPI | 2.31 | 0.95 | 0.37 | 1.38 | 0.93 | 0.81 | 0.90 | 0.91 | 0.99 | |
| 2.11 | 0.97 | 1.15 | 0.97 | 0.74 | 0.96 | |||||
| RID | 1.94 | 0.93 | 0.51 | 1.34 | 0.92 | 0.84 | 0.90 | 0.91 | 0.99 | |
| 1.65 | 0.97 | 1.08 | 0.97 | 0.73 | 0.96 | |||||
| GBRD | 1.82 | 0.91 | 0.59 | 1.32 | 0.90 | 0.86 | ||||
| 1.37 | 0.97 | 0.98 | 0.97 | |||||||
| JM | 2.00 | 0.59 | 0.72 | 1.71 | 0.51 | 0.91 | ||||
| 0.82 | 0.95 | 0.57 | 0.97 | |||||||
| Equicorrelated with | ||||||||||
| MPI | 3.66 | 0.94 | 0.19 | 2.11 | 0.94 | 0.43 | 1.39 | 0.94 | 0.79 | |
| 3.51 | 0.95 | 1.90 | 0.96 | 1.23 | 0.96 | |||||
| RID | 2.86 | 0.94 | 0.26 | 2.00 | 0.94 | 0.47 | 1.38 | 0.94 | 0.80 | |
| 2.60 | 0.96 | 1.76 | 0.96 | 1.21 | 0.96 | |||||
| GBRD | 2.68 | 0.93 | 0.28 | 1.96 | 0.93 | 0.48 | ||||
| 2.25 | 0.96 | 1.61 | 0.96 | |||||||
| JM | 2.99 | 0.31 | 0.39 | 2.80 | 0.25 | 0.51 | ||||
| 0.80 | 0.95 | 0.61 | 0.98 | |||||||
| Equicorrelated with | ||||||||||
| MPI | 8.37 | 0.95 | 0.08 | 4.60 | 0.96 | 0.12 | 2.90 | 0.96 | 0.23 | |
| 8.28 | 0.96 | 4.53 | 0.96 | 2.92 | 0.96 | |||||
| RID | 5.70 | 0.95 | 0.09 | 4.07 | 0.96 | 0.12 | 2.83 | 0.96 | 0.23 | |
| 5.56 | 0.96 | 3.99 | 0.96 | 2.85 | 0.96 | |||||
| GBRD | 5.49 | 0.95 | 0.09 | 3.96 | 0.95 | 0.12 | ||||
| 5.28 | 0.96 | 3.85 | 0.96 | |||||||
| JM | 4.17 | 0.10 | 0.12 | 3.57 | 0.47 | 0.13 | ||||
| 0.80 | 0.96 | 1.28 | 0.97 | |||||||
| (p,n)=(200,100) | (p,n)=(1000,200) | (p,n)=(10000,400) | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| MAE | CR | power | MAE | CR | power | MAE | CR | power | ||
| method | Auto-regressive with | |||||||||
| MPI | 0.48 | 0.97 | 0.81 | 0.27 | 0.96 | 1.00 | 0.17 | 0.94 | 1.00 | |
| 0.45 | 0.98 | 0.24 | 0.98 | 0.15 | 0.97 | |||||
| RID | 0.40 | 0.96 | 0.91 | 0.26 | 0.95 | 1.00 | 0.17 | 0.94 | 1.00 | |
| 0.36 | 0.98 | 0.23 | 0.98 | 0.15 | 0.97 | |||||
| GBRD | 0.37 | 0.95 | 0.95 | 0.25 | 0.96 | 1.00 | ||||
| 0.30 | 0.98 | 0.21 | 0.98 | |||||||
| JM | 0.41 | 0.80 | 0.96 | 0.30 | 0.78 | 1.00 | ||||
| 0.21 | 0.97 | 0.15 | 0.98 | |||||||
| Auto-regressive with | ||||||||||
| MPI | 0.64 | 0.97 | 0.65 | 0.34 | 0.95 | 0.98 | 0.21 | 0.91 | 1.00 | |
| 0.61 | 0.98 | 0.29 | 0.97 | 0.17 | 0.97 | |||||
| RID | 0.49 | 0.95 | 0.83 | 0.31 | 0.95 | 0.99 | 0.21 | 0.91 | 1.00 | |
| 0.43 | 0.98 | 0.27 | 0.97 | 0.17 | 0.97 | |||||
| GBRD | 0.47 | 0.95 | 0.86 | 0.31 | 0.95 | 0.98 | ||||
| 0.39 | 0.98 | 0.28 | 0.97 | |||||||
| JM | 0.48 | 0.67 | 0.96 | 0.33 | 0.67 | 1.00 | ||||
| 0.20 | 0.97 | 0.15 | 0.97 | |||||||
| Auto-regressive with | ||||||||||
| MPI | 1.60 | 0.95 | 0.23 | 0.78 | 0.90 | 0.63 | 0.49 | 0.74 | 0.98 | |
| 1.60 | 0.96 | 0.65 | 0.96 | 0.27 | 0.96 | |||||
| RID | 0.99 | 0.89 | 0.49 | 0.68 | 0.83 | 0.79 | 0.48 | 0.72 | 0.99 | |
| 0.78 | 0.96 | 0.47 | 0.96 | 0.25 | 0.96 | |||||
| GBRD | 1.08 | 0.91 | 0.46 | 0.76 | 0.91 | 0.66 | ||||
| 0.89 | 0.97 | 0.64 | 0.97 | |||||||
| JM | 0.89 | 0.38 | 0.75 | 0.69 | 0.57 | 0.85 | ||||
| 0.23 | 0.97 | 0.27 | 0.96 | |||||||
| (p,n)=(200,100) | (p,n)=(1000,200) | (p,n)=(10000,400) | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| MAE | CR | power | MAE | CR | power | MAE | CR | power | ||
| method | Factor model with | |||||||||
| MPI | 2.14 | 0.97 | 0.40 | 1.25 | 0.95 | 0.79 | 0.88 | 0.93 | 0.95 | |
| 2.00 | 0.98 | 1.09 | 0.98 | 0.67 | 0.98 | |||||
| RID | 1.70 | 0.96 | 0.58 | 1.19 | 0.95 | 0.83 | 0.87 | 0.93 | 0.96 | |
| 1.44 | 0.98 | 0.99 | 0.98 | 0.66 | 0.98 | |||||
| GBRD | 1.70 | 0.91 | 0.63 | 1.29 | 0.87 | 0.84 | ||||
| 1.13 | 0.98 | 0.81 | 0.98 | |||||||
| JM | 2.47 | 0.34 | 0.68 | 2.09 | 0.30 | 0.85 | ||||
| 0.45 | 0.96 | 0.37 | 0.98 | |||||||
| Factor model with | ||||||||||
| MPI | 3.87 | 0.97 | 0.13 | 2.26 | 0.96 | 0.37 | 1.47 | 0.96 | 0.68 | |
| 3.76 | 0.98 | 2.10 | 0.97 | 1.37 | 0.97 | |||||
| RID | 2.57 | 0.94 | 0.27 | 1.98 | 0.94 | 0.44 | 1.44 | 0.95 | 0.70 | |
| 2.06 | 0.98 | 1.66 | 0.97 | 1.28 | 0.97 | |||||
| GBRD | 2.66 | 0.92 | 0.27 | 2.09 | 0.88 | 0.44 | ||||
| 1.99 | 0.98 | 1.45 | 0.97 | |||||||
| JM | 3.40 | 0.15 | 0.43 | 3.35 | 0.10 | 0.50 | ||||
| 0.36 | 0.96 | 0.33 | 0.98 | |||||||
| Factor model with | ||||||||||
| MPI | 5.08 | 0.98 | 0.06 | 2.97 | 0.97 | 0.18 | 1.99 | 0.96 | 0.43 | |
| 4.96 | 0.98 | 2.89 | 0.97 | 1.89 | 0.97 | |||||
| RID | 2.68 | 0.91 | 0.23 | 2.41 | 0.92 | 0.28 | 1.89 | 0.93 | 0.47 | |
| 1.85 | 0.98 | 1.84 | 0.97 | 1.59 | 0.97 | |||||
| GBRD | 3.00 | 0.95 | 0.16 | 2.52 | 0.91 | 0.26 | ||||
| 2.43 | 0.98 | 1.88 | 0.98 | |||||||
| JM | 3.13 | 0.22 | 0.49 | 3.61 | 0.07 | 0.44 | ||||
| 0.38 | 0.96 | 0.20 | 0.99 | |||||||
| (p,n)=(200,100) | (p,n)=(1000,200) | (p,n)=(10000,400) | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| MAE | CR | power | MAE | CR | power | MAE | CR | power | ||
| method | Group structure with | |||||||||
| MPI | 4.56 | 0.85 | 0.23 | 3.28 | 0.76 | 0.51 | 2.92 | 0.70 | 0.92 | |
| 2.85 | 0.98 | 1.50 | 0.98 | 0.98 | 0.97 | |||||
| RID | 3.78 | 0.81 | 0.27 | 3.22 | 0.75 | 0.54 | 2.91 | 0.70 | 0.92 | |
| 2.24 | 0.98 | 1.43 | 0.98 | 0.97 | 0.97 | |||||
| GBRD | 6.20 | 0.90 | 0.14 | 5.65 | 0.88 | 0.18 | ||||
| 1.86 | 0.98 | 1.34 | 0.98 | |||||||
| JM | 3.00 | 0.79 | 0.27 | 3.72 | 0.79 | 0.26 | ||||
| 1.44 | 0.98 | 1.07 | 0.98 | |||||||
| Group structure with | ||||||||||
| MPI | 4.73 | 0.90 | 0.18 | 3.14 | 0.73 | 0.52 | 2.80 | 0.62 | 0.92 | |
| 2.76 | 0.98 | 1.46 | 0.98 | 0.96 | 0.97 | |||||
| RID | 3.75 | 0.84 | 0.27 | 3.08 | 0.72 | 0.55 | 2.80 | 0.62 | 0.92 | |
| 2.18 | 0.98 | 1.40 | 0.98 | 0.95 | 0.97 | |||||
| GBRD | 5.11 | 0.94 | 0.12 | 4.16 | 0.90 | 0.19 | ||||
| 1.81 | 0.98 | 1.31 | 0.98 | |||||||
| JM | 2.75 | 0.77 | 0.33 | 3.15 | 0.79 | 0.33 | ||||
| 1.41 | 0.98 | 1.05 | 0.98 | |||||||
| Group structure with | ||||||||||
| MPI | 4.55 | 0.92 | 0.16 | 2.99 | 0.74 | 0.53 | 2.69 | 0.59 | 0.92 | |
| 2.68 | 0.98 | 1.43 | 0.98 | 0.94 | 0.97 | |||||
| RID | 3.63 | 0.86 | 0.26 | 2.95 | 0.73 | 0.57 | 2.69 | 0.59 | 0.93 | |
| 2.13 | 0.98 | 1.37 | 0.98 | 0.93 | 0.97 | |||||
| GBRD | 4.28 | 0.94 | 0.13 | 3.42 | 0.91 | 0.21 | ||||
| 1.77 | 0.98 | 1.28 | 0.98 | |||||||
| JM | 2.51 | 0.77 | 0.39 | 2.64 | 0.75 | 0.43 | ||||
| 1.37 | 0.98 | 1.03 | 0.98 | |||||||
| Extreme correlation | ||||||||||
| MPI | 3.05 | 0.85 | 0.19 | 2.66 | 0.36 | 0.41 | 2.62 | 0.33 | 0.53 | |
| 3.06 | 0.98 | 1.63 | 0.98 | 1.06 | 0.98 | |||||
| RID | 2.82 | 0.42 | 0.31 | 2.76 | 0.33 | 0.40 | 2.69 | 0.31 | 0.52 | |
| 2.09 | 0.97 | 1.46 | 0.98 | 1.04 | 0.98 | |||||
| GBRD | 4.03 | 0.69 | 0.16 | 3.75 | 0.75 | 0.19 | ||||
| 1.91 | 0.98 | 1.36 | 0.98 | |||||||
| JM | 3.62 | 0.17 | 0.34 | 3.48 | 0.29 | 0.37 | ||||
| 0.35 | 0.95 | 0.48 | 0.98 | |||||||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods and Inference · Gaussian Processes and Bayesian Inference · Control Systems and Identification
Scalable simultaneous inference in high-dimensional linear regression models
Tom Boot
University of Groningen Department of Economics, Econometrics and Finance, University of Groningen, Nettelbosje 2, 9747 AE Groningen, The Netherlands, e-mail: [email protected]
Didier Nibbering
Monash University Department of Econometrics & Business Statistics, Monash University, Clayton VIC 3800, Australia, e-mail: [email protected]
We would like to thank Paul Bekker, Patrick Groenen, Trevor Hastie, Christiaan Heij, Richard Paap, Andreas Pick, and participants of the workshop on statistical learning and econometrics at the Erasmus University Rotterdam for helpful comments.
Abstract
The computational complexity of simultaneous inference methods in high-dimensional linear regression models quickly increases with the number variables. This paper proposes a computationally efficient method based on the Moore-Penrose pseudoinverse. Under a symmetry assumption on the available regressors, the estimators are normally distributed and accompanied by a closed-form expression for the standard errors that is free of tuning parameters. We study the numerical performance in Monte Carlo experiments that mimic the size of modern applications for which existing methods are computationally infeasible. We find close to nominal coverage, even in settings where the imposed symmetry assumption does not hold. Regularization of the pseudoinverse via a ridge adjustment is shown to yield possible efficiency gains.
Keywords: high-dimensional regression, confidence intervals, Moore-Penrose pseudoinverse, ridge regression
1 Introduction
Modern data sets require inferential procedures that are valid in high-dimensional settings where the number of variables exceeds the sample size . Typical examples occur in genome wide association studies, such as Sesia et al., (2018) where and , or Rietveld et al., (2013) where and .
Debiasing, or desparsification, strategies allow simultaneous inference on full parameters vector in high-dimensional linear regression models.Debiasing relies on an approximate inverse of the singular high-dimensional covariance matrix of the regressors, and the use of the lasso to correct for the bias resulting from this approximation. The remaining bias, generally factorized in a part due to the approximate inverse and one due to the lasso estimation error, is then negligible compared to the variance of the estimator.
One downside of current debiasing strategies is that the computational burden increases rapidly with the number of included variables, as the approximate inverse is constructed by solving a series of optimization problems. Javanmard and Montanari, (2014) rely on direct numerical optimization for each variable to find an approximate inverse, while Van de Geer et al., (2014); Zhang and Zhang, (2014); Ning and Liu, (2017); Dezeure et al., (2017) solve lasso problems to estimate the inverse covariance matrix. Even with modern computing power, this puts a strain on computational resources when is large. As a result, existing simulation studies are typically limited to settings where . This drives a wedge between the scale demanded by modern applications and the scale at which the methods can be tested.
This paper proposes a scalable debiasing method that allows for simultaneous inference in high-dimensional linear regression models. The required approximate inverse is constructed by a scaled Moore-Penrose pseudoinverse, where the scaling ensures that the approximation bias becomes negligible. Replacing the sequence of optimization problems with a single pseudoinverse lowers the computational complexity with several orders of magnitude in the number of variables. Moreover, standard errors are available in closed form and free of tuning parameters.
The confidence intervals of the Moore-Penrose pseudoinverse estimator are shown to be valid under a symmetry assumption on the variables, which for example allows variables to be independent, equicorrelated, or to follow certain factor structures. We show that a ridge adjustment of the Moore-Penrose pseudoinverse estimator can lead to a power gain.
We carry out a simulation study where we consider dimensions and for which existing methods have substantial computational costs. We find that coverage is close to nominal for settings where the variables satisfy the symmetry assumption. The proposed ridge adjustment ensures that the method is competitive to existing alternatives in terms of power. We find undercoverage on the set of nonzero coefficients for some, but not all, of the settings that violate the symmetry assumption.
The proposed estimator is related to the corrected ridge estimator of Bühlmann, (2013), who finds that under a fixed design an additional bias correction is required that leads to conservative inference. Our estimator differs in the scaling factor that renders the approximation bias due to the diagonal elements of the approximate inverse exactly equal to zero. Under a random design and a slightly stronger sparsity assumption, we show that the additional bias correction is no longer required.
The outline of the paper is as follows. Section 2 introduces the estimation approach and the proposed estimators. The theoretical properties of the Moore-Penrose pseudoinverse and ridge regression are presented in Section 3. Section 4 illustrates these results through Monte Carlo simulations. Section 5 concludes.
**Notation ** We use the following notation throughout the paper: For any vector , the -norm is defined as for and denotes the number of nonzero elements of . The maximum norm is written as . For a matrix , the -norm is defined as and the maximum norm is written as . When is , we write and for the minimum and maximum eigenvalues of . The identity matrix is denoted by . The vector has its -th entry equal to 1 and zeros everywhere else. For the regressor matrix , we index the rows with the subscript and the columns with the subscript . If is a orthogonal matrix, we write . When two random variables and follow the same distribution, this is denoted as .
2 High-dimensional linear regression
Consider the high-dimensional linear model
[TABLE]
where is an response vector, an regressor matrix, a vector of unknown regressor coefficients, and an vector of errors which are and independent of .
We develop a computationally efficient debiased estimator for when , with a closed-form expression for the covariance matrix of the estimator. Standard errors and confidence intervals for are free of tuning parameters, closed-form, and scalable to a very large number of variables. Below we develop the estimator. Section 2.3 provides an algorithmic overview of the proposed inference method.
2.1 A debiased estimator
We consider estimators for of the form
[TABLE]
where is a matrix. The second term of (2) represents a bias which depends on the choice of . When and , ordinary least squares yields unbiased estimators by choosing . When , the matrix is singular, and any choice of will induce bias.
Given an initial estimator , we can reduce the bias in (2) by applying the correction
[TABLE]
For the initial estimator we use the lasso estimator of Tibshirani, (1996) defined by
[TABLE]
This leads to an interpretation of as a ‘desparsified’ version of the lasso estimator (Van de Geer et al.,, 2014), since we can write .
2.2 The approximate inverse
The goal of this paper is to introduce choices of that are computationally efficient and for which the bias of the corrected estimator is small. For the bias to be small, has to be close to the identity matrix . Hence, we refer to as an approximate inverse for .
We ensure that the diagonal terms of are identically equal to zero by introducing a diagonal matrix , with diagonal elements , and taking
[TABLE]
The effectiveness of as an approximate inverse is then determined by the magnitude of the off-diagonal elements of .
2.2.1 Moore-Penrose pseudoinverse
As a tuning parameter free choice for in (5) we consider the Moore-Penrose pseudoinverse (MPI) given by . That is,
[TABLE]
The diagonal elements of the diagonal scaling matrix equal
[TABLE]
This provides a closed-form expression for the approximate inverse. Since the bias term of the estimator is of lower order compared to the variance, the covariance of is available in closed form as well,
[TABLE]
2.2.2 Ridge regularization
We also consider the use of a ridge adjustment (RID), which in finite samples can lead to a power gain. Define
[TABLE]
where denotes the ridge penalty and the elements of the diagonal scaling matrix equal
[TABLE]
The ridge-adjustment is the natural regularization procedure as we can define the Moore-Penrose pseudoinverse as
[TABLE]
see e.g. Albert, (1972).
2.2.3 Alternative specifications
Zhang and Zhang, (2014) develop the estimator in (3) with , where is an approximate inverse to the empirical covariance matrix . This inverse is found by a series of lasso regressions. Van de Geer et al., (2014) generalize this method to deal with nonlinear models and prove semiparametric optimality. Javanmard and Montanari, (2014) use direct numerical optimization to obtain .
The ridge adjusted estimator, but with a different scaling factor, is studied by Bühlmann, (2013) for a fixed design and a less restrictive sparsity assumption then considered in Section 3. In this case an additional bias correction is necessary that results in conservative inference.
2.3 Inference
With a specification for the approximate inverse , confidence intervals can be constructed for the -th element of in (1) as
[TABLE]
where is the -th element of in (3), is the critical value for the standard normal distribution, and is the standard error of .
The covariance matrix of the estimator is equal to . The noise level is estimated with the scaled lasso estimator of Sun and Zhang, (2012), defined as
[TABLE]
with . This scaled lasso estimator is commonly used for noise estimation in high-dimensional regression settings.
Algorithm 1 outlines the proposed inference method. The algorithm requires a value for the ridge penalty. The numerical results in Section 4 indicate that the ridge adjustment only has an effect when and are of the same order of magnitude. In this case, following our theoretical results, setting improves the accuracy compared to the MPI estimator, while keeping coverage at the nominal level.
Algorithm 1 uses the lasso as the initial estimator and the scaled lasso for noise estimation. However, alternative initial estimators, such as proposed by Caner and Kock, (2014), can also be used as long as they satisfy a sufficiently tight accuracy bound on . Similarly, any consistent estimator for can be used for noise estimation. Instead of the scaled lasso, Reid et al., (2016) estimate the error variance from the lasso residuals with a degree of freedom correction. Yu and Bien, (2019) take the minimizing value of the lasso regression problem.
2.4 Computational complexity
Table 1 shows that the computational complexity of the proposed methods is several orders of magnitude in the number of variables lower compared to existing methods.
For the Moore-Penrose pseudoinverse, the leading order term in the computation complexity is the matrix multiplication , which costs . The same holds for the ridge regularized version.
The method of Van de Geer et al., (2014) solves a penalized regression problem with explanatory variables for every column of the empirical covariance matrix. A fast solver is provided by the lars algorithm of Efron et al., (2004), which has a complexity of for each column. Therefore, the complexity for obtaining the approximate inverse covariance matrix equals in total. Javanmard and Montanari, (2014) state that this is equivalent to the complexity of their method.
All methods use an initial estimator, which is provided by the lasso. With , this costs at a fixed value of the penalization parameter. For the initial estimator we apply -fold cross-validation, and , such that the computational complexity of all methods is at least . However, for none of the methods this is the leading order term.
3 Theoretical results
3.1 Assumptions
We make the following assumptions on the expansion rates of and , the regressor matrix and the errors .
Assumption 1**.**
The dimensions and are such that for some sufficiently large constant and the constant . The sparsity satisfies .
Assumption 2a**.**
Let be an matrix with independent subgaussian rows , where has independent subgaussian elements. The matrix satisfies , and . Assume .
Assumption 2b**.**
Let be an matrix with independent rows and identically subgaussian distributed columns. Let be an matrix with identically subgaussian distributed columns, and has finite eigenvalues bounded away from zero by a constant. Assume the factor structure , with fixed, the matrices , and independent, and as defined in 2a with the additional assumption that .
Assumption 3**.**
The rows of are spherically symmetric: The rows of satisfy , and in case of a factor structure the columns of also satisfy , for any fixed .
Assumption 4**.**
The errors and is independent of .
1 imposes a condition on the sample size relative to the number of variables. We operate under the assumption that , but we assume that is sufficiently larger than . The sparsity constraint restricts the number of non-zero coefficients in by . The assumption is standard for inference in high-dimensional regression models (Van de Geer et al.,, 2014; Javanmard and Montanari,, 2014), but slightly stronger than the required for lasso consistency Van de Geer, (2008). This discrepancy has been studied in more detail in Javanmard and Montanari, (2018). Since our results only depend on the norm of the lasso estimation error, the sparsity assumption can be relaxed to allow for approximate sparsity (Chernozhukov et al.,, 2015).
2a is taken from Theorem 2.4 in Javanmard and Montanari, (2014). This assumption is necessary to bound the bias of the initial estimator. It allows for a random with subgaussian independent rows, but the subgaussian assumption rules out a factor structure for , which is allowed by 2b.
3 is necessary to bound the bias of the approximate inverse. It is for example satisfied when the columns of have an equicorrelated covariance matrix, but also covers classes of factor models. Fan and Lv, (2008) use this assumption to derive accuracy bounds on the Moore-Penrose pseudoinverse in high-dimensional variable screening. These results depend on the behavior of the right singular vectors under spherical symmetry. In a slightly different context, Shah et al., (2020) show that under spherical symmetry, the right singular vectors can be used to estimate high-dimensional covariance matrices that contain a factor component. We find in Section 4 that our methods continue to work well in a range of settings in which the spherical symmetry assumption is violated, for example when the columns of have a covariance matrix with a Toeplitz structure.
Finally, 4 ensures that the estimated parameters are normally distributed. This assumption can be relaxed as shown by Van de Geer et al., (2014), in which case valid inference can be achieved for a fixed number of coefficients.
3.2 Theorems
The following theorem presents the main result of this paper.
Theorem 1**.**
Suppose 1 – 4 hold. Let , with as defined in (6) and as in (7). Then,
[TABLE]
where and .
The proof is given in Section B.1. Theorem 1 shows that the estimator in (3) has a lower order bias compared to the variance. Standard errors decrease at the usual rate. Theorem 1 allows for the construction of confidence intervals that are uniformly valid when is finite. Uniformity is guaranteed since the bound on the lasso estimator given in Lemma B.2 holds uniformly over all sets of size , see Van de Geer et al., (2014) for a discussion.
Since the resulting covariance matrix of the estimator is available in closed form, efficient multiple testing procedures as in Bühlmann, (2013) can be employed, together with joint tests on estimated coefficients.
Theorem 2 shows that for an appropriate choice of the regularization parameter, the ridge adjusted estimator can also be used for inference.
Theorem 2**.**
Suppose 1 – 4 hold. Let , with as in (9) and as in (10). Suppose . Then,
[TABLE]
where and .
The proof is given in Section B.2.
The reason one would opt for the regularized variants despite the additional tuning parameter is provided by the following theorem.
Theorem 3**.**
Suppose 1 – 4 hold. Denote the variance of the estimator under a diagonal scaling matrix by when is as in (6) and when is as in (9). For the choice of as in Theorem 2, we have
[TABLE]
The proof is given in Section B.3. The proof shows that the difference between variances of the RID and MPI estimators increases in . Since the proof of Theorem 2 in Section B.2 suggests that can attain a higher value when is close to , we expect a larger potential efficiency gain of ridge regularization in these settings.
Note that Theorem 3 requires the regularized estimator and the estimator based on the Moore-Penrose pseudoinverse to use the same diagonal scaling matrix. Using from (7) for the Moore-Penrose inverse and from (10) for the ridge regularized inverse, does not yield an ordering in terms of power. However, in all cases we have encountered, the inequality in Theorem 3 is satisfied when using the diagonal matrix specific to the estimator under consideration. This is also evident from the Monte Carlo results in Section 4.
4 Monte Carlo Experiments
This section examines the finite sample behaviour of the proposed estimators in a Monte Carlo experiment. The performance of the estimators is compared with Van de Geer et al., (2014) (GBRD) and Javanmard and Montanari, (2014) (JM), two existing methods for constructing confidence intervals in high-dimensional regression for all coefficients. All estimators use the lasso as initial estimator with a penalty term that minimizes the mean squared error under tenfold cross-validation. The series of lasso regressions in GBRD also use tenfold cross-validation, and we set the tuning parameter in JM, which is equal to the value used in their simulation studies.
The estimators are evaluated on mean absolute bias (MAE), coverage rate, power, and family-wise error rate (FWER). For each Monte Carlo replication, we calculate
[TABLE]
where MAE and coverage are similarly defined on the set of zero coefficients , and for . We report the average MAE, coverage and power over the Monte Carlo replications.
4.1 Monte Carlo design
The data generating process takes the form
[TABLE]
where is an regressor matrix, and a vector of coefficients. Table 2 shows the 14 simulation designs with different specifications for and , as used in Wang and Leng, (2015).
For each simulation design, 1000 data sets are generated with the number of predictors and sample size equal to , , and . The error variance is set to the value that satisfies . Note that due to the computational costs of GBRD and JAM, we only include MPI and RID in the Monte Carlo experiments with .
4.2 Results
The results of the Monte Carlo experiments confirm the theory from Section 3. Figure 1 shows the coverage rates of MPI and GBRD for non-zero () and zero () coefficients. The coverage of MPI on and is close to nominal for all independent, equicorrelated, and factor model designs. These simulation designs satisfy the spherical symmetry assumption on the regressor matrix. For the factor model, GBRD attains a slightly lower coverage on the nonzero coefficients, and this effect increases when moving from to .
Some, but not all, of the simulation designs that violate the symmetry assumption report undercoverage for both MPI and GBRD. Figure 1 shows undercoverage on for autoregressive regressors with correlation parameter , which increases with the dimensionality of the data. For both the group structure and the extreme dependence simulation designs, the undercoverage on is quite severe.
Table 3 reports the average MAE, coverage and power for experiments with equicorrelated predictors with , autoregressive correlation with , and a factor model with . The results for the remaining designs are presented in Appendix D. All results corroborate the undercoverage of JM on found by Dezeure et al., (2015).
Since the equicorrelated and factor design are spherically symmetric, the coverage rates of MPI and RID are robust to the choice of and . For the autoregressive model structure, we find a drop in coverage on when moving to .
For all simulation designs with and , the noise in the pseudoinverse translates in high MAE for MPI. This can be largely remedied by incorporating the ridge adjustment. In terms of inference, this has the benefit that the coverage rate is not distorted, while power increases. For the two higher-dimensional settings, the findings for MPI and RID are quite similar.
When the number of variables is of the same order as the number of observations, the discussion below Theorem 3 suggests that regularization may improve the efficiency of the confidence intervals. Figure 2 shows histograms of the FWER across experiments for MPI and RID under an equicorrelated design with . In the case with , the RID rejection rates are indeed more tightly concentrated around the nominal rate of 5% than the MPI rejection rates. For and , the number of variables is an order of magnitude larger than the number of observations and the benefit of regularization disappears.
Figure 3 shows the logarithm of the computation times of MPI, GBRD, and JM for each simulation design. MPI demands computation time that is negligible compared to the time needed by GBRD and JM with and . Due to the computational costs of GBRD and JM, they are not included in the experiments with . We find that MPI is still computationally feasible in this setting, making them scalable to ultra high-dimensional data.
5 Conclusion
This paper proposes computationally efficient methods for constructing confidence intervals in high-dimensional linear regression models. We employ a debiasing strategy where the Moore-Penrose pseudoinverse is used as an approximate inverse of the singular empirical covariance matrix of the regressors. This strategy is shown to be valid under a symmetry assumption on the regressors. The covariance matrix of the estimates is available in closed form and free of tuning parameters. Confidence intervals can then be constructed using standard procedures. A ridge-based regularization can yield power improvements in finite samples.
We consider large scale Monte Carlo experiments that show that the proposed estimators provide valid confidence intervals with correct coverage rates, even when the symmetry assumption on the regressors does not hold.
Appendix A Preliminary results
Throughout, and denote generic positive constants that can differ between lines. We abbreviate as .
A.1 Properties of spherical distributions
Decompose by a singular value decomposition as
[TABLE]
where , the matrix of singular values, and . Under 3, is invariant under right multiplication with an orthogonal matrix, and therefore is uniformly distributed on . When ,
[TABLE]
where is an matrix with the non-zero singular values on its diagonal, and is a matrix that satisfies . Since is uniformly distributed over , is uniformly distributed over the Stiefel manifold , defined as . See for example Chikuse, (1990).
We will use the following results.
Lemma A.1** (Lemma 5.3.2 of Vershynin, (2018)).**
Let be a fixed vector, and a matrix that is distributed uniformly over . Let , then with probability at least , we have that
[TABLE]
Lemma A.2** (Fan and Lv, (2008)).**
For any , there exists an such that with probability at most , we have that
[TABLE]
Proof: Lemma 5 of Fan and Lv, (2008) shows that for any , there exists an , such that
[TABLE]
where independent of . For random variables and and a constant , we have . Set , and . Using (A.5) and the standard tail bound bound , we have
[TABLE]
Lemma A.3**.**
Let . Under 2b there exists constants , such that with probability .
Proof: Notice that
[TABLE]
We first bound (I). By 2a, is a subgaussian vector with independent elements. Under this assumption, is subexponential with mean zero. From 2a, we also have that . Pick , such that . Then,
[TABLE]
Continuing with (II), we first condition on . From 2b, is a subgaussian random variable, and by independence between and , where and , is subexponential with mean zero. Then,
[TABLE]
Taking the expectation over , using Jensen’s inequality and the fact that ,
[TABLE]
Since with finite, we then also have that
[TABLE]
What remains to be shown is that there exists a such that
[TABLE]
For this it is sufficient to show that
[TABLE]
where has rows . Starting with the first, notice that
[TABLE]
Conditioning on and using that the elements of are independent subgaussian,
[TABLE]
Taking the expectation over , using Jensen’s inequality and the fact that , we then find that for fixed , there exists a constant such that
[TABLE]
By the same arguments as in Section 6.2.2 of Javanmard and Montanari, (2014), we also have that there exists a constant , such that
[TABLE]
This completes the proof.
Lemma A.4**.**
Suppose 1 and 2b hold. Denote by the smallest nonzero eigenvalue of . Then for some constant , with probability .
To determine the minimum nonzero eigenvalue, define such that and write
[TABLE]
The second to last line holds uniformly over with probability by (A.11) replacing by and noting in applying Jensen’s inequality that has subgaussian columns by 2b. This shows that . We will now show that the minimum nonzero eigenvalue of is bounded away from zero by a constant. This implies that for sufficiently larger than , Lemma A.4 holds.
Since , the minimum nonzero eigenvalue of is equal to
[TABLE]
with defined in 2a and the last line follows from 2a. Since is an matrix with independent subgaussian elements, by Vershynin, (2010) Theorem 5.39, with probability .
We now have shown that . From 1, we can assume that for some positive constant . Then with probability , .
Appendix B Proofs of the main theory
B.1 Proof of Theorem 1
Rewrite the estimator in (3) as
[TABLE]
The bias term of the estimator in (B.1) is bounded by the norm inequality,
[TABLE]
We then have that
[TABLE]
Lemma B.1 provides a probability bound for , and Lemma B.2 provides a probability bound for .
Lemma B.1**.**
Suppose 1 and 3 holds. Define as in (6) and as in (7). Then, there exists an such that
[TABLE]
Proof: Section C.1.
Lemma B.2**.**
Suppose 1, 4 and either 2a or 2b hold. Consider the lasso estimator (4) with for some sufficiently large constant ,
[TABLE]
Proof: Section C.2.
Combining 1, Lemma B.1, and Lemma B.2, it follows that the bias can be bounded by
[TABLE]
From 4 and (B.1) it follows that with . The following lemma shows that , which completes the proof of Theorem 1.
Lemma B.3**.**
Suppose 1 – 4 hold. Define and with as before. For ,
[TABLE]
Proof: Section C.3.
B.2 Proof of Theorem 2
We show that for a sufficiently small penalty parameter , the results under a Moore-Penrose inverse carry over to a ridge adjusted estimator. We adjust Lemma B.1 and Lemma B.3 to apply to the ridge adjusted estimator. Theorem 2 then follows.
Lemma B.4**.**
Suppose 1 and 3 holds. Define and as in (9) and (10). Choose . Then, there exists an such that
[TABLE]
Proof: Section C.4.
Lemma B.5**.**
Suppose 1 – 4 hold. Define and with as before. For ,
[TABLE]
Proof: Section C.5.
B.3 Proof of Theorem 3
The relative efficiency is determined by the difference
[TABLE]
with as in Theorem 1 and as in Theorem 2, but both estimators are assumed to either use as in (7) or as in (10). Define the matrix as a diagonal matrix with containing the eigenvalues of and by the Moore-Penrose inverse of . For the ridge regularized inverse we have
[TABLE]
where is a diagonal matrix with the diagonal elements satisfying .
For the Moore-Penrose pseudoinverse,
[TABLE]
where is a diagonal matrix with the diagonal elements .
Appendix C Proofs of the lemmas
C.1 Proof of Lemma B.1
As the diagonal elements of are identically zero, we need to establish a probability bound on the off-diagonal elements of . These elements are of the form . Fix . Then,
[TABLE]
where we use Lemma A.1 and Lemma A.2.
Taking the union bound over all , we get
[TABLE]
Now choose . Suppose . According to 1, . Then,
[TABLE]
C.2 Proof of Lemma B.2
For the bound in Lemma B.2 we first show that with high probability. Denoting by the index set of nonzero coefficients, this means that with high probability the lasso estimator satisfies by Lemma 6.3 of Bühlmann and Van De Geer, (2011).
We start by decomposing
[TABLE]
First consider (II). Denote by a standard normal random variable and . By 4, , and (II) is distributed as
[TABLE]
This term can be bounded in probability as
[TABLE]
where we first use Lemma A.3, and subsequently the tail bound for a standard normal random variable.
Choosing , the first term on the left hand-side is , while the second term is when is sufficiently larger than (1). We conclude that,
[TABLE]
For (I), notice that
[TABLE]
Define and note again that is distributed as . Conditional on , we can use the same arguments as before to show that
[TABLE]
Taking now the expectation of and using Jensen’s inequality, we obtain
[TABLE]
We conclude that for sufficiently large and , we have that
[TABLE]
From these results, we conclude that there exists an , such that with high probability
[TABLE]
Choosing then the lasso penalty parameter as for some sufficiently large constant , the results in Bühlmann and Van De Geer, (2011) show that with high probability .
The standard norm accuracy bounds for the lasso follow if the following restricted eigenvalue condition holds (Bickel et al.,, 2009).
Definition 1**.**
The restricted eigenvalue condition holds if
[TABLE]
for all for which and .
We start with the following lower bound.
[TABLE]
where has rows defined in 2a, , and denotes the th column of . (I) satisfies Definition 1 with probability exceeding under 2a by the proof in Section 6.2.1 of Javanmard and Montanari, (2014). For (II), since and have identically distributed columns,
[TABLE]
Taking the expectation over , using Jensen’s inequality and the fact that has subgaussian columns by 2b, we have that, uniformly over ,
[TABLE]
Since under 2b, is finite, we conclude that is with probability .
C.3 Proof of Lemma B.3
Normality follows from 4. What remains is to show that .
[TABLE]
By Cauchy-Schwarz,
[TABLE]
It follows that
[TABLE]
By Lemma A.3, with probability at least , we have . Taking a union bound over , we find that for sufficiently larger than , for some positive constant with probability .
An upper bound follows similarly. By Lemma A.4 we have that with probability at least ,
[TABLE]
Then, by Lemma A.1, with probability at least , . By 1 is sufficiently larger than , so applying a union bound over gives for some positive constant with probability . We conclude that is for .
C.4 Proof of Lemma B.4
Consider the -th element of ,
[TABLE]
By substituting , where , , , and we have
[TABLE]
where , , and , for .
Now,
[TABLE]
with the last line holding with probability by Lemma A.4. Choosing , we have that with probability ,
[TABLE]
Then, by Cauchy-Schwarz, with probability . In that case, with probability , and Lemma B.4 then follows from Lemma B.1.
C.5 Proof of Lemma B.5
Normality follows from 4. What remains is to show that .
[TABLE]
First the lower bound. By Cauchy-Schwarz
[TABLE]
We therefore have
[TABLE]
and the lower bound follows from Lemma A.3.
We now continue with an upper bound. Define as a diagonal matrix with . Note that by Lemma A.4, with probability . The numerator of (C.21) is then equal to
[TABLE]
with probability .
From this we have that with probability ,
[TABLE]
where we used that and
[TABLE]
with the last inequality following from 1. Under 3, we can now apply Lemma A.1 and find that with probability ,
[TABLE]
The proof now follows from a union bound.
Appendix D Results Monte Carlo experiments
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Albert, (1972) Albert, A. (1972). Regression and the Moore-Penrose pseudoinverse . Elsevier.
- 2Bickel et al., (2009) Bickel, P., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of lasso and dantzig selector. Annals of Statistics , 37:1705–1732.
- 3Bühlmann, (2013) Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli , 19(4):1212–1242.
- 4Bühlmann and Van De Geer, (2011) Bühlmann, P. and Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications . Springer Science & Business Media.
- 5Caner and Kock, (2014) Caner, M. and Kock, A. B. (2014). Asymptotically honest confidence regions for high dimensional parameters by the desparsified conservative lasso. ar Xiv preprint ar Xiv:1410.4208 .
- 6Chernozhukov et al., (2015) Chernozhukov, V., Hansen, C., and Spindler, M. (2015). Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics , 7(1):649–688.
- 7Chikuse, (1990) Chikuse, Y. (1990). The matrix angular central Gaussian distribution. Journal of Multivariate Analysis , 33(2):265–274.
- 8Dezeure et al., (2015) Dezeure, R., Bühlmann, P., Meier, L., and Meinshausen, N. (2015). High-dimensional inference: Confidence intervals, p-values and r-software hdi. Statistical Science , pages 533–558.
