Estimation of variance components, heritability and the ridge penalty in high-dimensional generalized linear models
Jurre R. Veerman, Gwenael G.R. Leday, Mark A. van de Wiel

TL;DR
This paper compares estimators of variance components and heritability in high-dimensional generalized linear models, introducing efficient methods and demonstrating their superior accuracy over existing approaches.
Contribution
It introduces a novel maximum marginal likelihood estimator for ridge penalty in high-dimensional GLMs, extending previous methods and providing computational efficiency.
Findings
MML estimator outperforms CV in accuracy for Poisson and Binomial models.
Robustness of estimators against model departures like sparsity and non-Gaussian errors.
Software implementation enables reproducibility of results.
Abstract
For high-dimensional linear regression models, we review and compare several estimators of variances and of the random slopes and errors, respectively. These variances relate directly to ridge regression penalty and heritability index , often used in genetics. Direct and indirect estimators of these, either based on cross-validation (CV) or maximum marginal likelihood (MML), are also discussed. The comparisons include several cases of covariate matrix , with , such as multi-collinear covariates and data-derived ones. In addition, we study robustness against departures from the model such as sparse instead of dense effects and non-Gaussian errors. An example on weight gain data with genomic covariates confirms the good performance of MML compared to CV. Several extensions are presented. First, to the high-dimensional…
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19| Linear | ||||||
|---|---|---|---|---|---|---|
| MML | 0.06 | 0.15 | 1.12 | 2.18 | 6.07 | 26.64 |
| Bayes | 0.04 | 0.31 | 4.38 | 1.10 | 7.78 | 93.25 |
| MoM | 0.01 | 0.08 | 1.03 | 0.17 | 2.32 | 23.70 |
| PCR | 0.05 | 0.39 | 5.36 | 1.39 | 10.31 | 116.80 |
| Basic | 0.05 | 0.46 | 6.56 | 1.44 | 12.40 | 145.18 |
| GCV | 0.20 | 0.46 | 4.56 | 12.26 | 26.41 | 111.38 |
| CV | 0.81 | 6.57 | 39.95 | 2.62 | 21.69 | 183.50 |
| HiLMM | 0.03 | 0.17 | 2.01 | 0.66 | 3.14 | 27.99 |
| Poisson | ||||||
| MML_mgcv | 0.32 | 0.33 | 0.31 | 26.21 | 40.19 | 48.17 |
| GCV_mgcv | 0.39 | 0.33 | 0.62 | 33.48 | 41.44 | 54.01 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Estimation of variance components, heritability and the ridge penalty in high-dimensional generalized linear models
111Supplementary Information available on …
Jurre R. Veerman
Gwenaël G.R. Leday
Mark A. van de Wiel
Dep. of Epidemiology & Biostatistics, Amsterdam Public Health research institute, Amsterdam University medical centers, Amsterdam, The Netherlands
Mathematical Institute, Leiden University, Leiden, The Netherlands
MRC Biostatistics Unit, Cambridge University, Cambridge, UK
Abstract
For high-dimensional linear regression models, we review and compare several estimators of variances and of the random slopes and errors, respectively. These variances relate directly to ridge regression penalty and heritability index , often used in genetics. Direct and indirect estimators of these, either based on cross-validation (CV) or maximum marginal likelihood (MML), are also discussed. The comparisons include several cases of covariate matrix , with , such as multi-collinear covariates and data-derived ones. In addition, we study robustness against departures from the model such as sparse instead of dense effects and non-Gaussian errors.
An example on weight gain data with genomic covariates confirms the good performance of MML compared to CV. Several extensions are presented. First, to the high-dimensional linear mixed effects model, with REML as an alternative to MML. Second, to the conjugate Bayesian setting, which proves to be a good alternative. Third, and most prominently, to generalized linear models for which we derive a computationally efficient MML estimator by re-writing the marginal likelihood as an -dimensional integral. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of the resulting MML estimator of as compared to CV. Software is provided to enable reproduction of all results presented here.
keywords:
Random effects, ridge regression, penalty parameter, heritability, genetics, empirical Bayes, marginal likelihood, cross-validation
††journal: Journal of LaTeX Templates
1 Introduction
Estimation of hyper-parameters is an essential part of fitting high-dimensional Gaussian random effect regression models, also known as ridge regression. These models are widely applied in genomics and genetics applications, where often the number of variables is much larger than the number of samples , i.e. .
We initially focus on the linear model. The goal is to estimate error variance and random effects variance or functions thereof, in particular the ridge penalty parameter, , or heritability index, . Here, the ridge penalty is used in classical ridge regression to shrink the regression coefficients to zero [1], whereas heritability measures the fraction of variation between individuals within a population that is due their genotypes [2]. The estimators of and can be used to estimate or , but also for statistical testing [3]. We review several estimators, based on maximum marginal likelihood (MML), moment equations, (generalized) cross-validation, dimension reduction, or degrees-of-freedom adjustment. Some of these estimators are classical, while others have recently been introduced.
We systematically review and compare the estimators in a broad variety of high-dimensional settings. For estimation of in low-dimensional settings, we refer to [4, 5, 6]. We address the effect of multi-collinearity and robustness against model misspecifications, such as sparsity and non-Gaussian errors. The comparisons are extended to the linear mixed effects model, with fixed effects added to the model and to Bayesian linear regression. The linear model part is concluded by a genomics data application to weight gain prediction after kidney transplantation.
The observed good performance of MML in the linear model setting was a stimulus to consider MML for high-dimensional generalized linear models (GLM). MML is more involved here than in the linear model, because of the non-conjugacy of the likelihood and prior. Therefore, approximations are required, such as Laplace ones. While these have been addressed by others [7, 8], we derive an estimator which is computationally efficient for settings. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of MML estimation of as compared to cross-validation.
Our software enables reproduction of all results. In addition, it allows comparisons for one’s own high-dimensional data matrix by simulating the response conditional on this matrix, as we do for two cancer genomics examples. Computational shortcuts and considerations are discussed throughout the paper, and detailed at the end, including computing times.
1.1 The Model
We initially focus on high-dimensional linear regression with random effects. Variables are denoted by and samples by . Then:
[TABLE]
Here, is the vector of responses, corresponds to the random effects and is a vector of Gaussian errors. Furthermore, is a fixed matrix: with .
1.2 Estimation Methods
We distinguish three categories of estimation methods:
Estimation of functions of , in particular [9], used in ridge regression to minimize , and heritability [10]. 2. 2.
Separate estimation of [11, 12], possibly followed by plug-in estimation of . 3. 3.
Joint estimation: estimate and jointly
Below, we discuss several methods for each of these categories. They have several matrices and matrix computations in common, which we therefore introduce first.
1.3 Notation and matrix computations
Throughout the paper, we will use the following notation:
[TABLE]
Many of the estimators below require calculations on potentially very large matrices. The following two well-known equalities can highly alleviate the computational burden.
First, , and hence also and , can be efficiently computed by using singular value decomposition (SVD). Decompose by SVD, and denote . Then,
[TABLE]
The latter requires inversion of an matrix only. Second, the following efficient trace computation for matrix products applies to
[TABLE]
2 Methods
2.1 Estimating functions of and
2.1.1 Estimating by -fold CV
A benchmark method that is used extensively to estimate is cross-validation. Here, we use -fold CV, as implemented in the popular R-package glmnet [13]. Let denote the set of samples left out for testing at the same fold as sample . Then, CV-based estimation of pertains to minimizing the cross-validated prediction error:
[TABLE]
where denotes the estimate of based on training samples and penalty . Note that for leave-one-out-cross-validation (-fold CV) the analytical solution of (5) is the PRESS statistic [14].
2.1.2 Estimating by Generalized Cross Validation
Generalized Cross Validation (GCV) is a rotation-invariant form of the PRESS statistic. It is more robust than the latter to (near-diagonal) hat matrices [9]. For the linear model, the criterion is [15]:
[TABLE]
where the trace of can be computed efficiently by (4). Then, .
2.1.3 Estimating heritability by HiLMM
Heritability is defined by . A recent method which estimates heritability directly using maximum likelihood is proposed in [10]. Analogously to equation (12), it is based on writing:
[TABLE]
where and . Now, apply an eigen-decomposition to : . Then, heritability is estimated by [10]:
[TABLE]
with and the th element of and , respectively. The authors provide rigorous consistency results for their estimator, as well as theoretical confidence bounds, also for mixed models and sparse settings.
2.2 Estimation of
The two methods below rely on an estimate , where is estimated by (G)CV. Then is estimated conditional on If desired, may then be estimated by
2.2.1 Basic estimate
A basic estimate of , and often used in practice, is given by [16]:
[TABLE]
which is the residual mean square error. Here, the residual effective degrees of freedom [16] equals , with as in (2). We also considered (9) with , as in [17], which rendered similar, slightly inferior results.
2.2.2 PCR-based estimate
The estimator for may also be based on Principal Component Regression (PCR). PCR is based on the eigen-decomposition . Denoting and , we have . Then, is reduced from columns to principal components, a crucial step [12]. Using the reduced model, is estimated by the residual mean square error [12]:
[TABLE]
2.3 Joint estimation of and
2.3.1 MML
An Empirical Bayes estimate of and is obtained by maximizing the marginal likelihood (MML), also referred to as model evidence in machine learning [18]. This corresponds to:
[TABLE]
Since , is simply derived from the convolution of Gaussian random variables, implying , and , so
[TABLE]
This is easily maximized over and . Note that after computing (12) requires operations on matrices only.
2.3.2 Method of Moments (MoM)
An alternative to MML is to match the empirical second moments of to their theoretical counterparts. From (12) we observe that the covariances depend on only. Hence, we obtain an estimator of by equating the sum of to that of the theoretical covariances, , with as in (12). Then, with , an estimator for is obtained by substituting and equating the sum of to the sum of theoretical variances, :
[TABLE]
These equations also hold for non-Gaussian error terms, which could be an advantage over MML. Moreover, no optimization over and is required, so MoM is computationally very attractive.
3 Comparisons
For the linear random effects model (ridge regression) we study the following settings:
and generated from model (1), independent
- 2.
or generated from non-Gaussian distributions, independent
- 3.
and from model (1), multicollinear
- 4.
and from model (1), data-based .
As is common for real data, the variables, i.e. the rows of , were always standardized for the -penalty to have the same effect on all variables. All the results are based on 100 simulated data sets. Cross-validation is applied on 10 folds. Results from -fold CV (leave-one-out) were generally fairly similar. We focus on the high-dimensional setting with , with excursions to larger data sets and dimensions of real data. In all visualizations below the red dotted lines indicate true values. Moreover, values larger than 20 times the true value were truncated and slightly jittered. Discussion of all results is postponed to Section 3.4.
3.1 Independent
In correspondence to model (1) we sample :
[TABLE]
Figures 1(c)(a) and (b) display the results for and for a large data setting (which both imply ).
3.2 Departures from a normal effect size distribution
We study the robustness of the methods against (sparse) non-Gaussian effect size distribution or error distribution. In sparse settings, many variables do not have an effect. To mimic this, we simulated the ’s from a mixture distribution with a ‘spike’ and a Gaussian ‘slab’:
[TABLE]
Here, we set which implies , as in the Gaussian setting. Moreover, we also considered:
[TABLE]
where again the parameters are chosen such that and Apart from all other quantities are simulated as in (LABEL:basicmodel). Results are displayed for in Figure 1(c)(c) for the Laplace (= lasso) effect size distribution and in Supplementary Figure 3 for the spike-and-slab and uniform effect size distribution.
Moreover, we considered heavy-tailed errors by sampling
[TABLE]
where the scalar is chosen such that , as in the Gaussian error setting. Apart from , all other quantities are simulated as in (LABEL:basicmodel). Results are displayed in Supplementary Figure 3(c).
3.3 Multicollinear
3.3.1 Simulated
Next, the design matrix is sampled using block-wise correlation. We replace the sampling of in simulation model (LABEL:basicmodel) by:
[TABLE]
where is a unit variance covariance matrix with blocks of size with correlations on the off-diagonal. Figure 2(c)(a) shows the results for .
3.3.2 Real data
Finally, we consider the estimation of and in a high- and medium-dimensional setting where are real data, with likely collinear columns. The first data set (TCGA KIRC) concerns gene expression data of genes for kidney tumors. The second data set (TCPA OV) holds expression data of proteins for ovarian tumor samples. Details on both data sets are supplied in the Supplementary Information. To generate response we use model (LABEL:basicmodel) with given by the data. Here, and is set such that Figures 2(c)(b) and (c) show the results.
3.4 Discussion of results
3.4.1 MML vs MoM, Basic and PCR
Figures 1(c) and 2(c) and Supplementary Figure 3 clearly show superior performance of MML compared to MoM: both the bias and variability are much smaller for MML. Generally, MML also outperforms the Basic and PCR estimators of . The PCR estimator approaches the performance of MML for the KIRC and TCPA data (Figures 2(c)(b) and 2(c)(c)), and the Basic estimator performs reasonably well for the latter () data set. For other settings, the Basic estimator performs equally inferior as MoM. The results highlight the importance of joint estimation of and in high-dimensional settings, because of their delicate interplay.
3.4.2 MML vs GCV and CV
For the estimation of MML seems slightly superior to GCV and CV. GCV shows more estimates that deviate towards too small values of (e.g. Figures 1(c)(b) and 2(c)(b), i.e. the large settings), whereas CV tends to render somewhat more skewed results, either to the right (Figures 1(c)(a) and 1(c)(c), 2(c)(a)), or to the left (Figure 2(c)(b)). For the spike-and-slab and uniform effects sizes and the errors the right-skewness of the CV-results is more pronounced (Supplementary Figure 3), indicating that minimization of the cross-validated prediction error (5) is more vulnerable to non-Gaussian than MML and GCV. Note that the Laplace setting (Figure 1(c)(c)) relates directly to the lasso prior with scale parameter [19]. The results indicate that MML with Gaussian prior could be useful to find the lasso penalty, or serve as a fast initial estimate by simply setting the lasso penalty , which follows from the variance of the lasso prior.
3.4.3 MML vs HiLMM
For the estimation of heritability Figures 1(c) and 2(c) and Supplementary Figure 3 show very comparable performance of MML and HiLMM. This similar performance is not surprising given that both methods are likelihood-based. Hence, while reparametrizing the likelihood (7) is certainly useful to study it as function of [10], the reparametrization seems not beneficial for the purpose of estimating . In addition, unlike HiLMM, MML also returns estimates of and . Finally, comparing Figures 1(c)(a) and 1(c)(b) we observe that both MML and HiLMM clearly benefit from the larger and .
4 Data example
We re-analyse the weight gain data, recently discussed in [17]. Details on the data are presented there, we provide a summary. The data consists of expression profiles of individuals with kidney transplants, where profiles consists of 28,869 genes as measured by Affymetrix Human Gene 1.0 ST arrays. The data is available in the EMBL-EBI ArrayExpress database (www.ebi.ac.uk/arrayexpress) under accession number E-GEOD-33070. It is known that kidney transplantation may lead to weight gain, and the study [20] investigates whether gene expression can be used to predict this. Such a prediction can be used to decide upon additional measures to prevent excessive weight gains. We reproduced the analysis by [17] as much as possible, including their prior selection of 1000 genes. Details on minor discrepancies, and an alternative analysis that accounts for the gene selection are discussed in the Supplementary Material. These did not affect the comparison qualitatively.
In [17], the authors illustrate their focused ridge (fridge) method and compare it with conventional ridge. In short, fridge estimates sample-specific ridge penalties, based on minimizing a per sample mean squared error (MSE) criterion on the level of the linear predictor . Since is not known, it is replaced by an initial ridge estimate, Their sample specific penalty then depends on , and also on both and . The authors use GCV (6) to obtain , and a slight variation of (9) to estimate . They show that fridge improves upon GCV-based ridge estimation. We wish to investigate whether i) MML estimation of also improves the performance of GCV-based ridge regression; and ii) whether MML estimation further boosts the performance of the fridge estimator. Here, predictive performance is measured by the mean squared prediction error (MSPE) using leave-one-out cross-validation (loocv).
The estimates of MML differ markedly from those of GCV: , while . Using instead of for the estimation of substantially reduced the mean squared prediction error: while , a relative decrease of 12.1%. Using , as in [17], fridge also reduced the MSPE, but to a lesser extent: a relative decrease of 3.5% with respect to Application of fridge using did not further decrease , nor did it increase it. Possibly, the already fairly small value of left little room for improvement. Figure 3 displays absolute prediction errors per sample and illustrates the improved prediction by ridge using (and to a lesser extent by fridge) with respect to ridge using .
5 Extensions
5.1 Extension 1: Mixed effects model
A natural extension of the high-dimensional random effects model (1) is the mixed effects model:
[TABLE]
where we assume that the design matrix for the fixed effects, , is of low-rank, so , as opposed to the random effects design matrix . Restricted maximum likelihood (REML) deals with the fixed effects by contrasting them out. For the error contrast vector with , the marginal likelihood for the variance components equals (see e.g. [21]):
[TABLE]
with In addition to maximizing (18) as a function of , we attempted solving the set of two estimation equations suggested by [22], but this rendered instable results inferior to maximizing (18) directly.
Alternatively, MML may be used, but it has to be adjusted to also estimate the fixed effects in the model. This implies replacing in Gaussian likelihood (11) by , and optimizing (11) with respect to parameters, where is the number of fixed parameters. The mixed model simulation setting is as follows:
[TABLE]
where (implying variance for generating random effects) and (implying variance for generating fixed effects). Note that we focused on a fairly sparse setting for the random effects and larger prior variance of fixed effects than of random effects, which enables a stronger impact of the small number of fixed effects. Figure 4 shows the results of REML, MML and CV (by glmnet, using penalty factor 0 for the fixed effects) for the estimation of and .
From Figure 4 we observe that REML indeed improves MML in terms of bias, however at the cost of increased variability. For the estimation of , CV is fairly competitive to REML and MML, although it renders markedly more over-penalization.
5.2 Extension 2: Bayesian linear regression
So far, we focused on classical methods. Bayesian methods may be a good alternative. We applied the standard Bayesian linear regression model, i.e. the conjugate model with i.i.d. priors , with fixed and endowed with a vague inverse-gamma prior (see Supplementary Material for details). For this model the maximum marginal likelihood estimator for is still analytical [23], and so is the posterior mode estimate of . Figure 5 shows the results in comparison to MML, i.e. maximization of (12), for the random effects case with multi-collinear , as in Section 3.3.1. Results for other settings were in essence very similar.
From the results we conclude that the conjugate Bayes estimates are very close to those of MML. This is in line with the fact that this conjugate model with prior variance is known to render posterior mean estimates of that equal the -penalized ridge regression estimates.
The conjugate Bayesian model is scale-invariant, because the prior contains the error variance . Recently, it was criticized for its non-robustness against misspecification of the fixed when estimating [24]. However, in practice one needs to estimate by either empirical Bayes (e.g. maximum marginal likelihood) or full Bayes. We repeated the simulation by [24] (see Supplementary Material). The results show that the estimates of are much better when estimating by empirical Bayes instead of fixing it, and in fact very competitive to alternatives proposed by [24].
5.3 Extension 3: Generalized linear models
5.3.1 Setting
Motivated by the good results for MML in the linear setting, we wish to extend MML estimation to the high-dimensional generalized linear model (GLM) setting, where the likelihood depends on the regression parameter only via the linear predictor, . Hence, likelihood is defined by a density (e.g. Poisson), where is mapped to by a link function (e.g. ). As before, we a priori assume i.i.d. , here equivalent to an penalty when estimating by penalized likelihood. In [7] an iterative algorithm to estimate is derived which alternates estimation of by maximization w.r.t. , requiring the computation of the trace of a Hessian of a matrix. Here, the estimation of itself is much slower than in the linear case, because it is not analytic and requires iterative weighted least squares approximation. Below we show how to substantially alleviate the computational burden in the setting by re-parameterizing the marginal likelihood implying computations in instead of .
5.3.2 Method
We have for the marginal likelihood:
[TABLE]
where denotes the normal density with mean and variance . Now a crucial observation is that for GLM:
[TABLE]
because the likelihood depends on only via the linear predictor . Here, is the implied -dimensional prior distribution of . This is a multivariate normal: . Therefore, we have:
[TABLE]
Hence, the -dimensional integral may be replaced by an -dimensional one, with obvious computational advantages when . Moreover, the use of (22) allows applying implemented Laplace approximations, which tend to be more accurate in lower dimensions. The Laplace approximation requires . We emphasize that this does generally not equal where : the maximum of the commonly used penalized (log)-likelihood. However, can be computed by noting that
[TABLE]
In other words, this is the penalized log-likelihood when regressing on the identity design matrix using an smoothing penalty matrix . The latter fits conveniently into the set-up of [8], as implemented in the R-package mgcv. This also facilitates MML estimation of by maximizing , with as in (23). If the columns of are standardized (common in high-dimensional studies), has rank instead of , implying that does not exist and should be replaced by a pseudo-inverse , such as the Moore-Penrose inverse.
In a full Bayesian linear model setting, dimension reduction is also discussed by [25], where is substituted by a -dimensional factor analytic representation, which requires an SVD of . In addition, there it is not used for hyper-parameter estimation by marginal likelihood, but instead for specifying (hierarchical) priors for the factors.
5.3.3 Results
R packages like glmnet [13] and penalized [26] estimate by cross-validation, and also mgcv allows, next to the MML estimation, (generalized) CV estimation [8]. Figures 6(a)and 6)(b) show the results for Poisson ridge regression, with , generated as in (LABEL:basicmodel), and generated as in (LABEL:basicmodel) and (16), which denote the independent and multi-collinear setting, respectively.
Figure 6 clearly shows the superior performance of MML based on (22) over CV. In particular, glmnet and penalized render strongly upward biased values. The mgcv GCV values are still inferior to MML based ones, but much better than the latter two, which may be due to the different regression estimators used (Laplace approximation versus iterative weighted least squares). We should stress that CV does not target for the estimation of as such, but merely for minimizing prediction error. Nevertheless, the difference is remarkably larger than in the corresponding linear case (see Figures 1(c) and 2(c)).
The Supplementary Material shows the results for Binomial ridge regression. While the differences in performance are less dramatic than for the Poisson setting, MML still renders much better estimates of than CV-based approaches.
6 Computational aspects and software
All methods and simulations presented here are implemented in a few wrapper R scripts: one for the linear random effects model (which includes the conjugate Bayes estimator), one for the linear mixed effects model, and one for Poisson and Binomial ridge regression. Parallel computations are supported. The scripts allow exact reproduction of the results in this manuscript as well as comparisons for other simulation or user-specific real data cases. In addition, a script is supplied to produce the box-plots as in this manuscript.
HiLMM, PCR and CV implementations are provided by the R-packages HiLMM, v1.1 [10], ridge, v1.8-16 [12] (code slightly adapted for computational efficiency) and glmnet, v2.0-16 [13]. The methods MML, REML, Bayes, MoM, Basic and GCV were implemented by us for the linear random and mixed effects models. For Poisson and Binomial ridge regression we applied mgcv, v1.8-16 [8] after our re-parametrization (22) to obtain MML and GCV results, while for CV glmnet and penalized, v0.9-50 [26] were applied. For all methods that required optimization the R routine optim was used, with default settings. CV was based on 10 folds.
Computing times of the various methods largely depend on and , much less so on the exact simulation setting. These are displayed for and in Table 1, based on computations with one CPU of an Intel® Xeon® CPU E5-2660 v3 @ 2.60GHz server. For Poisson ridge regression, we only report the computing times of MML and GCV, because, as reported in Figure 6, the performance of CV-based methods was very inferior.
From Table 1 we conclude that MML is also computationally very attractive. Its efficiency is explained by the fact that, unlike many of other methods, it does not require an SVD or other matrix decomposition of . Moreover, the only computation that involves dimension is the product .
7 Discussion
We compared several estimators in a large variety of high-dimensional settings. The results showed that plain maximum marginal likelihood works well in many settings. MML is generally superior to methods that aim to separately estimate (9, 10). Apparently, the estimates of and are so intrinsically linked in the high-dimensional setting that separate estimation is sub-optimal. The moment estimator (MoM) is generally not competitive to MML. It may, however, be useful in large systems with multiple hyper-parameters to estimate relative penalties, which are less sensitive to scaling issues than the global penalty parameter [27]. MoM may also be a useful initial estimator for more complex estimators that are based on optimization, such as MML.
Possibly somewhat surprising is the good performance of MML for estimating and , as these are functions of and . For the estimation of it is generally better than or competitive to (generalized) CV, an observation also made for the low-dimensional setting [8]. The inferior performance of the basic estimator of (9) implies that alternative estimators of that use as a plug-in are unlikely to perform well in high-dimensional settings. Such estimators, including the original one by Hoerl and Kennard [1], are compared by [4, 6], who show that some do perform well in the low-dimensional setting. For Poisson ridge regression, similar estimators of are available [5], but these rely on an initial maximum likelihood estimator of , and hence do not apply to the high-dimensional setting. For estimating it should be noticed that HiLMM [10] aims to compute a confidence interval for as well. For that purpose their direct estimator (8) is likely more useful than MML on the pair . We also used Esther [28], which precedes HiLMM by sure independence screening. It did not improve HiLMM in our (semi-)sparse settings, and requires manual steps. However, it likely improves HiLMM results in very sparse settings [28].
For mixed effect models with a small number of fixed effects, MML compares fairly well to REML, with a larger bias, but smaller variance. Probably the potential advantage of contrasting out the fixed effects is small when the number of random effects is large. REML may have a larger advantage in very sparse settings [29] or when the number of fixed effects is large with respect to . Estimates from the conjugate Bayes model are very similar to those by MML. We show that estimating along with highly improves the estimates presented by [24], where a fixed value of is used. In the case of many variance components or multiple similar regression equations, Bayesian extensions that shrink the estimates by a common prior are appealing, in particular in combination with efficient posterior approximations such as variational Bayes [30].
Our model (1) implies a dense setting, but we have demonstrated that the MML and REML estimators of and are fairly robust against moderate sparsity, which corroborates the results by [29]. Nevertheless, true sparse models may be preferable when variable selection is desired, which depends on accurate estimation of . On the other hand, post-hoc selection procedures can be rather competitive [31]. Moreover, the sparsity assumption is questionable for several applications. E.g. in genetics, it was suggested that many complex traits (such as height or cholesterol levels) are not even polygenic, but instead “omnigenic” [32].
The extension of MML to high-dimensional GLM settings (22) is promising given its computational efficiency and performance for Poisson and Binomial regression. A special case of the latter, logistic regression, requires further research, because the Laplace approximations of the marginal likelihood are less accurate here [8]. Extension to survival is a promising avenue, because Cox regression is directly linked to Poisson regression [33]. Alternatively, parametric survival models may be pursued. To what extent the estimates of hyper-parameters impact predictions depends on the sensitivity of the likelihood to these parameters. For the linear setting, a re-analysis of the weight-gain data showed that predictions based on improved those based on .
The MML estimator can be extended to estimation of multiple variance components or penalty parameters, which was addressed by iterative likelihood minorization [34] and by parameter-based moment estimation [27]. The latter extends to non-Gaussian response such as survival or binary. Further comparison of these methods with multi-parameter MML, both in terms of performance and computational efficiency, is left for future research. Finally, in particular in genetics applications, extensions of estimation of variance components by MML to non-independent individuals can be implemented by use of a well-structured between-individual covariance matrix [3].
Although our simulations cover a fairly broad spectrum of settings, many other variations could be of interest. We therefore supply fully annotated R scripts https://github.com/markvdwiel/Hyperpar that allow i) comparison of all algorithms discussed here, also for one’s ‘own’ real covariate set ; and ii) reproduction of all results presented here.
Acknowledgement
Gwenaël Leday was supported by the Medical Research Council, grant number MR/M004421. We thank Jiming Jiang and Can Yang for their correspondence, input and software for the MM algorithm. In addition, we thank Kristoffer Hellton for providing the fridge software and data. Finally, Iuliana Ciocǎnea-Teodorescu is acknowledged for preparing the TCGA KIRC data.
Supplementary Material
7.1 Contents
This Supplementary Material contains:
Details on the estimation of and with the conjugate Bayesian model
- 2.
Estimation results from this Bayesian model for a simulation by [24]
- 3.
Details on the TCGA KIRC and TCPA OV gene and protein expression data
- 4.
An alternative analysis of the Weight gain data
- 5.
Supplementary Figures:
- (a)
Results for estimation for Binomial ridge regression
- (b)
Robustness of estimates for non-Gaussian ’s and errors
7.2 Bayesian linear regression
7.2.1 Method
The conjugate Bayesian linear regression model is:
[TABLE]
where is the design matrix, is the vector of unknown regression parameters, is the vector of random errors and the prior shape and rate hyper-parameters and are fixed (say and ) so as to induce a non-informative prior. In model (24) the ridge regularization parameter corresponds to . In the following we provide the marginal posterior distribution of and , as well as a reformulation of the marginal likelihood [23] that allows important computational savings.
The likelihood function and prior densities are
[TABLE]
Therefore, the joint posterior is given by
The marginal posterior distribution of is recognized to be a Student distribution with degrees of freedom:
[TABLE]
Here , and . The marginal posterior distribution of is a Gamma distribution:
[TABLE]
where and The marginal likelihood of the model is:
[TABLE]
The marginal likelihood in (26) involves the determination of , which can be computationally demanding when the number of variables is large and when we wish to evaluate the marginal likelihood for various values of . To tackle this problem it is helpful to consider the singular value decomposition of , (where and are respectively and orthogonal matrices and is the diagonal matrix of singular values with ) and focus on the linear model , where and , instead of . Using simple algebra it can be shown that and , which suggest that the marginal likelihood (that is invariant to linear transformations) is more easily determined in the orthogonalized space. Indeed, we now have that
[TABLE]
Additionally, using the fact that
[TABLE]
where , we deduce that
[TABLE]
and
[TABLE]
Finally, the log-marginal likelihood reduces to
[TABLE]
with constant . Expression (31) makes the evaluation of very efficient for different values of . Furthermore, the function is -concave in , which facilitates the use of numerical methods over grid-search approaches. In practice, it might be good to employ a numerical algorithm on a sub-domain of obtained from a rough grid-search.
7.2.2 Results for simulation by [24]
The prior in the conjugate Bayesian model (24) is scale-invariant, because it contains the error variance . Recently, it was criticized in [24] for its failure to estimate well when is fixed and misspecified. In reality, is often estimated as well, which may improve results. To study this, we repeated the linear regression simulation by [24], which has the following specifications: with , and Moreover was generated from the independent standard Normal, and error variance
Figure 7 shows the results for three methods. Here, ML serves as a benchmark and refers the ordinary maximum likelihood estimator of , referred to by [24] as ‘Least Squares’. In addition, BayesEB and BayesFix estimate by i.e. the square-root of the posterior mean of (25), with estimated by empirical Bayes (EB; maximizing (31)) and with fixed to (as in [24]), respectively.
As in [24], the results show that indeed the conjugate Bayes model is not robust against wrongly fixing . However, it also shows that EB estimation of strongly improves the estimate. In fact, these estimates are very competitive to the ones for the scale-independent prior, , which was advocated by [24].
7.3 Details on real data used for simulations
7.3.1 Kidney tumor gene expression
The TCGA KIRC data is downloaded from the harmonised GDC database http://cancergenome.nih.gov/ using the R-package TCGAbiolinks [35]. It contains RNAseq profiles of kidney renal clear cell carcinomas. Only those genes were retained that had more than two counts per million in at least three samples, rendering genes. The data were normalised using the trimmed mean of M-values method of the R-package edgeR [36]. Following common practice, all gene expression values were standardized.
7.3.2 Ovarian tumor protein expression
We use protein expression data from the cancer proteome atlas (TCPA; [37]), which holds 408 ovarian serous cystadenocarcinoma profiles measuring 224 proteins by reverse-phase protein arrays. These are available from the TCPA portal: https://www.tcpaportal.org/tcpa/. Data were normalized by median centering per sample. All protein expression values were standardized.
7.4 Weight gain data example: alternative analysis
In the main document we re-analysed the weight gain data recently discussed in [17]. The authors kindly provided the R-code corresponding to the results in [17], which includes their focused ridge (fridge) methodology. Below we report a small discrepancy between our analysis and theirs, and also present results of an alternative analysis.
The authors of [17] opted to present results on (rather than ) samples, for technical reasons (personal communication). Results differed very little, so we chose to use all samples of the original study [20]. In addition, the R-code by [17] includes a prior selection of those 1,000 genes with the largest absolute marginal correlation correlation to the response, i.e. weight gain. Such a prior selection can indeed enhance performance of ridge-type predictors. The analysis by [17], however, did not include the gene selection as part of the outer leave-one-out cross-validation loop for assessing predictive performance. This may lead to over-optimism of the prediction errors. We therefore repeated the analysis and comparison as presented in the main document, but with the gene selection as part of the outer CV-loop. Indeed the error estimates increased substantially: (was 14.40), and Nevertheless, the conclusion remains that the MML and fridge MSPEs are lower that those of GCV, with relative decreases of 12.5% and 5.8%, respectively. Figure 8 displays absolute prediction errors per sample and illustrates the improved prediction by ridge using with respect to ridge using .
7.5 Supplementary Figures
7.5.1 Robustness: non-Gaussian ’s and errors
Here, we show additional results for the simulation settings with either non-Gaussian ’s (spike-and-slab or Uniform) or errors (), as presented in the main document. In all cases, parameters of the prior or the error () distribution were set such that and .
7.5.2 Binomial ridge regression
Note that unlike Poisson ridge regression Binomial ridge regression is not implemented in penalized. Hence, we compare mgcv-based results only with glmnet. Figure 10 displays the results for estimating , in case binomial . Results did not qualitatively differ for and . While MML is fairly good on target, the estimates from GCV are roughly 10 times too small, whereas the estimates from CV by glmnet are roughly 2-3 times too large, with outliers towards a 10-100 fold overestimation.
References
- [1]
A. E. Hoerl, R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.
- [2]
P. Visscher, W. Hill, N. Wray, Heritability in the genomics era - concepts and misconceptions, Nature Reviews Genetics 9 (4) (2008) 255–266.
- [3]
H. M. Kang, N. A. Zaitlen, C. M. Wade, D. Kirby, A.and Heckerman, M. J. Daly, E. Eskin, Efficient control of population structure in model organism association mapping, Genetics 178 (3) (2008) 1709–1723.
- [4]
G. Muniz, B. M. G. Kibria, On some ridge regression estimators: An empirical comparisons, Comm Stat Simul Comp 38 (3) (2009) 621–630.
- [5]
K. Månsson, G. Shukur, A poisson ridge regression estimator, Economic Modelling 28 (4) (2011) 1475–1481.
- [6]
B. Kibria, S. Banik, Some ridge regression estimators and their performances, J Modern Appl Statist Meth 15 (1) (2016) 12.
- [7]
S. Heisterkamp, J. Van Houwelingen, A. Downs, Empirical bayesian estimators for a poisson process propagated in time, Biom. J. 41 (4) (1999) 385–400.
- [8]
S. N. Wood, Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models, J. Roy. Statist. Soc., B 73 (1) (2011) 3–36.
- [9]
G. H. Golub, M. Heath, G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979) 215–223.
- [10]
A. Bonnet, E. Gassiat, C. Lévy-Leduc, Heritability estimation in high dimensional sparse linear mixed models, Elec J Statist 9 (2) (2015) 2099–2129.
- [11]
E. Cule, P. Vineis, M. De Iorio, Significance testing in ridge regression for genetic data, BMC Bioinf. 12 (2011) 372.
- [12]
E. Cule, M. De Iorio, A semi-automatic method to guide the choice of ridge parameter in ridge regression, arXiv preprint: 1205.0686 (2012) 32.
- [13]
J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, J. Statist. Soft. 33 (1) (2010) 1.
- [14]
D. Allen, The relationship between variable selection and data agumentation and a method for prediction, Technometrics 16 (1) (1974) 125–127.
- [15]
T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning, 2nd ed., Springer, New York, 2008.
- [16]
T. Hastie, R. Tibshirani, Generalized Additive Models, CRC press, 1990.
- [17]
K. H. Hellton, N. L. Hjort, Fridge: Focused fine-tuning of ridge regression for personalized predictions, Stat Med 37 (8) (2018) 1290–1303.
- [18]
K. Murphy, Machine learning, A Probabilistic Perspective, The MIT Press, 2012.
- [19]
R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B (1996) 267–288.
- [20]
A. Cashion, A. Stanfill, F. Thomas, L. Xu, T. Sutter, J. Eason, R. Ensell, M.and Homayouni, Expression levels of obesity-related genes are associated with weight change in kidney transplant recipients, PloS one 8 (3) (2013) e59962.
- [21]
URL http://statdb1.uos.ac.kr/teaching/multi-grad/ReML.pdf
- [22]
J. Jiang, Linear and generalized linear mixed models and their applications, Springer Science & Business Media, 2007.
- [23]
G. Karabatsos, Marginal maximum likelihood estimation methods for the tuning parameters of ridge, power ridge, and generalized ridge regression, Comm. Statist.-Sim. Comp. (To Appear).
- [24]
G. E. Moran, V. Rockova, E. I. George, On variance estimation for Bayesian variable selection, arXiv preprint arXiv:1801.03019.
- [25]
J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, M. West, Bayesian factor regression models in the “large p, small n” paradigm, Bayesian statistics 7 (2003) 733–742.
- [26]
J. Goeman, L1 penalized estimation in the Cox proportional hazards model, Biom. J. 52 (2010) 70–84.
- [27]
M. A. Van de Wiel, T. G. Lien, W. Verlaat, W. N. van Wieringen, S. M. Wilting, Better prediction by use of co-data: adaptive group-regularized ridge regression, Statist. Med. 35 (3) (2016) 368–381.
- [28]
A. Bonnet, C. Lévy-Leduc, E. Gassiat, R. Toro, T. Bourgeron, Improving heritability estimation by a variable selection approach in sparse high dimensional linear mixed models, J Roy Stat Soc C - Appl Statist 67 (4) (2018) 813–839.
- [29]
J. Jiang, C. Li, P. Debashis, C. Yang, , H. Zhao, On high-dimensional misspecified mixed model analysis in genome-wide association study, Ann Statist 44 (5) (2016) 2127–2160.
- [30]
G. G. R. Leday, M. de Gunst, G. B. Kpogbezan, A. W. van der Vaart, W. N. van Wieringen, M. A. van de Wiel, Gene network reconstruction using global-local shrinkage priors, Ann Appl Statist 11 (2017) 41–68.
- [31]
H. Bondell, B. Reich, Consistent high-dimensional bayesian variable selection via penalized credible regions, J. Amer. Statist. Assoc. 107 (2012) 1610–1624.
- [32]
E. A. Boyle, Y. I. Li, J. K. Pritchard, An expanded view of complex traits: From polygenic to omnigenic, Cell 169 (7) (2017) 1177–1186.
- [33]
T. Cai, R. Betensky, Hazard regression for interval-censored data with penalized spline, Biometrics 59 (3) (2003) 570–579.
- [34]
H. Zhou, L. Hu, J. Zhou, K. Lange, MM algorithms for variance components models, arXiv preprint arXiv:1509.07426.
- [35]
T. Colaprico, A.and Silva, C. Olsen, L. Garofano, D. Cava, C.and Garolini, T. Sabedot, T. M. Malta, S. Pagnotta, I. Castiglioni, M. Ceccarelli, G. Bontempi, H. Noushmehr, TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data, Nucl acids res 44 (8) (2015) e71–e71.
- [36]
M. Robinson, D. McCarthy, G. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics 26 (2010) 139–140.
- [37]
J. Li, Y. Lu, R. Akbani, Z. Ju, P. L. Roebuck, W. Liu, J.-Y. Yang, B. M. Broom, R. Verhaak, D. Kane, et al., TCPA: a resource for cancer functional proteomics data, Nature methods 10 (11) (2013) 1046.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] A. E. Hoerl, R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.
- 2[2] P. Visscher, W. Hill, N. Wray, Heritability in the genomics era - concepts and misconceptions, Nature Reviews Genetics 9 (4) (2008) 255–266. doi:10.1038/nrg 2322 . · doi ↗
- 3[3] H. M. Kang, N. A. Zaitlen, C. M. Wade, D. Kirby, A.and Heckerman, M. J. Daly, E. Eskin, Efficient control of population structure in model organism association mapping, Genetics 178 (3) (2008) 1709–1723.
- 4[4] G. Muniz, B. M. G. Kibria, On some ridge regression estimators: An empirical comparisons, Comm Stat Simul Comp 38 (3) (2009) 621–630.
- 5[5] K. Månsson, G. Shukur, A poisson ridge regression estimator, Economic Modelling 28 (4) (2011) 1475–1481.
- 6[6] B. Kibria, S. Banik, Some ridge regression estimators and their performances, J Modern Appl Statist Meth 15 (1) (2016) 12.
- 7[7] S. Heisterkamp, J. Van Houwelingen, A. Downs, Empirical bayesian estimators for a poisson process propagated in time, Biom. J. 41 (4) (1999) 385–400.
- 8[8] S. N. Wood, Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models, J. Roy. Statist. Soc., B 73 (1) (2011) 3–36.
