Bayesian $l_0$-regularized Least Squares
Nicholas G. Polson, Lei Sun

TL;DR
This paper introduces a scalable method for Bayesian variable selection using $l_0$ regularization, linking Bayesian regularization with proximal updates, and demonstrates its effectiveness through simulations and real data examples.
Contribution
It establishes a connection between Bayesian regularization and proximal updates, enabling efficient spike-and-slab variable selection with the SBR algorithm.
Findings
SBR is faster and more scalable than traditional posterior sampling.
The method achieves comparable statistical properties to spike-and-slab priors.
Simulation and real data results validate the approach's efficiency and accuracy.
Abstract
Bayesian -regularized least squares is a variable selection technique for high dimensional predictors. The challenge is optimizing a non-convex objective function via search over model space consisting of all possible predictor combinations. Spike-and-slab (a.k.a. Bernoulli-Gaussian) priors are the gold standard for Bayesian variable selection, with a caveat of computational speed and scalability. Single Best Replacement (SBR) provides a fast scalable alternative. We provide a link between Bayesian regularization and proximal updating, which provides an equivalence between finding a posterior mode and a posterior mean with a different regularization prior. This allows us to use SBR to find the spike-and-slab estimator. To illustrate our methodology, we provide simulation evidence and a real data example on the statistical properties and computational efficiency of SBR versus directâŠ
| Variable | Lasso | Elastic Net | Spike & Slab | SBR |
|---|---|---|---|---|
| sex | ||||
| bmi | ||||
| map | ||||
| hdl | ||||
| ltg | ||||
| glu | â | â | ||
| â | â | |||
| â | â | |||
| â | ||||
| â | â | |||
| â | â | |||
| â | â | |||
| â | â | â | ||
| â |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
Bayesian Regularized Least Squares
Nicholas G. Polson111Nicholas G. Polson is Professor of Econometrics and Statistics at the University of Chicago Booth School of Business. email: [email protected]
*Booth School of Business
University of Chicago*
ââ
Lei Sun222Lei Sun is PhD Candidate at the Department of Statistics, University of Chicago. email: [email protected]
*Department of Statistics
University of Chicago
(First Draft: April 30, 2017
This Draft: Aug 12, 2018)
Abstract
Bayesian -regularized least squares is a variable selection technique for high dimensional predictors. The challenge is optimizing a non-convex objective function via search over model space consisting of all possible predictor combinations. Spike-and-slab (a.k.a. Bernoulli-Gaussian) priors are the gold standard for Bayesian variable selection, with a caveat of computational speed and scalability. Single Best Replacement (SBR) provides a fast scalable alternative. We provide a link between Bayesian regularization and proximal updating, which provides an equivalence between finding a posterior mode and a posterior mean with a different regularization prior. This allows us to use SBR to find the spike-and-slab estimator. To illustrate our methodology, we provide simulation evidence and a real data example on the statistical properties and computational efficiency of SBR versus direct posterior sampling using spike-and-slab priors. Finally, we conclude with directions for future research.
Keywords: Spike-and-Slab Prior, Regularization, Proximal Updating, Single Best Replacement, Lasso, Variable Selection, Sparsity, Bayes, Model Choice
1 Introduction
Bayesian regularization is an attractive solution for high dimensional variable selection as it directly penalizes the number of predictors. The caveat is the need to search over all possible model combinations, as a full solution requires enumeration over all possible models which is NP-hard. The gold standard for Bayesian variable selection are spike-and-slab priors, or Bernoulli-Gaussian mixtures (Mitchell and Beauchamp, 1988; George and McCulloch, 1993; Scott and Varian, 2014). Whilst spike-and-slab priors provide full model uncertainty quantification, they can be hard to scale to very high dimensional problems, and can have poor sparsity properties (Amini et al., 2012). On the other hand, techniques like proximal algorithms (Polson et al., 2015; Polson and Sun, 2017) can solve non-convex optimization problems which are fast and scalable, but they generally donât provide a full assessment of model uncertainty (Jeffreys, 1961; Hans, 2009; Scott and Berger, 2010; Li and Lin, 2010; Marjanovic and Solo, 2013).
Our goal is to build on the single best replacement (SBR) algorithm of Soussen et al. (2011) as one approach to Bayesian regularization, in contrast with other methods based on proximal algorithms (Parikh and Boyd, 2014; Polson et al., 2015; Polson and Sun, 2017). Our approach also builds on other Bayesian regularization methods including, for example, the Bayesian bridge (Polson et al., 2014), horseshoe regularization (Carvalho et al., 2010; Bhadra et al., 2017a), SVMs (Polson and Scott, 2011), Bayesian lasso (Hans, 2009; Park and Casella, 2008; Carlin and Polson, 1991), Bayesian elastic net (Li and Lin, 2010; Hans, 2011), spike-and-slab lasso (Rockova and George, 2017), and global-local shrinkage priors (Bhadra et al., 2017a; Griffin and Brown, 2010).
To fix notation, statistical regularization requires the specification of a measure of fit, denoted by and a penalty function, denoted by , where is a global regularization parameter. From a Bayesian perspective, and correspond to the negative logarithms of the likelihood and prior distribution, respectively. Regularization leads to an optimization problem of the form
[TABLE]
Taking a probabilistic approach leads to a Bayesian hierarchical model
[TABLE]
The solution to the minimization problem estimated by regularization corresponds to the posterior mode, , where denotes the posterior distribution. For example, regression with a least squares log-likelihood subject to a penalty such as an -norm (ridge) Gaussian probability model or -norm (lasso) double exponential probability model.
The rest of the paper is outlined as follows. Section 2 defines the Bayesian regularization problem and explores the connections with spike-and-slab priors. Section 3 introduces a novel connection between regularization and Bayesian inference. Section 4 surveys recent developments on regularization and best subset selection problems. Section 5 describes the SBR algorithm for regularization. Section 6 provides a comparison between SBR and other variable selection methods and models including Lasso and elastic net (Tibshirani, 1994; Zou and Hastie, 2005), Bayesian bridge (Polson et al., 2014), and spike-and-slab (George and McCulloch, 1993). Finally, Section 7 concludes with directions for future research.
2 Bayesian regularization
Consider a standard Gaussian linear regression model, where is a design matrix, is a -dimensional coefficient vector, and is an -dimensional independent Gaussian noise. After centralizing and all columns of , we ignore the intercept term in the design matrix as well as , and we can write
[TABLE]
To specify a prior distribution , we impose a sparsity assumption on , where only a small portion of all âs are non-zero. In other words, , where , the cardinality of the support of , also known as the pseudo-norm of . A multivariate Gaussian prior ( norm) leads to poor sparsity properties in this situation (see, e.g., Polson and Scott, 2010).
Sparsity-inducing prior distributions for can be constructed to impose sparsity. The gold standard is a spike-and-slab priors (Jeffreys, 1961; Mitchell and Beauchamp, 1988; George and McCulloch, 1993). Under these assumptions, each exchangeably follows a mixture prior consisting of , a point mass at [math], and a Gaussian distribution centered at zero. Hence we write,
[TABLE]
Here controls the overall sparsity in and accommodates non-zero signals. This family is termed as the Bernoulli-Gaussian mixture model in the signal processing community.
A useful re-parameterization, the parameters is given by two independent random variable vectors and such that , with probabilistic structure
[TABLE]
Since and are independent, the joint prior density becomes
[TABLE]
The indicator can be viewed as a dummy variable to indicate whether is included in the model. (Soussen et al., 2011)
Let be the âactive setâ of , and be its cardinality. The joint prior on the vector then factorizes as
[TABLE]
Let be the set of âactive explanatory variablesâ and be their corresponding coefficients. We can write . The likelihood can be expressed in terms of , as
[TABLE]
Under this re-parameterization by , the posterior is given by
[TABLE]
Our goal then is to find the regularized maximum a posterior (MAP) estimator
[TABLE]
By construction, the will directly perform variable selection. Spike-and-slab priors, on the other hand, will sample the full posterior and calculate the posterior probability of variable inclusion.
Finding the MAP estimator is equivalent to minimizing over the regularized least squares objective function (Soussen et al., 2011).
[TABLE]
This objective possesses several interesting properties:
The first term is essentially the least squares loss function. 2. 2.
The second term looks like a ridge regression penalty and has connection with the signal-to-noise ratio (SNR) . Smaller SNR will be more likely to shrink the estimate of towards [math]. If , the prior uncertainty on the size of non-zero coefficients is much larger than the noise level, that is, the SNR is sufficiently large, this term can be ignored. This is a common assumption in spike-and-slab framework in that people usually want or to be âsufficiently largeâ in order to avoid imposing harsh shrinkage to non-zero signals. 3. 3.
If we further assume that , meaning that the coefficients are known to be sparse a priori, then , and the third term can be seen as an regularization.
Therefore, our Bayesian objective inference is connected to -regularized least squares, which we summarize in the following proposition.
Proposition 1**.**
(Spike-and-slab MAP & regularization)
For some , assuming , , the Bayesian MAP estimate defined by (2.4) is equivalent to the regularized least squares objective, for some ,
[TABLE]
Proof.
First, assuming that
[TABLE]
gives us an objective function of the form
[TABLE]
Equation (2.6) can be seen as a variable selection version of equation (2.5). The interesting fact is that (2.5) and (2.6) are equivalent. To show this, we need only to check that the optimal solution to (2.5) corresponds to a feasible solution to (2.6) and vice versa. This is explained as follows.
On the one hand, assuming is an optimal solution to (2.5), then we can correspondingly define , , such that is feasible to (2.6) and gives the same objective value as gives (2.5).
On the other hand, assuming is optimal to (2.6), implies that we must have all of the elements in should be non-zero, otherwise a new will give a lower objective value of (2.6). As a result, if we define , will be feasible to (2.5) and gives the same objective value as gives (2.6).
Combining both arguments shows that the two problems (2.5) and (2.6) are equivalent. Hence we can use results from non-convex optimization literature to find Bayes MAP estimators. â
3 Bayesian regularization and Proximal Updating
Section 2 provides a connection between spike-and-slab and regularization, in the sense that regularization can be viewed as the MAP estimator of Bayesian spike-and-slab regression. From a Bayesian perspective, the posterior mean is preferred to the MAP estimator due to its superior risk minimization properties under mean squared error. Starck et al. (2013) also discusses the inadequacy of interpreting sparse regularization as Bayesian MAP estimator. We now show that the posterior mean estimation is also connected to regularization and introduce the connection with the help of proximal operators and Tweedieâs formula (Efron, 2011).
Gribonval (2011) considers the relationship between the MAP estimator and the posterior mean. Polson and Scott (2016) considers the relationship between posterior modes and envelopes. These approaches try to uncover the implicit prior that maps a posterior mean to a mode. As the posterior mean is designed to minimize the mean squared error and Bayes risk, this is the appropriate calculation.
3.1 Posterior mean optimality
Consider a normal mean problem in Bayesian setting,
[TABLE]
Then the optimal estimator of with respect to the quadratic loss is the posterior mean. To calculate this posterior mean, , Efron (2011) introduced Tweedieâs formula which, for the normal mean problem, gives
[TABLE]
where the marginal density of is
[TABLE]
Here is the probability density function (pdf) of . The interpretation of the Bayesian correction term,
[TABLE]
is to âregularizeâ the unbiased maximum likelihood estimator , which provides the optimal bias-variance trade-off for prediction, see Pericchi and Smith (1992) for further discussion. When both and are unknown, Donoho and Reeves (2013) proposes a procedure to estimate each term, which achieves the optimal Bayes risk, resulting in the plug-in estimator
[TABLE]
Another useful result applies Steinâs risk function to Tweedieâs formula. Then we can derive this optimal Bayes risk (Robbins, 1956),
[TABLE]
where is the Fisher Information for . This risk is optimal given the use of the posterior mean estimator.
Tweedieâs formula can be generalized to Gaussian linear regression in the following theorem.
Theorem 3.1**.**
Suppose in the Gaussian linear regression model,
[TABLE]
Let denote the prior density of , and the marginal (prior predictive) density of . Then, assuming exists, the posterior mean of given is
[TABLE]
Proof.
Let denote the multivariate normal density of , then the posterior density of given ,
[TABLE]
Therefore, the posterior mean of the quantity ,
[TABLE]
Note that by the property of the multivariate normal density,
[TABLE]
and so (3.4) becomes
[TABLE]
Multiplying both sides by and assuming exists, the posterior mean of given becomes
â
Itâs easy to see that, similar to Tweedieâs formula for the normal means problem, the posterior mean in the Gaussian linear regression (3.3) consists of two parts. One is the usual weighted least squares solution, and the other is a Bayesian correction by the gradient of the prior predictive score, . Griffin and Brown (2010) gives the equivalent result in the form when the least squares estimator instead of is conditioned on. Masreliez (1975) discusses the posterior mean under Gaussian prior but non-Gaussian likelihood. (Pericchi and Smith, 1992)
3.2 Regularized linear regression
Proximal operators and Tweedieâs formula also provide a way to connect the Bayesian posterior mean and the regularized least squares. Specifically, we want to find a , such that the Bayesian posterior estimator with the prior is the same as the solution to the -regularized least squares. That is,
[TABLE]
We now use the theory of proximal mappings to re-write this estimator. First, here are some definitions. The Moreau envelope and proximal mapping of a convex function are defined as
[TABLE]
The Moreau envelope is a regularized version of and approximates from below, and has the same set of minimizing values as . The proximal mapping returns the value that solves the minimization problem defined by the Moreau envelope. It balances two goals: minimizing , and staying near .
Now, observe that if is the value that achieves the minimum,
[TABLE]
which leads to By construction of the envelope,
[TABLE]
This leads to the fundamental proximal relation Therefore, write
[TABLE]
Meanwhile, the definition of proximal mapping (3.6) and its property (3.7) give us
[TABLE]
Combining with Tweedieâs formula (3.2), gives
[TABLE]
Hence, if we want to match a regularized least squares with a posterior mean, we can âsolveâ for the penalty , given a marginal distribution , via the equation for the proximal mapping (3.8). If is non-differentiable at a point, we replace by .
Gribonval (2011) provides the following answer. Given any , find such that . Then the penalty
[TABLE]
with the constant to ensure that . To see why this construction makes sense, simply take derivatives with respect to on both sides, and get
[TABLE]
To summarize, the solution to the regularized least squares problem
[TABLE]
is the posterior mode with the prior . Our proximal operators and Tweedieâs formula discussion shows that the regularized least squares solution can also be viewed as the posterior mean under an implied prior , see Strawderman et al. (2013).
To illustrate our result, When is sparsity-inducing, such as the spike-and-slab, we can construct the associated penalty , which is typically non-convex for both Gaussian and Laplace cases.
3.3 Example: Spike-and-slab Gaussian & Laplace prior
For the normal mean problem (3.1), assuming is the aforementioned spike-and-slab (Bernoulli Gaussian) prior (2.2), the marginal distribution of is a mixture of two mean zero normals,
[TABLE]
The posterior mean is given by
[TABLE]
Thus, , we can find such that ; then the penalty associated with the Bayesian posterior mean with the Bernoulli-Gaussian prior can be obtained by (3.9). doesnât have an analytical form, but can be computed numerically.
Amini et al. (2012) argued that Bernoulli-Gaussian priors are usually not applicable to real-world signals, and proposed Bernoulli-Laplace priors which are infinitely divisible and more appropriate for sparse signal processing. The Bernoulli-Laplace priors are very similar to the Bernoulli-Gaussian ones, their only difference being that the âslabâ parts are replaced by Laplace distributions. Therefore, the prior
[TABLE]
Mitchell (1994) and Hans (2009) studied the marginal and posterior distribution with the Laplace prior. With their results, the marginal density of with the Bernoulli-Laplace prior (3.10) is given by
[TABLE]
where
[TABLE]
Here is the cumulative distribution function (cdf) of the standard normal. The posterior mean is then given by
[TABLE]
Similarly, we are also able to find the penalty associated with this Bernoulli-Laplace prior numerically by (3.9). Itâs worth noting that this prior is a special case of the spike-and-slab Lasso prior proposed by Rockova and George (2017). In that paper the authors use a mixture of two Laplace distributions, one of which is very close to as its variance goes to zero. Both priors are capable of striking a balance between hard-thresholding and soft-thresholding.
For comparison, and are plotted in Figure 1; and are plotted in Figure 2. Both priors shrink small observations towards zero. For large observations, when is small, Bernoulli-Gaussian, like ridge regression, unnecessarily penalizes large observations too much, whereas Bernoulli-Laplace is more like Lasso. As gets larger, both priors get closer to hard-thresholding, and their associated penalties closer to SCAD-like non-convex penalties (Fan and Li, 2001).
4 Computing the -regularized regression solution
We now turn to the problem of computation. -regularized least squares (2.5) is closely related to the best subset selection in linear regression as follows.
[TABLE]
The -regularized least squares (2.5) can be seen as (4.1)âs Lagrangian form. However, due to high non-convexity of the -norm, (2.5) and (4.1) are connected but not equivalent. In particular, for any given , there exists an integer , such that (2.5) and (4.1) have the same global minimizer . However, itâs not true the other way around. Itâs possible, even common, that for a given , we cannot find a , such that the solutions to (4.1) and (2.5) are the same.
Indeed, for , let be respective optimal solutions to (4.1) and respective optimal objective values, and so . If we want a solution to (2.5) has , we need to find a such that
[TABLE]
with the caveat that such needs not exist.
Both problems involve discrete optimization and have thus been seen as intractable for large-scale data sets. As a result, in the past, norm is usually replaced by its convex relaxation norm to facilitate computation. However, itâs widely known that the solutions of norm problems provide superior variable selection and prediction performance compared with their convex relaxation such as Lasso. Zhang et al. (2014) studies the statistical properties of the theoretical solution to (2.5), and points out that the solution to the -regularized least squares should be better than Lasso in terms of variable selection especially when we have a design matrix that has high collinearity among its columns.
Bertsimas et al. (2016) introduced a first-order algorithm to provide a stationary solution to a class of generalized -constrained optimization problem, with convex ,
[TABLE]
Let be the Lipschitz constant for such that , . Their âAlgorithm 1â is as follows.
Initialize such that . 2. 2.
For , obtain as
[TABLE]
until convergence to .
where the operator is to keep the largest elements of a vector as the same, whilst to set all else to zero. It can also be called the hard thresholding at the largest element. In the least squares setting when , and are easy to compute. Bertsimas et al. (2016) then uses the stationary solution obtained by the aforementioned algorithm (4.3) as a warm start for their mixed integer optimization (MIO) scheme to produce a âprovably optimal solutionâ to the best subset selection problem (4.1).
Itâs worth pointing out that the key iteration step (4.3) is connected to the proximal gradient descent (PGD) algorithm many have used to solve the -regularized least squares (2.5), as well as other non-convex regularization problems. PGD methods solve a general class of problems such as
[TABLE]
where is the same as in (4.2), and , usually non-convex, is a regularization term. In this framework, in order to obtain a stationary solution , the key iteration step is
[TABLE]
where can be seen as a gradient descent step for and is the proximal operator for . In -regularized least squares, , and its proximal operator is just the hard thresholding at . That is, is to keep the same all elements no less than , whilst to set all else to zero. As a result, the similarity between (4.3) and (4.5) are quite obvious.
In a recent work, Jewell and Witten (2017) proposes an exact algorithm to obtain the global minimum of -regularized optimization in a computational neuroscience context. Consider the optimization problem
[TABLE]
By exploiting the sequential time series nature of the problem, one can recast the problem as a changepoint detection problem and use the available results in that literature to design a dynamic programming algorithm. Instead of trying to simultaneously find all âs where , the algorithm aims to find them sequentially, from to at each step. In this sense, it can reach a global minimum within . The authors further speed up the algorithm by pruning the set of possible changepoints to at each step of the sequential search, and reduce the expected time cost to .
5 Single best replacement (SBR) algorithm
The single best replacement (SBR) algorithm, originally developed by Soussen et al. (2011), provides solution to the variable selection regularization (2.6). Since (2.6) and the -regularized least squares (2.5) are equivalent, SBR also provides a practical way to give a sufficiently good local optimal solution to the NP-hard regularization.
Take a look at the objective (2.6). For any given variable selection indicator , we have an active set , based on which the minimizer of (2.6) has a closed form. will set every coefficients outside to zero, and regress on , the variables inside . Therefore, the minimization of the objective function can be determined by or alone. Accordingly, the objective function (2.6) can be rewritten as follows.
[TABLE]
The SBR algorithm thus tries to minimize via choosing the optimal .
The algorithm works as follows. Suppose we start as an initial , usually the empty set. At each iteration, SBR aimes to find a âsingle change of â, that is, a single removal from or adding to of one element, such that this single change decreases the most. SBR stops when no such change is available, or in other words, any single change of or will only give the same or larger objective value. Therefore, intuitively SBR stops at a local optima of .
SBR is essentially a stepwise greedy variable selection algorithm. At each iteration, both adding and removal are allowed, so this algorithm is one example of the âforward-backwardâ stepwise procedures. Itâs provable that with this feature the algorithm âcan escape from some [undesirable] local minimizersâ of (Soussen et al., 2015). Therefore, SBR can solve the -regularized least squares in a sub-optimal way, providing a satisfactory balance between efficiency and accuracy.
We are now writing out the algorithm more formally. For any currently chosen active set , define a single replacement as adding or removing a single element ,
[TABLE]
Then we compare the objective value at current with all of its single replacements , and choose the best one. SBR proceeds as follows.
Step 0: Initialize . Usually, . Compute . Set .
Step : For every , compute . Obtain the single best replacement .
- 1.
If , stop. Report as the solution. 2. 2.
Otherwise, set , , and repeat step .
Soussen et al. (2011) shows that SBR always stops within finite steps. With the output , the locally optimal solution to the -regularized least squares is just the coefficients of regressed on and zero elsewhere.
In order to include both forward and backward steps in the variable selection process, the algorithm needs to compute for every at every step. Because it involves a one-column update of current design matrix , this computation can be made very efficient by using the Cholesky decomposition, without explicitly calculating linear regressions at each step (Soussen et al., 2011). An R package implementation of the algorithm is available upon request.
6 Applications
6.1 Statistical properties of SBR and regularization
The design matrix in this experimentation has rows and columns. In order to impose high collinearity in the columns of , we construct it in the following way.
Construct a matrix consisting of random samples and obtain . If , will have a low-rank structure. Here we use . 2. 2.
Sample each of the rows of from the multivariate normal distribution . 3. 3.
Centralize and normalize the columns of such that each column sums to zero and has unit norm.
A design matrix constructed this way has highly collinear columns. is a highly sparse coefficient vector with elements, of which are zero, and the rest randomly chosen to be . The noise vector is sampled from . In this setting, the signal-to-noise ratio was defined as
[TABLE]
and is determined such that . Finally, let be the observation.
Essentially all kinds of regularization methods, including ridge regression, Lasso, and regularization, share a common difficulty: to find a suitable regularization parameter . A thorough theoretical treatment on the optimal hasnât been established, and in practice cross validation is often used.
Meanwhile, one of the advantages of regularization is that, compared with its counterpart, the estimates of coefficients are relatively insensitive to when itâs large enough, as shown in Figure 3. Therefore, we donât have to worry too much about choosing a particularly optimal .
In a large scale numerical experiment, we conducted simulation trials with random generated as above. We compare SBR with the ordinary least squares (OLS), two sparsity regularization methods, Lasso and elastic net with the tuning parameter , and two Bayesian MCMC methods, Bayesian bridge (Polson et al., 2014) and Bayesian spike-and-slab shrinkage (Scott and Varian, 2014), each with their state-of-the-art implementations. The regularization parameter in SBR is determined by a 10-fold cross validation. For Lasso and elastic net, the regularization parameter is chosen by the built-in cross validation in glmnet (Friedman et al., 2010). For Bayesian bridge, we use the default prior in the package BayesBridge (Polson et al., 2012). For spike-and-slab, the prior inclusion probability is set to be , and other hyperparameters are the same as in the default setting in the package BoomSpikeSlab (Scott, 2016).
Figure 4 shows the accuracy in estimating for the six methods. For OLS, Lasso, elastic net, and SBR, are the minimizers of the corresponding optimization problems, and for Bayesian bridge and spike-and-slab, are the posterior means. SBR performs as well as the gold standard Spike-and-slab and better than all else, including the two widely used convex regularization methods.
In terms of variable selection, we compare SBR with the other sparsity-inducing methods, Lasso, elastic net, and Bayesian spike-and-slab. For spike-and-slab, an coefficient is selected when its posterior inclusion probability is greater than . Figure 5 shows that all methods are able to select all the true non-zero coefficients almost all the time. However, when compared in terms of preventing false selection, SBR and spike-and-slab are the best, whereas Lasso and elastic net tend to drastically over-select.
Recently, Hastie et al. (2017) compared Lasso with best subset selection (Bertsimas et al., 2016), forward stepwise selection, and found similar phenomenon. Namely, the performance of best subset selection and forward stepwise selection is overall similar, and both tend to outperform Lasso in the high signal-to-noise setting. Since SBR is a forward-backward stepwise selection algorithm, itâs no surprise then that SBR should give better results than Lasso in such setting.
6.2 Computational efficiency and scalability of SBR
In section 6.1 weâve shown that the regularization has superior statistical properties in terms of both minimizing the estimation risk and selecting correct variables, especially its statistical performance improvement on convex regularization methods such as Lasso and elastic net.
Full Bayesian spike-and-slab performs very well statistically. In order to achieve this good performance, spike-and-slab needs to do a complete MCMC sampling, and this task could take a significant amount of time, especially in high-dimensional settings, whereas regularization methods are usually able to handle large-scale computation efficiently.
In order to compare the computational efficiency of different methods, two sets of experiments, one with , the other , are run, and the results are plotted in Figure 6. SBR, as well as Lasso and elastic net, is almost as efficient as OLS, and only changes proportionally when the size of the problem increases. On the contrary, the two full Bayesian methods, Bayesian bridge and especially spike-and-slab, are costly and scale badly with the problem size. Actually when , it could take as much as minutes to run even one spike-and-slab MCMC, whereas SBR finishes all simulation trials under seconds. When and are in thousands, spike-and-slab is computationally intractable.
6.3 Diabetes data
Now we examine the performance of SBR on the classic diabetes data, available in the R package lars (Efron et al., 2004). The design matrix has columns, including all biochemical attributes and certain interactions. Each column of has been normalized to have zero mean and unit norm. The response is centralized to have zero mean. We compare SBR with sparsity-inducing methods including Lasso, elastic net, and spike-and-slab priors, with the same settings as in Section 6.1, and determined by cross validation. The results shown in Table 1 indicate that SBRâs performance on variable selection is in line with popular sparse linear regression alternatives.
7 Discussion
Bayesian regularization can be solved using a fast and scalable single best replacement (SBR) algorithm. In variable selection, this estimator possesses much of the statistical properties of spike-and-slab priors. We provide theoretical links between the spike-and-slab MAP estimator and regularization.
We also explore the connection between regularized MAP estimators and posterior means (Strawderman et al., 2013). TweedieâMasreliez construction of the posterior mean is re-interpreted as a proximal update rule. This proximal update identity shows how the sparse posterior mode can be viewed as a posterior mean under a suitably defined prior. Bernoulli-Gaussian (BG) and Bernoulli-Laplace (BL) priors are used for illustration. Our approach demonstrates how regularized estimators can have good out-of-sample mean squared error.
In simulated and real data applications, SBR performs favorably compared with popular convex regularization methods such as Lasso and elastic net, as well as full Bayesian sampling methods including Bayesian bridge and spike-and-slab priors.
Recently non-convex feature selection methods for sparse signals estimation have gained increasing attention from the statistical learning community, including the classic SCAD penalty (Fan and Li, 2001), penalty (Marjanovic and Solo, 2013), and horseshoe regularization (Bhadra et al., 2017b). There are a number of future directions for research, such as regularized logistic regression (Gramacy and Polson, 2012) and structural sparsity learning (Polson and Sun, 2017). A comprehensive theoretical treatment and empirical comparison on different non-convex regularizations on the trade-off between statistical accuracy and computational efficiency remains open.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Amini et al. (2012) Amini, A., Kamilov, U. S., and Unser, M. (2012). âThe analog formulation of sparsity implies infinite divisibility and rules out Bernoulli-Gaussian priors.â In 2012 IEEE Information Theory Workshop , 682â686.
- 2Bertsimas et al. (2016) Bertsimas, D., King, A., and Mazumder, R. (2016). âBest subset selection via a modern optimization lens.â Ann. Statist. , 44(2): 813â852.
- 3Bhadra et al. (2017 a) Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2017 a). âThe horseshoe+ estimator of ultra-sparse signals.â Bayesian Analysis , to appear.
- 4Bhadra et al. (2017 b) â (2017 b). âHorseshoe Regularization for Feature Subset Selection.â ar Xiv preprint ar Xiv:1702.07400 .
- 5Carlin and Polson (1991) Carlin, B. P. and Polson, N. G. (1991). âInference for nonconjugate Bayesian models using the Gibbs sampler.â Canadian Journal of Statistics , 19(4): 399â405.
- 6Carvalho et al. (2010) Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). âThe horseshoe estimator for sparse signals.â Biometrika , 97(2): 465â480.
- 7Donoho and Reeves (2013) Donoho, D. and Reeves, G. (2013). âAchieving Bayes MMSE performance in the sparse signal + Gaussian white noise model when the noise level is unknown.â In 2013 IEEE International Symposium on Information Theory , 101â105.
- 8Efron (2011) Efron, B. (2011). âTweedieâs Formula and Selection Bias.â Journal of the American Statistical Association , 106(496): 1602â1614.
