Reducing Sampling Ratios Improves Bagging in Sparse Regression
Luoluo Liu, Sang Peter Chin, Trac D. Tran

TL;DR
This paper demonstrates that reducing the bootstrap sampling ratio in Bagging enhances sparse regression performance, especially with fewer measurements, outperforming traditional L1 minimization and Bolasso methods.
Contribution
It introduces a generalized Bagging framework with variable bootstrap ratios for sparse regression and provides theoretical analysis of performance limits.
Findings
Lower bootstrap ratio (60%-90%) improves recovery performance.
Reduced sampling rate increases SNR by up to 24%.
A small number of estimates (K=30) suffices for good results.
Abstract
Bagging, a powerful ensemble method from machine learning, improves the performance of unstable predictors. Although the power of Bagging has been shown mostly in classification problems, we demonstrate the success of employing Bagging in sparse regression over the baseline method (L1 minimization). The framework employs the generalized version of the original Bagging with various bootstrap ratios. The performance limits associated with different choices of bootstrap sampling ratio L/m and number of estimates K is analyzed theoretically. Simulation shows that the proposed method yields state-of-the-art recovery performance, outperforming L1 minimization and Bolasso in the challenging case of low levels of measurements. A lower L/m ratio (60% - 90%) leads to better performance, especially with a small number of measurements. With the reduced sampling rate, SNR improves over the original…
Table I: Best performance of each scheme as the number of measurements grows from small through moderate and large to very large.

| The number of measurements | 50 | 75 | 100 | 125 | 150 | 175 | 200 | 500 | 1000 | 2000 |
|---|---|---|---|---|---|---|---|---|---|---|
| L1 min. | 0.12 | 0.57 | 1.00 | 1.70 | 2.19 | 2.61 | 2.97 | 6.53 | 9.46 | 12.55 |
| Conventional Bagging (L/m=1) | 0.45 | 0.94 | 1.29 | 1.86 | 2.29 | 2.70 | 3.01 | 6.22 | 9.06 | 12.10 |
| Bagging | 0.56 | 0.95 | 1.32 | 1.86 | 2.29 | 2.70 | 3.01 | 6.22 | 9.06 | 12.10 |
| Bolasso | 0.02 | 0.09 | 0.08 | 0.28 | 0.57 | 0.98 | 1.23 | 5.21 | 8.94 | 12.73 |
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning
Reducing Sampling Ratios and Increasing Number of Estimates Improve Bagging in Sparse Regression
Luoluo Liu (J), Sang (Peter) Chin (B, J), and Trac D. Tran (J)
J Department of Electrical Engineering, Johns Hopkins University, Baltimore, MD, 21210
B Department of Computer Science Hariri Institute of Computing, Boston University, Boston, MA, 02215
Abstract
Bagging, a powerful ensemble method from machine learning, has shown the ability to improve the performance of unstable predictors in difficult practical settings. Although Bagging is best known for its application in classification problems, here we demonstrate that employing Bagging in sparse regression improves performance compared to the baseline method ($\ell_1$ minimization). Although the original Bagging method uses a bootstrap sampling ratio of $L/m = 1$, such that the size $L$ of each bootstrap sample is the same as the total number of data points $m$, we generalize the bootstrap sampling ratio to explore the optimal sampling ratios for various cases.
The performance limits associated with different choices of bootstrap sampling ratio $L/m$ and number of estimates $K$ are analyzed theoretically. Simulation results show that a lower $L/m$ ratio ($60\%$ to $90\%$) leads to better performance than the conventional choice ($L/m = 1$), especially in challenging cases with low levels of measurements. With the reduced sampling rate, SNR improves over the original Bagging method by up to $24\%$ and over the base algorithm $\ell_1$ minimization by up to $367\%$. With a properly chosen sampling ratio, a reasonably small number of estimates ($K = 30$) gives a satisfying result, although increasing $K$ is found to always improve or at least maintain performance.
Index Terms:
Bootstrap, Bagging, Sparse Regression, Sparse Recovery, $\ell_1$ minimization, LASSO
I Introduction
Compressed Sensing (CS) and Sparse Regression study the linear inverse problem in the form of least squares with an additional sparsity-promoting penalty term. Formally speaking, the measurement vector $\boldsymbol{y}$ is generated by $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{z}$, where $\boldsymbol{A}$ is the sensing matrix, $\boldsymbol{x}$ is a vector of sparse coefficients with very few non-zero entries, and $\boldsymbol{z}$ is a noise vector with bounded energy. The problem of interest is finding the sparse vector $\boldsymbol{x}$ given $\boldsymbol{A}$ as well as $\boldsymbol{y}$. Among various choices of sparse regularizers, the $\ell_1$ norm is the most commonly used. The noiseless case is referred to as Basis Pursuit (BP), whereas the noisy version is known as basis pursuit denoising [1], or least absolute shrinkage and selection operator (Lasso) [2]:
$$\min_{\boldsymbol{x}} \; \lambda\|\boldsymbol{x}\|_{1} + \frac{1}{2}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{x}\|_{2}^{2}. \tag{1}$$
The performance of $\ell_1$ minimization in recovering the true sparse solution has been thoroughly investigated in the CS literature [3, 4, 5, 6]. CS theory reveals that if the sensing matrix $\boldsymbol{A}$ has good properties, then BP recovers the ground truth and the LASSO solution is close enough to the true solution with high probability [3].
Classical sparse regression recovery based on $\ell_1$ minimization solves the problem with all available measurements. In practice, it is often the case that not all measurements are available or required for recovery. Some measurements might be severely corrupted or missing, and others might be adversarial samples that break down the algorithm. These issues could lead to the failure of the sparse regression algorithm.
The Bagging procedure [7] proposed by Breiman is an efficient parallel ensemble method that improves the performance of unstable predictors. In Bagging, we first generate a bootstrap sample by randomly drawing $m$ samples uniformly with replacement from all $m$ data points. We repeat the process $K$ times and generate $K$ bootstrap samples. Then one bootstrapped estimator is computed for each bootstrap sample, and the final Bagged estimator is the average of all $K$ bootstrapped estimators.
Applying Bagging to find a sparse vector with a specific symmetric pattern was shown empirically to reduce estimation error when the sparsity level is high [7] in a forward subset selection problem. This experiment shows the possibility of using Bagging to improve other sparse regression methods on general sparse signals. Although the well-known conventional Bagging method uses the bootstrap ratio $L/m = 1$, some follow-up works have shown empirically that lower ratios improve Bagging in some classic classifiers: Nearest Neighbour Classifier [8], CART Trees [9], Linear SVM, LDA, and Logistic Linear Classifier [10]. Based on this success, we hypothesize that reducing the bootstrap ratio will also improve the performance of Bagging in sparse regression. Therefore, we set up the framework with a generic bootstrap ratio and study its behavior with various bootstrap ratios.
In this paper, we use $L$ to denote the size of each bootstrap sample, $m$ the number of all measurements, and $K$ the number of estimates. (i) We demonstrate the generalized Bagging framework with bootstrap ratio $L/m$ and number of estimates $K$ as parameters. (ii) We explore the theoretical properties associated with finite $L/m$ and $K$. (iii) We present simulation results with various parameters $L$ and $K$ and compare the performances of $\ell_1$ minimization, conventional Bagging, and Bolasso [11], another modern technique that incorporates Bagging into sparse recovery. An important discovery is that in challenging cases with small $m$, Bagging with a ratio $L/m$ that is smaller than the conventional ratio $1$ can lead to better performance.
II Proposed Method: Bagging in Sparse Regression
Our proposed method is sparse recovery using a generalized Bagging procedure. It is accomplished in three steps. First, we generate $K$ bootstrap samples, each of size $L$, randomly sampled uniformly and independently with replacement from the original $m$ data points. This results in $K$ pairs of measurements and sensing matrices: $\{\boldsymbol{y}[\mathcal{I}_{1}], \boldsymbol{A}[\mathcal{I}_{1}]\}, \{\boldsymbol{y}[\mathcal{I}_{2}], \boldsymbol{A}[\mathcal{I}_{2}]\}, \dots, \{\boldsymbol{y}[\mathcal{I}_{K}], \boldsymbol{A}[\mathcal{I}_{K}]\}$. We use the notation $[\mathcal{I}]$ on matrices or vectors to denote retaining only the rows supported on $\mathcal{I}$ and throwing away all other rows in the complement $\mathcal{I}^{c}$. Second, we solve the sparse recovery problem independently on each of those pairs; mathematically, for all $j = 1, 2, \dots, K$, we find
$$\boldsymbol{x}^{\mathcal{B}}_{j} = \operatorname*{arg\,min}_{\boldsymbol{x}} \; \lambda_{(L,K)}\|\boldsymbol{x}\|_{1} + \frac{1}{2}\|\boldsymbol{y}[\mathcal{I}_{j}]-\boldsymbol{A}[\mathcal{I}_{j}]\boldsymbol{x}\|_{2}^{2}, \tag{2}$$
where the parameter $\lambda_{(L,K)}$ balances the least squares fit and the sparsity penalty when $L$ and $K$ are the parameter choices for Bagging. The proposed approach (2) is a Lasso problem, and numerous optimization methods can be used to solve it, such as [12, 13, 14, 15].
Finally, the Bagging solution is obtained by averaging all $K$ estimators from solving (2):
$$\boldsymbol{x}^{\mathcal{B}} = \frac{1}{K}\sum_{j=1}^{K}\boldsymbol{x}^{\mathcal{B}}_{j}. \tag{3}$$
Compared to the $\ell_1$ minimization solution obtained from the usage of all the measurements, the bagged solution is obtained by resampling without increasing the number of original measurements. We will show that in some cases, the bagged solution outperforms the base $\ell_1$ minimization solution.
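The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the solver is plain ISTA (a stand-in for the Lasso solvers cited above), the objective scaling `0.5 * ||y - A x||^2 + lam * ||x||_1` is an assumption, and the names `lasso_ista`, `bagged_lasso` and all problem sizes are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - A x||_2^2 + lam * ||x||_1 with plain ISTA.

    ISTA is a stand-in chosen for brevity; it is not the solver used in the paper.
    """
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x

def bagged_lasso(A, y, L, K, lam, rng):
    """Average K Lasso estimates, each fit on L rows drawn with replacement."""
    m = A.shape[0]
    estimates = []
    for _ in range(K):
        idx = rng.integers(0, m, size=L)      # bootstrap multiset of row indices
        estimates.append(lasso_ista(A[idx], y[idx], lam))
    return np.mean(estimates, axis=0)

rng = np.random.default_rng(0)
m, n, s = 40, 60, 3                           # toy sizes, not the paper's settings
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
y = A @ x_true + 0.1 * rng.standard_normal(m)

x_bag = bagged_lasso(A, y, L=32, K=10, lam=0.5, rng=rng)   # L/m = 0.8
print(x_bag.shape)
```

Because the `K` subproblems are independent, the loop in `bagged_lasso` could be run in parallel without changing the result.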
III Preliminaries
We summarize the theoretical results of CS theory which we need to analyze our algorithm mathematically. We introduce the Null Space Property (NSP), as well as the Restricted Isometry Property (RIP). We also provide the tail bound of the sum of i.i.d. bounded random variables, which is needed to prove our theorems.
III-A Null Space Property (NSP)
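The NSP [16] for standard sparse recovery characterizes the necessary and sufficient conditions for successful sparse recovery using $\ell_1$ minimization.
Theorem 1 (NSP).
Every $s$-sparse signal $\boldsymbol{x}$ is a unique solution to $\min \|\boldsymbol{x}\|_{1} \text{ s.t. } \boldsymbol{y} = \boldsymbol{A}\boldsymbol{x}$ if and only if $\boldsymbol{A}$ satisfies NSP of order $s$. Namely, for all $\boldsymbol{v} \in \operatorname{Null}(\boldsymbol{A}) \setminus \{\boldsymbol{0}\}$ and for any set $\mathcal{S}$ of cardinality less than or equal to the sparsity level $s$, the following is satisfied:
$$\|\boldsymbol{v}[\mathcal{S}]\|_{1} < \|\boldsymbol{v}[\mathcal{S}^{c}]\|_{1}, \tag{4}$$
where $\boldsymbol{v}[\mathcal{S}]$ only has the vector values on an index set $\mathcal{S}$ and zero elsewhere.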
III-B Restricted Isometry Property (RIP)
Although NSP directly characterizes when sparse recovery succeeds, checking the NSP condition is computationally intractable. It is also not suitable to use NSP for quantifying performance in noisy conditions, since it is a binary (True or False) metric instead of a continuous range. The Restricted Isometry Property (RIP) [3] is introduced to overcome these difficulties.
Definition 2 (RIP).
A matrix $\boldsymbol{A}$ with $\ell_2$-normalized columns satisfies RIP of order $s$ if there exists a constant $\delta_{s} \in [0, 1)$ such that for every $s$-sparse $\boldsymbol{v}$, the following is satisfied:
$$(1-\delta_{s})\|\boldsymbol{v}\|_{2}^{2} \leq \|\boldsymbol{A}\boldsymbol{v}\|_{2}^{2} \leq (1+\delta_{s})\|\boldsymbol{v}\|_{2}^{2}. \tag{5}$$
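Computing the RIP constant exactly requires a search over all supports, but a sampled lower bound is easy to obtain. The sketch below is an illustration under stated assumptions: the function name `empirical_rip_lower_bound` and the problem sizes are hypothetical, and the returned value can only underestimate the true constant because it inspects a finite random set of sparse vectors.

```python
import numpy as np

def empirical_rip_lower_bound(A, s, n_trials, rng):
    """Monte Carlo lower bound on the order-s RIP constant of A.

    Samples random s-sparse unit vectors x and records the largest
    deviation | ||A x||_2^2 - 1 |. The true constant is a maximum over
    all s-sparse vectors, so this sampled value can only underestimate it.
    """
    n = A.shape[1]
    worst = 0.0
    for _ in range(n_trials):
        support = rng.choice(n, size=s, replace=False)
        x = np.zeros(n)
        x[support] = rng.standard_normal(s)
        x /= np.linalg.norm(x)
        worst = max(worst, abs(np.linalg.norm(A @ x) ** 2 - 1.0))
    return worst

rng = np.random.default_rng(0)
m, n, s = 100, 200, 5                      # toy sizes chosen for illustration
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)             # l2-normalize the columns
delta_hat = empirical_rip_lower_bound(A, s, n_trials=2000, rng=rng)
print(delta_hat)
```

Repeating the call on a row-subsampled matrix gives a rough feel for how the deviation changes when fewer rows are kept.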
III-C Noisy Recovery bounds based on RIP constants
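It is known that satisfying the RIP conditions implies that the NSP conditions are also satisfied for sparse recovery [3]. More specifically, if the RIP constant of order $2s$ is strictly less than $\sqrt{2}-1$, then NSP of order $s$ is satisfied. We recall Theorem 1.2 in [3], where the noisy recovery performance for $\ell_1$ minimization is bounded based on the RIP constant. This error bound is associated with the $s$-sparse approximation error and the noise level.
Theorem 3 (Noisy recovery for $\ell_1$ minimization [3]).
Let $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x^{\star}} + \boldsymbol{z}$, $\|\boldsymbol{z}\|_{2} \leq \epsilon$, and let $\boldsymbol{x}_{0}$ be the $s$-sparse vector that minimizes $\|\boldsymbol{x}-\boldsymbol{x^{\star}}\|_{1}$ over all $s$-sparse signals. If $\delta_{2s} < \sqrt{2}-1$ and $\boldsymbol{x}^{\ell_1}$ is the solution of $\ell_1$ minimization, then it obeys
$$\|\boldsymbol{x}^{\ell_1}-\boldsymbol{x^{\star}}\|_{2} \leq C_{0}\, s^{-1/2}\|\boldsymbol{x}_{0}-\boldsymbol{x^{\star}}\|_{1} + C_{1}\,\epsilon, \tag{6}$$
where $C_{0}, C_{1}$ are constants determined by the RIP constant $\delta_{2s}$. The forms of these two constants are $C_{0} = \frac{2(1-(1-\sqrt{2})\delta_{2s})}{1-(1+\sqrt{2})\delta_{2s}}$ and $C_{1} = \frac{4\sqrt{1+\delta_{2s}}}{1-(1+\sqrt{2})\delta_{2s}}$.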
III-D Tail bound of the sum of i.i.d. bounded Random variables
This exponential bound is similar in structure to Hoeffding's inequality. Proving this bound requires working with the moment generating function of a random variable.
Lemma 4.
Let be i.i.d. observations of bounded random variable : and the expectation exists, for any , then
[TABLE]
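As a point of comparison, the classical Hoeffding inequality (not Lemma 4 itself) can be checked numerically. The sketch below assumes the textbook one-sided form $P(\bar{Y} - \mathbb{E}Y \geq t) \leq \exp(-2nt^{2}/(b-a)^{2})$ for i.i.d. variables bounded in $[a, b]$; the uniform distribution and the values of `n` and `t` are arbitrary choices for illustration.

```python
import numpy as np

def hoeffding_bound(n, t, a, b):
    """Classical Hoeffding upper bound on P(sample mean - E[Y] >= t)."""
    return np.exp(-2.0 * n * t ** 2 / (b - a) ** 2)

rng = np.random.default_rng(0)
n, t = 50, 0.1
samples = rng.uniform(0.0, 1.0, size=(20000, n))     # Y in [0, 1], E[Y] = 0.5
empirical = np.mean(samples.mean(axis=1) - 0.5 >= t)  # observed tail frequency
bound = hoeffding_bound(n, t, 0.0, 1.0)
print(empirical, bound)
```

The observed tail frequency sits well below the bound, which is typical: exponential bounds of this type are valid for any bounded distribution and are therefore loose for a specific one.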
IV Theoretical Results for Bagging associated with sampling ratio $L/m$ and the number of estimates $K$
IV-A Noisy Recovery for Employing Bagging in Sparse Regression
We derive the performance bound for employing Bagging in sparse regression, in which the final estimate is the average over multiple estimates solved individually from bootstrap samples. We give the theoretical results for the case that the true signal $\boldsymbol{x^{\star}}$ is exactly $s$-sparse and for the general case with no assumption on the sparsity level of the ground truth signal. Note that the theorems are based on deterministic sensing matrix, measurements, and noise: $\boldsymbol{A}, \boldsymbol{y}, \boldsymbol{z}$, in which all vector norms are equivalent.
Theorem 5 (Bagging: Error bound for $\|\boldsymbol{x^{\star}}\|_{0} \leq s$).
Let , , If under the assumption that, for s that generates a set of sensing matrices , there exists a constant that relates to and : such that for all , . Let be the solution of Bagging, then for any , satisfies
[TABLE]
We also study the behavior of Bagging for a general signal $\boldsymbol{x^{\star}}$, in which the performance involves the $s$-sparse approximation error. We use the vector $\boldsymbol{e}$ to denote this error, and $\boldsymbol{e} = \boldsymbol{x^{\star}} - \boldsymbol{x}_{0}$, where $\boldsymbol{x}_{0}$ is the best $s$-sparse approximation of the ground truth signal over all $s$-sparse signals.
Theorem 6 (Bagging: Error bound for general signal recovery).
Let , , If under the assumption that, for s that generates a set of sensing matrices , there exists such that for all , . Let be the solution of Bagging, then for any , satisfies
[TABLE]
*where . *
Theorem 6 gives the performance bound for Bagging in sparse signal recovery without the sparsity assumption, and it reduces to Theorem 5 when the sparse approximation error is zero.
We give the proof sketch that demonstrates the key idea to prove both Theorem 5 and Theorem 6. The main tools are Theorem 3 and Lemma 4. Some special treatments are required to deal with terms while proving Theorem 6. For more technical details, full proofs can be found in [17].
Proof Sketch: Similar to the sufficient condition in Theorem 3, the sufficient condition to analyze Bagging is that all matrices resulting from Bagging have well-behaved RIP constants of order $2s$, bounded by a universal constant.
Let $\mathcal{I}$ denote a generic multi-set containing $L$ elements, where each element in $\mathcal{I}$ is independent and identically distributed, obeying a discrete uniform distribution on the sample space $\{1, 2, \dots, m\}$. The squared error function is $f(\boldsymbol{x}(\mathcal{I})) = \|\boldsymbol{x}(\mathcal{I})-\boldsymbol{x^{\star}}\|^{2}_{2}$, where $\boldsymbol{x}(\mathcal{I})$ is the solution from $\ell_1$ minimization on $\mathcal{I}$: $\boldsymbol{x}(\mathcal{I}) = \operatorname*{arg\,min}\|\boldsymbol{x}\|_{1}\;\text{ s.t. }\|\boldsymbol{y}[\mathcal{I}]-\boldsymbol{A}[\mathcal{I}]\boldsymbol{x}\|_{2}\leq\epsilon_{\mathcal{I}}$. The squared errors from bootstrapped estimators are realizations generated i.i.d. from the distribution of $f(\boldsymbol{x}(\mathcal{I}))$.
We proceed with the proof using Lemma 4. We choose the upper bound of the error to be a function of the expected value of noise power. We pick the bound relating to the root of the expectation of squared error $\sqrt{\mathbb{E}\|\boldsymbol{z}[\mathcal{I}]\|^{2}_{2}} = \sqrt{\frac{L}{m}}\|\boldsymbol{z}\|_{2}$. Then we need to compute the upper bound and the lower bound for the random variable $f(\boldsymbol{x}(\mathcal{I}))$. Since it is non-negative, we choose the lower bound to be $0$. The upper bound is obtained from Theorem 3 and then the maximum value is employed to further upper bound the noise level $\|\boldsymbol{z}[\mathcal{I}_{j}]\|_{2}$. Through this process, we obtain the inequality: , for some function .
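The identity $\mathbb{E}\|\boldsymbol{z}[\mathcal{I}]\|_{2}^{2} = \frac{L}{m}\|\boldsymbol{z}\|_{2}^{2}$ used above follows because each of the $L$ draws picks an index uniformly, so each draw contributes $\frac{1}{m}\|\boldsymbol{z}\|_{2}^{2}$ in expectation. A quick Monte Carlo check, with arbitrary illustrative values of `m`, `L` and a fixed random `z`:

```python
import numpy as np

rng = np.random.default_rng(0)
m, L, n_draws = 50, 30, 20000
z = rng.standard_normal(m)                      # one fixed noise vector

idx = rng.integers(0, m, size=(n_draws, L))     # each row is one bootstrap multiset
mc_mean = np.mean(np.sum(z[idx] ** 2, axis=1))  # Monte Carlo estimate of E||z[I]||_2^2
exact = (L / m) * np.sum(z ** 2)                # closed form (L/m) * ||z||_2^2
print(mc_mean, exact)
```

The two printed numbers agree up to sampling error, which shrinks as `n_draws` grows.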
The Bagging solution is the average of all bootstrapped estimators. The key inequality to establish is as follows:
[TABLE]
The first term is independent of the second term and it is true with probability $1$ by Jensen's inequality. This establishes the relationship between the error bound of the Bagging solution and the sum of squared errors of the bootstrapped estimates. To obtain the bound for the second term, we follow the method described in the previous paragraph.
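The Jensen step says that the squared error of an average is never larger than the average of the squared errors, because the squared $\ell_2$ norm is convex. The following sketch checks this on synthetic vectors; the "estimates" are random perturbations of a made-up ground truth, not outputs of any sparse solver.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 30, 20
x_star = rng.standard_normal(n)
estimates = x_star + rng.standard_normal((K, n))   # stand-ins for bootstrapped estimates

# Squared error of the averaged estimate.
err_of_average = np.sum((estimates.mean(axis=0) - x_star) ** 2)
# Average of the individual squared errors.
average_of_errs = np.mean(np.sum((estimates - x_star) ** 2, axis=1))
print(err_of_average <= average_of_errs)
```

The inequality holds for any set of estimates, which is why this part of the argument holds with probability one.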
IV-B Parameters Selection Guided by the Theoretical Analysis
Besides analyzing error bounds for general signals whose sparsity levels might exceed $s$, Theorem 6 can be used in analyzing cases when is not large enough for the sparsity level . Theorems 5 and 6 also guide us to optimal choices of parameters: the bootstrap sampling ratio $L/m$ and the number of estimates $K$.
Both Theorem 5 and Theorem 6 show that increasing the number of estimates $K$ improves the result, by raising the lower bound on the certainty with which the same performance is achieved. The growth rate of this certainty bound decreases with $K$. We validate this in our numerical experiment: even though increasing $K$ improves the results, the performance tends to flatten out for large $K$.
The sampling ratio $L/m$ affects the result through two factors. The first one is the RIP constant, which in general decreases with increasing $L$ (proved in [18] with a Gaussian assumption on the sensing matrix). Since the constant $C_{1}$ is a non-decreasing function of $\delta$ and a larger $L$ usually results in a smaller $\delta$, a larger $L$ in general results in a smaller $C_{1}$. On the other hand, the second factor is the multiplier of the noise power term, which is $\sqrt{L/m}$, suggesting a smaller $L$.
Combining these two factors indicates that the best $L/m$ ratio is somewhere in between a small and a large number. In the experiment results, we demonstrate that when $m$ is small, varying the bootstrap sampling ratio $L/m$ creates peaks with the largest value at $L/m < 1$. The first factor, which relates to the RIP constant, dominates in the stable case (when $m$ is sufficiently large), so that larger $L$ leads to better performance.
V Simulations
In this section, we perform sparse recovery on simulated data to study the performance of our algorithm. In our experiment, all entries of $\boldsymbol{A}$ are i.i.d. samples from the standard normal distribution $\mathcal{N}(0, 1)$. The signal dimension and various numbers of measurements $m$ from $50$ to $2000$ are explored. For the ground truth signals, their sparsity levels are all , and the non-zero entries are sampled from the standard Gaussian with their locations being generated uniformly at random. For the noise processes $\boldsymbol{z}$, entries are sampled i.i.d. from , with variance , where SNR represents the Signal to Noise Ratio. We add white Gaussian noise to make the dB. All numerical realizations have finite values. We use the ADMM [12] implementation of Lasso to solve all sparse regression problems, in which the parameter $\lambda_{(L,K)}$ balances the least squares fit and the sparsity penalty for the case with $L, K$ as parameters.
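A data generator in the spirit of this setup might look as follows. It is a sketch under explicit assumptions: the function name `make_problem`, the sizes `m`, `n`, `s` and the target `snr_db` are placeholders rather than the paper's settings, and the noise is rescaled per realization so that $10\log_{10}(\|\boldsymbol{A}\boldsymbol{x}\|_2^2 / \|\boldsymbol{z}\|_2^2)$ hits the target exactly, which may differ from the variance rule used by the authors.

```python
import numpy as np

def make_problem(m, n, s, snr_db, rng):
    """Draw a Gaussian sensing matrix, an s-sparse signal and scaled noise."""
    A = rng.standard_normal((m, n))
    x = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)   # uniformly random support
    x[support] = rng.standard_normal(s)
    clean = A @ x
    z = rng.standard_normal(m)
    # Rescale z so that 10*log10(||A x||^2 / ||z||^2) equals snr_db exactly.
    # This per-realization definition of SNR is an assumption of this sketch.
    z *= np.linalg.norm(clean) / (np.linalg.norm(z) * 10.0 ** (snr_db / 20.0))
    return A, x, clean + z, z

rng = np.random.default_rng(0)
A, x, y, z = make_problem(m=50, n=120, s=10, snr_db=0.0, rng=rng)
measured_snr_db = 10.0 * np.log10(np.sum((A @ x) ** 2) / np.sum(z ** 2))
print(measured_snr_db)
```

Fixing the seed of `rng` makes every scheme under comparison see the same problem instance.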
We study how the bootstrap sampling ratio $L/m$ as well as the number of estimates $K$ affect the result. In our experiment, we take and from to . We report the Signal to Noise Ratio (SNR) as the error measure for recovery: averaged over independent trials. For all algorithms, we evaluate at different values from to and then select optimal values that give the maximum averaged SNR over all trials.
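The evaluation loop described here can be sketched with two small helpers. The SNR formula $10\log_{10}(\|\boldsymbol{x}\|_2^2 / \|\boldsymbol{x}-\hat{\boldsymbol{x}}\|_2^2)$ is the usual definition and is assumed here; the helper names and all numbers in the example are made up for illustration.

```python
import numpy as np

def recovery_snr_db(x_true, x_hat):
    """Recovered SNR in dB: 10*log10(||x||^2 / ||x - x_hat||^2) (assumed definition)."""
    return 10.0 * np.log10(np.sum(x_true ** 2) / np.sum((x_true - x_hat) ** 2))

def best_parameter(snr_table, lambdas):
    """Pick the lambda whose SNR, averaged over trials (rows), is largest."""
    mean_snr = snr_table.mean(axis=0)
    best = int(np.argmax(mean_snr))
    return lambdas[best], mean_snr[best]

x_true = np.array([1.0, 0.0, -2.0, 0.0])
x_hat = np.array([0.9, 0.0, -1.8, 0.1])
snr_value = recovery_snr_db(x_true, x_hat)
print(snr_value)

# Rows are trials, columns are candidate lambda values (made-up numbers).
snr_table = np.array([[0.2, 1.1, 0.7],
                      [0.4, 0.9, 0.8]])
best_lam, best_snr = best_parameter(snr_table, [0.1, 1.0, 10.0])
print(best_lam, best_snr)
```

Averaging over trials before taking the maximum, as done in `best_parameter`, selects a single parameter value for all trials instead of tuning each trial separately.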
V-A Performance of Bagging, Bolasso and minimization
Bagging and Bolasso with the various parameters $L$, $K$ and $\ell_1$ minimization are studied. The results are plotted in Figure 1. The colored curves show the cases of Bagging with various numbers of estimates $K$. The intersections of the colored curves and the purple solid vertical lines at $L/m = 1$ illustrate conventional Bagging with a full bootstrap rate. The grey circle highlights the best performance and the grey area highlights the optimal bootstrap ratio $L/m$. The performance of $\ell_1$ minimization is depicted by the black dashed lines, while the best Bolasso performance is plotted using light green dashed lines. In those figures, for each condition with a choice of $L$ and $K$, the information available to the Bagging and Bolasso algorithms is identical, and $\ell_1$ minimization always has access to all $m$ measurements.
From Figure 1, we see that when $m$ is small, Bagging can outperform $\ell_1$ minimization. As $m$ decreases, the margin increases. The important observation is that when the number of measurements is low ($m$ is between to : , $s$ is the sparsity level), by using a reduced bootstrap ratio ($L/m$ of $60\%$ to $90\%$), Bagging beats the conventional choice of the full ratio $1$ for all different choices of $K$. Also, with a reduced ratio and a small $K$, our algorithm is already quite robust and outperforms $\ell_1$ minimization by a large margin. When the number of measurements is moderate , Bagging still beats the baseline; however, the optimal parameters here are bootstrap ratio and the number of estimates . In this case, the reduced bootstrap ratio does not bring any performance improvement. Increasing the number of measurements makes the base algorithm more stable, and the advantage of Bagging starts decaying.
We perform the same experiments with a higher number of measurements $m$, and Table I illustrates the best performance for various schemes: $\ell_1$ minimization, the original Bagging scheme with a full bootstrap ratio, Bagging, and Bolasso with dB. For Bagging, the peak values are found among the different choices of parameters $L$ and $K$ that we explored. We see that when the number of measurements is small (), Bagging outperforms $\ell_1$ minimization. The reduced bootstrap rate also improves on conventional Bagging, and the improvement is significant: $24\%$ on SNR when $m = 50$. When $m$ is moderate (), choosing reduced rates does not improve the performance compared to conventional Bagging. Bagging still outperforms $\ell_1$ minimization, with smaller margins than in the cases with small $m$. When $m$ is large (), Bagging starts losing its advantage over $\ell_1$ minimization. Bolasso only performs similarly to the other algorithms in the easiest case, for an extremely large $m$ ($m = 2000$), where it slightly outperforms all other algorithms.
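The headline percentages can be reproduced from the $m = 50$ column of the table. The helper name `relative_gain_percent` is introduced here only for illustration.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# SNR entries of the m = 50 column of the table.
l1_min, conventional_bagging, bagging = 0.12, 0.45, 0.56

gain_over_conventional = round(relative_gain_percent(bagging, conventional_bagging))
gain_over_l1 = round(relative_gain_percent(bagging, l1_min))
print(gain_over_conventional)  # 24
print(gain_over_l1)            # 367
```

For the other small-$m$ columns the gain over conventional Bagging is much smaller (for example 0.95 versus 0.94 at $m = 75$), so the 24% figure is a best case.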
VI Conclusion
We extend the conventional Bagging scheme in sparse recovery with the bootstrap sampling ratio $L/m$ as an adjustable parameter and derive error bounds for the algorithm associated with $L/m$ and the number of estimates $K$. Bagging is particularly powerful when the number of measurements $m$ is small. Although this condition is notoriously difficult, both in terms of improving sparse recovery results and obtaining tight bounds of theoretical properties, Bagging outperforms $\ell_1$ minimization by a large margin (up to 367%). Moreover, the reduced sampling rate improves the recovered SNR over the conventional Bagging algorithm by up to $24\%$.
Our Bagging scheme achieves acceptable performance even with very small $L/m$ (around ) and relatively small $K$ (around in our experimental study). The error bounds for Bagging predict that a smaller sampling rate can lead to performance improvement and that increasing $K$ improves the certainty of the bound. Both are validated in our numerical simulation. For a sequential system, a reasonably large $K$ (around ) is enough to obtain a fairly good solution. For a parallel system that allows a large number of processes to be run at the same time, a large $K$ is preferred since it in general gives a better result.
VII Acknowledgement
We would like to thank Dr. Dror Baron for insightful comments and suggestions, Dr. Cindy Rush for thoughtful feedback, and Nicholas Huang for help in polishing the manuscript, all of which improved the overall quality of our paper.
References
- [1] S. Chen, D. L. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
- [2] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. of the Royal Stat. Society. Series B, pages 267–288, 1996.
- [3] E. J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
- [4] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Info. Theory, 52(2):489–509, 2006.
- [5] D. L. Donoho. Compressed sensing. IEEE Trans. on Info. Theory, 52(4):1289–1306, 2006.
- [6] E. Candès and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Prob., 23(3):969, 2007.
- [7] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
- [8] P. Hall and R. J. Samworth. Properties of bagged nearest neighbour classifiers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):363–379, 2005.
