Reducing Sampling Ratios Improves Bagging in Sparse Regression

Luoluo Liu; Sang Peter Chin; Trac D. Tran

arXiv:1812.08808·stat.ML·May 3, 2019


TL;DR

This paper demonstrates that reducing the bootstrap sampling ratio in Bagging enhances sparse regression performance, especially with fewer measurements, outperforming traditional L1 minimization and Bolasso methods.

Contribution

It introduces a generalized Bagging framework with variable bootstrap ratios for sparse regression and provides theoretical analysis of performance limits.

Findings

1. A lower bootstrap ratio (60%-90%) improves recovery performance.
2. The reduced sampling rate improves SNR over conventional Bagging by up to 24%.
3. A small number of estimates (K=30) suffices for good results.

Abstract

Bagging, a powerful ensemble method from machine learning, improves the performance of unstable predictors. Although the power of Bagging has been shown mostly in classification problems, we demonstrate the success of employing Bagging in sparse regression over the baseline method (L1 minimization). The framework employs the generalized version of the original Bagging with various bootstrap ratios. The performance limits associated with different choices of bootstrap sampling ratio L/m and number of estimates K are analyzed theoretically. Simulation shows that the proposed method yields state-of-the-art recovery performance, outperforming L1 minimization and Bolasso in the challenging case of low levels of measurements. A lower L/m ratio (60% - 90%) leads to better performance, especially with a small number of measurements. With the reduced sampling rate, SNR improves over the original…

Tables

Table I: The performance of $\ell_{1}$ minimization and the best performance among all choices of $L$ and $K$ for Bagging and Bolasso, for various total numbers of measurements $m$, at $\text{SNR}=0$ dB. All performances are measured by the averaged recovered SNR (dB).

| Number of measurements $m$ | 50 | 75 | 100 | 125 | 150 | 175 | 200 | 500 | 1000 | 2000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Regime | Small | Small | Small | Moderate | Moderate | Moderate | Moderate | Large | Large | Very large |
| $\ell_{1}$ min. | 0.12 | 0.57 | 1.00 | 1.70 | 2.19 | 2.61 | 2.97 | 6.53 | 9.46 | 12.55 |
| Conventional Bagging ($L/m=1$) | 0.45 | 0.94 | 1.29 | 1.86 | 2.29 | 2.70 | 3.01 | 6.22 | 9.06 | 12.10 |
| Bagging | 0.56 | 0.95 | 1.32 | 1.86 | 2.29 | 2.70 | 3.01 | 6.22 | 9.06 | 12.10 |
| Bolasso | 0.02 | 0.09 | 0.08 | 0.28 | 0.57 | 0.98 | 1.23 | 5.21 | 8.94 | 12.73 |

Equations

$$\mathbf{P_{1}^{\lambda}}:\ \min_{\boldsymbol{x}}\ \lambda\|\boldsymbol{x}\|_{1}+\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{x}\|_{2}^{2}. \tag{1}$$

$$\boldsymbol{x}^{\boldsymbol{B}}_{j}=\operatorname*{arg\,min}_{\boldsymbol{x}\in\mathbb{R}^{n}}\ \lambda_{(L,K)}\|\boldsymbol{x}\|_{1}+\frac{1}{2}\|\boldsymbol{y}[\mathcal{I}_{j}]-\boldsymbol{A}[\mathcal{I}_{j}]\boldsymbol{x}\|_{2}^{2}, \tag{2}$$

$$\text{Bagging}:\quad \boldsymbol{x^{B}}=\frac{1}{K}\sum_{j=1}^{K}\boldsymbol{x}^{\boldsymbol{B}}_{j}. \tag{3}$$

$$\|\boldsymbol{v}[\mathcal{S}]\|_{1}<\|\boldsymbol{v}[\mathcal{S}^{c}]\|_{1}, \tag{4}$$

$$(1-\delta_{s}(\boldsymbol{A}))\|\boldsymbol{v}\|_{2}^{2}\leq\|\boldsymbol{A}\boldsymbol{v}\|_{2}^{2}\leq(1+\delta_{s}(\boldsymbol{A}))\|\boldsymbol{v}\|_{2}^{2}. \tag{5}$$

$$\|\boldsymbol{x}^{\boldsymbol{\ell_{1}}}-\boldsymbol{x^{\star}}\|_{2}\leq\mathcal{C}_{0}(\delta)\,s^{-1/2}\|\boldsymbol{x}_{0}-\boldsymbol{x^{\star}}\|_{1}+\mathcal{C}_{1}(\delta)\,\epsilon, \tag{6}$$

$$\mathbb{P}\Big\{\sum_{i=1}^{n}Y_{i}\geq n\xi\Big\}\leq\exp\Big\{-\frac{2n(\xi-\mathbb{E}Y)^{2}}{(b-a)^{2}}\Big\}. \tag{7}$$

$$\mathbb{P}\Big\{\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}\leq\mathcal{C}_{1}(\delta_{(L,K)})\Big(\sqrt{\tfrac{L}{m}}\|\boldsymbol{z}\|_{2}+\tau\Big)\Big\}\geq 1-\exp\frac{-2K\tau^{4}}{L^{2}\|\boldsymbol{z}\|_{\infty}^{4}}. \tag{8}$$

$$\mathbb{P}\Big\{\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}\leq\mathcal{C}_{0}(\delta_{(L,K)})\,s^{-1/2}\|\boldsymbol{e}\|_{1}+\mathcal{C}_{1}(\delta_{(L,K)})\Big(\sqrt{\tfrac{L}{m}}\|\boldsymbol{z}\|_{2}+\tau\Big)\Big\}\geq 1-\exp\frac{-2K\,\mathcal{C}_{1}^{4}(\delta_{(L,K)})\,\tau^{4}}{(b^{\prime})^{2}}, \tag{9}$$

$$\begin{aligned}
&\mathbb{P}\Big\{\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}-\xi\leq 0\Big\}\\
&=\mathbb{P}\Big\{K\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}-\sum_{j}f(\boldsymbol{x}_{j})+\sum_{j}f(\boldsymbol{x}_{j})-K\xi\leq 0\Big\}\\
&\geq\mathbb{P}\Big\{K\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}-\sum_{j}f(\boldsymbol{x}_{j})\leq 0,\ \sum_{j}f(\boldsymbol{x}_{j})-K\xi\leq 0\Big\}\\
&=\mathbb{P}\Big\{K\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}-\sum_{j}f(\boldsymbol{x}_{j})\leq 0\Big\}\,\mathbb{P}\Big\{\sum_{j}f(\boldsymbol{x}_{j})-K\xi\leq 0\Big\}\\
&=\mathbb{P}\Big\{\sum_{j}\|\boldsymbol{x}^{\boldsymbol{B}}_{j}-\boldsymbol{x^{\star}}\|_{2}^{2}-K\xi\leq 0\Big\}.
\end{aligned} \tag{10}$$


Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning

Full text

Reducing Sampling Ratios and Increasing Number of Estimates Improve Bagging in Sparse Regression

Luoluo Liu $^{J}$, Sang (Peter) Chin $^{B,J}$, and Trac D. Tran $^{J}$

$^{J}$ Department of Electrical Engineering, Johns Hopkins University, Baltimore, MD, 21210

$^{B}$ Department of Computer Science & Hariri Institute of Computing, Boston University, Boston, MA, 02215

Abstract

Bagging, a powerful ensemble method from machine learning, has shown the ability to improve the performance of unstable predictors in difficult practical settings. Although Bagging is best known for its application in classification problems, here we demonstrate that employing Bagging in sparse regression improves performance compared to the baseline method ( $\ell_{1}$ minimization). The original Bagging method uses a bootstrap sampling ratio of $1$ , so that the size $L$ of each bootstrap sample equals the total number of data points $m$ ; we generalize the bootstrap sampling ratio to explore the optimal sampling ratios for various cases.

The performance limits associated with different choices of bootstrap sampling ratio $L/m$ and number of estimates $K$ are analyzed theoretically. Simulation results show that a lower $L/m$ ratio ( $0.6-0.9$ ) leads to better performance than the conventional choice ( $L/m=1$ ), especially in challenging cases with low levels of measurements. With the reduced sampling rate, SNR improves over the original Bagging method by up to $24\%$ and over the base algorithm $\ell_{1}$ minimization by up to $367\%$ . With a properly chosen sampling ratio, a reasonably small number of estimates ( $K=30$ ) gives a satisfactory result, although increasing $K$ is found to always improve or at least maintain performance.

Index Terms:

Bootstrap, Bagging, Sparse Regression, Sparse Recovery, $\ell_{1}$ minimization, LASSO

I Introduction

Compressed Sensing (CS) and Sparse Regression study the linear inverse problem posed as least squares with an additional sparsity-promoting penalty term. Formally, the measurement vector ${\boldsymbol{y}}\in\mathbb{R}^{m}$ is generated by ${\boldsymbol{y}}={\boldsymbol{A}}{\boldsymbol{x}}+{\boldsymbol{z}}$ , where ${\boldsymbol{A}}\in\mathbb{R}^{m\times n}$ is the sensing matrix, ${\boldsymbol{x}}\in\mathbb{R}^{n}$ is a vector of sparse coefficients with very few non-zero entries, and ${\boldsymbol{z}}$ is a noise vector with bounded energy. The problem of interest is finding the sparse vector ${\boldsymbol{x}}$ given ${\boldsymbol{A}}$ and ${\boldsymbol{y}}$ . Among various choices of sparse regularizers, the $\ell_{1}$ norm is the most commonly used. The noiseless case is referred to as Basis Pursuit (BP), whereas the noisy version is known as basis pursuit denoising [1], or least absolute shrinkage and selection operator (Lasso) [2]:

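$$\mathbf{P_{1}^{\lambda}}:\ \min_{\boldsymbol{x}}\ \lambda\|\boldsymbol{x}\|_{1}+\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{A}\boldsymbol{x}\|_{2}^{2}. \tag{1}$$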

The performance of $\ell_{1}$ minimization in recovering the true sparse solution has been thoroughly investigated in the CS literature [3, 4, 5, 6]. CS theory reveals that if the sensing matrix ${\boldsymbol{A}}$ has good properties, then BP recovers the ground truth and the LASSO solution is close enough to the true solution with high probability [3].

Classical sparse regression recovery based on $\ell_{1}$ minimization solves the problem with all available measurements. In practice, it is often the case that not all measurements are available or required for recovery. Some measurements might be severely corrupted or missing, or might be adversarial samples that break the algorithm. These issues could lead to the failure of the sparse regression algorithm.

The Bagging procedure [7] proposed by Breiman is an efficient parallel ensemble method that improves the performance of unstable predictors. In Bagging, we first generate a bootstrap sample by randomly drawing $m$ samples uniformly with replacement from all $m$ data points. We repeat the process $K$ times and generate $K$ bootstrap samples. Then one bootstrapped estimator is computed for each bootstrap sample, and the final Bagged estimator is the average of all $K$ bootstrapped estimators.

Applying Bagging to find a sparse vector with a specific symmetric pattern was shown empirically to reduce estimation error when the sparsity level $s$ is high [7] in a forward subset selection problem. This experiment shows the possibility of using Bagging to improve other sparse regression methods on general sparse signals. Although the well-known conventional Bagging method uses the bootstrap ratio $100\%$ , some follow-up works have shown empirically that lower ratios improve Bagging in some classic classifiers: Nearest Neighbour Classifier [8], CART Trees [9], Linear SVM, LDA, and Logistic Linear Classifier [10]. Based on this success, we hypothesize that reducing the bootstrap ratio will also improve performance of Bagging in sparse regression. Therefore, we set up the framework with a generic bootstrap ratio and study its behavior with various bootstrap ratios.

In this paper, we denote by $L$ the size of each bootstrap sample, by $m$ the total number of measurements, and by $K$ the number of estimates. (i) We present the generalized Bagging framework with the bootstrap ratio $L/m$ and the number of estimates $K$ as parameters. (ii) We explore the theoretical properties associated with finite $L/m$ and $K$ . (iii) We present simulation results with various parameters $L/m$ and $K$ and compare the performances of $\ell_{1}$ minimization, conventional Bagging, and Bolasso [11], another modern technique that incorporates Bagging into sparse recovery. An important discovery is that in challenging cases with small $m$ , Bagging with a ratio $L/m$ smaller than the conventional ratio $1$ can lead to better performance.

II Proposed Method: Bagging in Sparse Regression

Our proposed method is sparse recovery using a generalized Bagging procedure. It is accomplished in three steps. First, **we generate $K$ bootstrap samples, each of size $L$ , sampled uniformly and independently with replacement from the original $m$ data points.** This results in $K$ pairs of measurement vectors and sensing matrices: $\{\boldsymbol{y}[\mathcal{I}_{1}],\boldsymbol{A}[\mathcal{I}_{1}]\},\{\boldsymbol{y}[\mathcal{I}_{2}],\boldsymbol{A}[\mathcal{I}_{2}]\},\ldots,\{\boldsymbol{y}[\mathcal{I}_{K}],\boldsymbol{A}[\mathcal{I}_{K}]\}$ . We use the notation $(\cdot)[{\mathcal{I}}]$ on matrices or vectors to denote retaining only the rows indexed by ${\mathcal{I}}$ and discarding all other rows, those in the complement ${\mathcal{I}}^{c}$ . Second, we solve the sparse recovery problem independently on each of those pairs; mathematically, for all $j=1,2,\ldots,K$ , we find

$$\boldsymbol{x}^{\boldsymbol{B}}_{j}=\operatorname*{arg\,min}_{\boldsymbol{x}\in\mathbb{R}^{n}}\ \lambda_{(L,K)}\|\boldsymbol{x}\|_{1}+\frac{1}{2}\|\boldsymbol{y}[\mathcal{I}_{j}]-\boldsymbol{A}[\mathcal{I}_{j}]\boldsymbol{x}\|_{2}^{2}, \tag{2}$$

where the parameter $\lambda_{(L,K)}$ balances the least squares fit against the sparsity penalty for the Bagging parameter choice $(L,K)$ . Problem (2) is a Lasso problem, and numerous optimization methods can be used to solve it, such as [12, 13, 14, 15].

Finally, the Bagging solution is obtained by averaging all $K$ estimators from solving (2):

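$$\text{Bagging}:\quad \boldsymbol{x^{B}}=\frac{1}{K}\sum_{j=1}^{K}\boldsymbol{x}^{\boldsymbol{B}}_{j}. \tag{3}$$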

Compared to the $\ell_{1}$ minimization solution obtained using all the measurements, the bagged solution ${\boldsymbol{x^{B}}}$ is obtained by resampling, without increasing the number of original measurements. We will show that in some cases, the bagged solution outperforms the base $\ell_{1}$ minimization solution.
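The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the per-sample Lasso problem is solved here with plain ISTA (iterative soft-thresholding), swapped in for brevity in place of the solvers cited in the text, and the function names, the iteration count, and the toy problem sizes are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(A, y, lam, n_iter=500):
    """Approximately solve min_x lam*||x||_1 + 0.5*||y - A x||_2^2 with ISTA."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x


def bagged_lasso(A, y, L, K, lam, rng):
    """Average K Lasso estimates, each fit on L rows drawn with replacement."""
    m = A.shape[0]
    estimates = []
    for _ in range(K):
        idx = rng.integers(0, m, size=L)  # multi-set I_j, uniform with replacement
        estimates.append(lasso_ista(A[idx], y[idx], lam))
    return np.mean(estimates, axis=0)


# Toy problem (sizes chosen only to keep the example fast).
rng = np.random.default_rng(0)
m, n, s = 60, 40, 5
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
y = A @ x_true + 0.1 * rng.standard_normal(m)

x_bag = bagged_lasso(A, y, L=int(0.8 * m), K=10, lam=1.0, rng=rng)
print(x_bag.shape)
```

Setting `L = m` in `bagged_lasso` gives conventional Bagging; the `K` fits are independent, so the loop could be run in parallel.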

III Preliminaries

We summarize the theoretical results of CS theory which we need to analyze our algorithm mathematically. We introduce the Null Space Property (NSP), as well as the Restricted Isometry Property (RIP). We also provide the tail bound of the sum of i.i.d. bounded random variables, which is needed to prove our theorems.

III-A Null Space Property (NSP)

The NSP [16] for standard sparse recovery characterizes the necessary and sufficient conditions for successful sparse recovery using $\ell_{1}$ minimization.

Theorem 1 (NSP).

Every $s$-sparse signal ${\boldsymbol{x}}\in\mathbb{R}^{n}$ is the unique solution to $\mathrm{\mathbf{P_{1}}}:\ \min\|{\boldsymbol{x}}\|_{1}\;\text{ s.t. }{\boldsymbol{y}}={\boldsymbol{A}}{\boldsymbol{x}}$ if and only if ${\boldsymbol{A}}$ satisfies the NSP of order $s$ . Namely, for all ${\boldsymbol{v}}\in\textup{Null}{({\boldsymbol{A}})}\backslash\{{\mathbf{0}}\}$ and for any index set $\mathcal{S}\subset\{1,2,\ldots,n\}$ with $\text{card}(\mathcal{S})\leq s$ , the following is satisfied:

$$\|\boldsymbol{v}[\mathcal{S}]\|_{1}<\|\boldsymbol{v}[\mathcal{S}^{c}]\|_{1}, \tag{4}$$

where $\boldsymbol{v}[\mathcal{S}]$ keeps the entries of ${\boldsymbol{v}}$ on the index set $\mathcal{S}$ and is zero elsewhere.

III-B Restricted Isometry Property (RIP)

Although the NSP directly characterizes when sparse recovery succeeds, checking the NSP condition is computationally intractable. The NSP is also unsuitable for quantifying performance in noisy conditions, since it is a binary (true or false) criterion rather than a continuous measure. The Restricted Isometry Property (RIP) [3] is introduced to overcome these difficulties.

Definition 2 (RIP).

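A matrix ${\boldsymbol{A}}$ with $\ell_{2}$ -normalized columns satisfies the RIP of order $s$ if there exists a constant $\delta_{s}({\boldsymbol{A}})\in[0,1)$ such that for every $s$-sparse ${\boldsymbol{v}}\in\mathbb{R}^{n}$ , the following is satisfied:

$$(1-\delta_{s}(\boldsymbol{A}))\|\boldsymbol{v}\|_{2}^{2}\leq\|\boldsymbol{A}\boldsymbol{v}\|_{2}^{2}\leq(1+\delta_{s}(\boldsymbol{A}))\|\boldsymbol{v}\|_{2}^{2}. \tag{5}$$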
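For very small matrices, the smallest constant satisfying the RIP inequality can be computed by brute force, which may help make the definition concrete. The sketch below enumerates every support of size $s$ and measures how far the eigenvalues of the corresponding Gram matrix deviate from $1$; the function name `rip_constant` is hypothetical, and the cost grows combinatorially in $n$ and $s$, so this is only an illustration and not a practical test.

```python
from itertools import combinations

import numpy as np


def rip_constant(A, s):
    """Brute-force the smallest delta with (1-delta)||v||^2 <= ||Av||^2 <= (1+delta)||v||^2
    for all s-sparse v, by scanning every column subset of size s."""
    n = A.shape[1]
    delta = 0.0
    for support in combinations(range(n), s):
        sub = A[:, list(support)]
        eigvals = np.linalg.eigvalsh(sub.T @ sub)  # eigenvalues in ascending order
        delta = max(delta, abs(eigvals[-1] - 1.0), abs(1.0 - eigvals[0]))
    return delta


rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
A /= np.linalg.norm(A, axis=0)  # l2-normalize the columns, as in the definition
print(rip_constant(A, 2))
```

The returned value can exceed $1$, in which case the matrix does not satisfy the RIP of that order in the sense of the definition.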

III-C Noisy Recovery bounds based on RIP constants

It is known that the RIP conditions imply the NSP conditions for sparse recovery [3]. More specifically, if the RIP constant of order $2s$ is strictly less than $\sqrt{2}-1$ , then the NSP of order $s$ is satisfied. We recall Theorem 1.2 in [3], which bounds the noisy recovery performance of $\ell_{1}$ minimization in terms of the RIP constant. This error bound depends on the $s$-sparse approximation error and the noise level.

Theorem 3 (Noisy recovery for $\ell_{1}$ minimization [3]).

Let ${\boldsymbol{y}}={\boldsymbol{A}}\boldsymbol{x^{\star}}+{\boldsymbol{z}}$ with $\|{\boldsymbol{z}}\|_{2}\leq\epsilon$ , and let ${\boldsymbol{x}}_{0}$ be the $s$-sparse vector that minimizes $\|{\boldsymbol{x}}-\boldsymbol{x^{\star}}\|$ over all $s$-sparse signals. If $\delta_{2s}({\boldsymbol{A}})\leq\delta<\sqrt{2}-1$ and ${\boldsymbol{x}}^{\boldsymbol{\ell_{1}}}$ is the solution of $\ell_{1}$ minimization, then it obeys

$$\|\boldsymbol{x}^{\boldsymbol{\ell_{1}}}-\boldsymbol{x^{\star}}\|_{2}\leq\mathcal{C}_{0}(\delta)\,s^{-1/2}\|\boldsymbol{x}_{0}-\boldsymbol{x^{\star}}\|_{1}+\mathcal{C}_{1}(\delta)\,\epsilon, \tag{6}$$

where ${\mathcal{C}_{0}}(\cdot),{\mathcal{C}_{1}}(\cdot)$ are constants determined by the RIP constant $\delta_{2s}$ . They are given by ${\mathcal{C}_{0}}(\delta)=\frac{2(1-(1-\sqrt{2})\delta)}{1-(1+\sqrt{2})\delta}$ and ${\mathcal{C}_{1}}(\delta)=\frac{4\sqrt{1+\delta}}{1-(1+\sqrt{2})\delta}$ .
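To get a feel for how quickly the two constants grow as $\delta$ approaches $\sqrt{2}-1$, they can be evaluated directly from the closed forms above. The helper names `c0` and `c1` are chosen for this sketch only.

```python
import math

SQRT2 = math.sqrt(2.0)
DELTA_MAX = SQRT2 - 1.0  # the bound requires delta < sqrt(2) - 1


def _check(delta):
    if not 0.0 <= delta < DELTA_MAX:
        raise ValueError("delta must lie in [0, sqrt(2) - 1)")


def c0(delta):
    """C_0(delta) = 2(1 - (1 - sqrt2) delta) / (1 - (1 + sqrt2) delta)."""
    _check(delta)
    return 2.0 * (1.0 - (1.0 - SQRT2) * delta) / (1.0 - (1.0 + SQRT2) * delta)


def c1(delta):
    """C_1(delta) = 4 sqrt(1 + delta) / (1 - (1 + sqrt2) delta)."""
    _check(delta)
    return 4.0 * math.sqrt(1.0 + delta) / (1.0 - (1.0 + SQRT2) * delta)


for delta in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(f"delta={delta:.1f}  C0={c0(delta):8.2f}  C1={c1(delta):8.2f}")
```

At $\delta=0$ the constants are $\mathcal{C}_{0}=2$ and $\mathcal{C}_{1}=4$; both increase with $\delta$ and blow up as the denominator $1-(1+\sqrt{2})\delta$ tends to zero.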

III-D Tail bound of the sum of i.i.d. bounded Random variables

This exponential bound is similar in structure to Hoeffding's inequality. Proving this bound requires working with the moment generating function of a random variable.

Lemma 4.

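Let $Y_{1},Y_{2},\ldots,Y_{n}$ be i.i.d. observations of a bounded random variable $Y$ with $a\leq Y\leq b$ , whose expectation ${\mathbb{E}}Y$ exists. Then for any $\xi>0$ ,

$$\mathbb{P}\Big\{\sum_{i=1}^{n}Y_{i}\geq n\xi\Big\}\leq\exp\Big\{-\frac{2n(\xi-\mathbb{E}Y)^{2}}{(b-a)^{2}}\Big\}. \tag{7}$$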
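A quick Monte Carlo check can illustrate how conservative this kind of tail bound is. The sketch below draws uniform random variables on $[0,1]$ (so $a=0$, $b=1$, ${\mathbb{E}}Y=0.5$) and takes a threshold $\xi$ above the mean, which is the regime where the bound is informative; the distribution, sample sizes, and names are assumptions made for the illustration.

```python
import numpy as np


def tail_bound(n, xi, mean, a, b):
    """Right-hand side of the lemma: exp(-2 n (xi - E Y)^2 / (b - a)^2)."""
    return float(np.exp(-2.0 * n * (xi - mean) ** 2 / (b - a) ** 2))


rng = np.random.default_rng(0)
n, xi, trials = 50, 0.6, 20000

# Each row holds n i.i.d. draws of Y ~ Uniform[0, 1].
samples = rng.uniform(0.0, 1.0, size=(trials, n))
empirical_tail = float(np.mean(samples.sum(axis=1) >= n * xi))
bound = tail_bound(n, xi, mean=0.5, a=0.0, b=1.0)

print(f"empirical tail = {empirical_tail:.4f}, bound = {bound:.4f}")
```

The empirical frequency of the event should sit well below the bound, which is expected: the bound holds for every bounded distribution and so cannot be tight for any particular one.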

IV Theoretical Results for Bagging associated with sampling ratio $L/m$ and the number of estimates $K$

IV-A Noisy Recovery for Employing Bagging in Sparse Regression

We derive the performance bound for employing Bagging in sparse regression, in which the final estimate is the average over multiple estimates solved individually from bootstrap samples. We give theoretical results both for the case in which the true signal $\boldsymbol{x^{\star}}$ is exactly $s$-sparse and for the general case with no assumption on the sparsity level of the ground truth signal. Note that the theorems are stated for a deterministic sensing matrix, measurements, and noise ( ${\boldsymbol{A}},{\boldsymbol{y}},{\boldsymbol{z}}$ ), a finite-dimensional setting in which all vector norms are equivalent.

Theorem 5 (Bagging: Error bound for $\|\boldsymbol{x^{\star}}\|_{0}=s$ ).

Let ${\boldsymbol{y}}={\boldsymbol{A}}\boldsymbol{x^{\star}}+{\boldsymbol{z}}$ with $\|{\boldsymbol{z}}\|_{2}<\infty$ . Assume that, for the multi-sets $\{{\mathcal{I}}_{j}\}$ that generate the sensing matrices ${\boldsymbol{A}}{[{\mathcal{I}}_{1}]},{\boldsymbol{A}}{[{\mathcal{I}}_{2}]},\ldots,{\boldsymbol{A}}{[{\mathcal{I}}_{K}]}$ , there exists a constant $\delta_{(L,K)}$ , depending on $L$ and $K$ , such that for all $j\in\{1,2,\ldots,K\}$ , $\delta_{2s}({\boldsymbol{A}}{[{\mathcal{I}}_{j}]})\leq\delta_{(L,K)}<\sqrt{2}-1$ . Let ${\boldsymbol{x^{B}}}$ be the solution of Bagging. Then for any $\tau>0$ , ${\boldsymbol{x^{B}}}$ satisfies

$$\mathbb{P}\Big\{\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}\leq\mathcal{C}_{1}(\delta_{(L,K)})\Big(\sqrt{\tfrac{L}{m}}\|\boldsymbol{z}\|_{2}+\tau\Big)\Big\}\geq 1-\exp\frac{-2K\tau^{4}}{L^{2}\|\boldsymbol{z}\|_{\infty}^{4}}. \tag{8}$$

We also study the behavior of Bagging for a general signal $\boldsymbol{x^{\star}}$ with $\|\boldsymbol{x^{\star}}\|_{0}\geq s$ , in which the performance involves the $s$-sparse approximation error. We use the vector ${\boldsymbol{e}}$ to denote this error: ${\boldsymbol{e}}=\boldsymbol{x^{\star}}-{\boldsymbol{x}}_{0}$ , where ${\boldsymbol{x}}_{0}$ is the best $s$-sparse approximation of the ground truth signal over all $s$-sparse signals.

Theorem 6 (Bagging: Error bound for general signal recovery).

Let ${\boldsymbol{y}}={\boldsymbol{A}}\boldsymbol{x^{\star}}+{\boldsymbol{z}}$ with $\|{\boldsymbol{z}}\|_{2}<\infty$ . Assume that, for the multi-sets $\{{\mathcal{I}}_{j}\}$ that generate the sensing matrices ${\boldsymbol{A}}{[{\mathcal{I}}_{1}]},{\boldsymbol{A}}{[{\mathcal{I}}_{2}]},\ldots,{\boldsymbol{A}}{[{\mathcal{I}}_{K}]}$ , there exists $\delta_{(L,K)}$ such that for all $j\in\{1,2,\ldots,K\}$ , $\delta_{2s}({\boldsymbol{A}}{[{\mathcal{I}}_{j}]})\leq\delta_{(L,K)}<\sqrt{2}-1$ . Let ${\boldsymbol{x^{B}}}$ be the solution of Bagging. Then for any $\tau>0$ , ${\boldsymbol{x^{B}}}$ satisfies

$$\mathbb{P}\Big\{\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}\leq\mathcal{C}_{0}(\delta_{(L,K)})\,s^{-1/2}\|\boldsymbol{e}\|_{1}+\mathcal{C}_{1}(\delta_{(L,K)})\Big(\sqrt{\tfrac{L}{m}}\|\boldsymbol{z}\|_{2}+\tau\Big)\Big\}\geq 1-\exp\frac{-2K\,\mathcal{C}_{1}^{4}(\delta_{(L,K)})\,\tau^{4}}{(b^{\prime})^{2}}, \tag{9}$$

where $b^{\prime}=({\mathcal{C}_{0}}(\delta_{(L,K)})s^{-1/2}\|{\boldsymbol{e}}\|_{1}+{{\mathcal{C}_{1}}(\delta_{(L,K)})}\sqrt{L}\|{\boldsymbol{z}}\|_{\infty})^{2}$ .

Theorem 6 gives the performance bound for Bagging in sparse signal recovery without the $s$-sparse assumption, and it reduces to Theorem 5 when the $s$-sparse approximation error is zero, $\|{\boldsymbol{e}}\|_{1}=0$ .

We give the proof sketch that demonstrates the key idea to prove both Theorem 5 and Theorem 6. The main tools are Theorem 3 and Lemma 4. Some special treatments are required to deal with terms while proving Theorem 6. For more technical details, full proofs can be found in [17].

Proof Sketch: Similar to the sufficient condition in Theorem 3, the sufficient condition to analyze Bagging is that all matrices resulting from Bagging have well-behaved RIP constants of order $2s$ bounded by a universal constant $\delta$ .

Let ${\mathcal{I}}$ denote a generic multi-set containing $L$ elements, each independent and identically distributed according to the discrete uniform distribution on the sample space $\{1,2,\ldots,m\}$ . Define the squared error function $f(\boldsymbol{x}({\mathcal{I}}))=\|\boldsymbol{x}({\mathcal{I}})-\boldsymbol{x^{\star}}\|^{2}_{2}$ , where $\boldsymbol{x}({\mathcal{I}})$ is the solution of $\ell_{1}$ minimization on ${\mathcal{I}}$ : $\boldsymbol{x}({\mathcal{I}})=\operatorname*{arg\,min}\|{\boldsymbol{x}}\|_{1}\;\text{ s.t. }\|\boldsymbol{y}[{\mathcal{I}}]-\boldsymbol{A}[{\mathcal{I}}]\boldsymbol{x}\|_{2}\leq\epsilon_{\mathcal{I}}$ . The squared errors of the $K$ bootstrapped estimators, $f({\boldsymbol{x}}_{j})=\|{\boldsymbol{x}}^{\boldsymbol{B}}_{j}-\boldsymbol{x^{\star}}\|^{2}_{2},\ j=1,2,\ldots,K$ , are i.i.d. realizations from the distribution of $f(\boldsymbol{x}({\mathcal{I}}))$ .

We proceed with the proof using Lemma 4. We choose the upper bound on the error to be a function of the expected noise power: we pick the bound $\xi$ in relation to the root of the expected squared noise energy, $\sqrt{{\mathbb{E}}\|\boldsymbol{z}[{\mathcal{I}}]\|^{2}_{2}}=\sqrt{\frac{L}{m}}\|{\boldsymbol{z}}\|_{2}$ . We then need the upper bound $b$ and the lower bound $a$ for the random variable $f(\boldsymbol{x}({\mathcal{I}}))$ . Since it is non-negative, we choose $a=0$ . The upper bound $b$ is obtained from Theorem 3, with the maximum value $\|{\boldsymbol{z}}\|_{\infty}$ used to further upper bound the noise level $\|\boldsymbol{z}[{\mathcal{I}}_{j}]\|_{2}$ . Through this process, we obtain the inequality ${\mathbb{P}}\{\sum_{j}\|{\boldsymbol{x}}^{\boldsymbol{B}}_{j}-\boldsymbol{x^{\star}}\|^{2}_{2}-K\xi\leq 0\}\geq g({\mathbb{E}}f({\boldsymbol{x}}),b,a)$ , for some function $g$ .

The Bagging solution is the average of all bootstrapped estimators. The key inequality to establish is as follows:

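$$\begin{aligned}
&\mathbb{P}\Big\{\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}-\xi\leq 0\Big\}\\
&=\mathbb{P}\Big\{K\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}-\sum_{j}f(\boldsymbol{x}_{j})+\sum_{j}f(\boldsymbol{x}_{j})-K\xi\leq 0\Big\}\\
&\geq\mathbb{P}\Big\{K\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}-\sum_{j}f(\boldsymbol{x}_{j})\leq 0,\ \sum_{j}f(\boldsymbol{x}_{j})-K\xi\leq 0\Big\}\\
&=\mathbb{P}\Big\{K\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}-\sum_{j}f(\boldsymbol{x}_{j})\leq 0\Big\}\,\mathbb{P}\Big\{\sum_{j}f(\boldsymbol{x}_{j})-K\xi\leq 0\Big\}\\
&=\mathbb{P}\Big\{\sum_{j}\|\boldsymbol{x}^{\boldsymbol{B}}_{j}-\boldsymbol{x^{\star}}\|_{2}^{2}-K\xi\leq 0\Big\}.
\end{aligned} \tag{10}$$

The first event is independent of the second and holds with probability $1$ by Jensen's inequality. This relates the error bound of the Bagging solution to the sum of squared errors of the bootstrapped estimates. To bound the probability of the second event, we follow the method described in the previous paragraph.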
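The probability-one claim can be checked directly. The short derivation below uses only the definition of the Bagging solution as the average of the $K$ bootstrapped estimators and the convexity of the squared $\ell_{2}$ norm (Jensen's inequality for a finite average); it is a worked restatement of the step and not an addition to the proof in [17].

$$\begin{aligned}
\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}
&=\Big\|\frac{1}{K}\sum_{j=1}^{K}\big(\boldsymbol{x}^{\boldsymbol{B}}_{j}-\boldsymbol{x^{\star}}\big)\Big\|_{2}^{2}\\
&\leq\frac{1}{K}\sum_{j=1}^{K}\|\boldsymbol{x}^{\boldsymbol{B}}_{j}-\boldsymbol{x^{\star}}\|_{2}^{2}
=\frac{1}{K}\sum_{j=1}^{K}f(\boldsymbol{x}_{j}).
\end{aligned}$$

Multiplying both sides by $K$ gives $K\|\boldsymbol{x^{B}}-\boldsymbol{x^{\star}}\|_{2}^{2}-\sum_{j}f(\boldsymbol{x}_{j})\leq 0$ for every realization of the bootstrap samples, so the corresponding probability equals $1$ and only the second factor remains.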

IV-B Parameters Selection Guided by the Theoretical Analysis

Besides analyzing error bounds for general signals whose sparsity levels might exceed $s$ , Theorem 6 can be used to analyze cases in which $m$ is not large enough for the sparsity level $s$ . Theorems 5 and 6 also guide us to good choices of parameters: the bootstrap sampling ratio $L/m$ and the number of estimates $K$ .

Both Theorem 5 and Theorem 6 show that increasing the number of estimates $K$ improves the result, by increasing the lower bound on the probability of achieving the same performance. The growth rate of this certainty bound decreases with $K$ . We validate this in our numerical experiment: even though increasing $K$ improves the results, the performance tends to flatten out for large $K$ .
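The dependence of the certainty bound on $K$ can be tabulated numerically. The sketch below evaluates the right-hand side of the bound in Theorem 6 (which reduces to Theorem 5 when $\|{\boldsymbol{e}}\|_{1}=0$); the function name and all numeric inputs are illustrative assumptions, and in particular the value used for ${\mathcal{C}_{1}}$ is not derived from any actual sensing matrix.

```python
import math


def certainty_lower_bound(K, L, tau, z_inf, c1, c0=0.0, e_l1=0.0, s=1):
    """1 - exp(-2 K C1^4 tau^4 / b'^2), with
    b' = (C0 s^{-1/2} ||e||_1 + C1 sqrt(L) ||z||_inf)^2."""
    b_prime = (c0 * e_l1 / math.sqrt(s) + c1 * math.sqrt(L) * z_inf) ** 2
    return 1.0 - math.exp(-2.0 * K * c1 ** 4 * tau ** 4 / b_prime ** 2)


# Hypothetical values: L = 40 rows per bootstrap sample, ||z||_inf = 0.5, tau = 1.
for K in (30, 50, 100, 200):
    p = certainty_lower_bound(K, L=40, tau=1.0, z_inf=0.5, c1=5.0)
    print(f"K={K:4d}  certainty >= {p:.3f}")
```

With $\|{\boldsymbol{e}}\|_{1}=0$ the constant ${\mathcal{C}_{1}}$ cancels and the expression equals $1-\exp(-2K\tau^{4}/(L^{2}\|{\boldsymbol{z}}\|_{\infty}^{4}))$, which increases with $K$ at a decreasing rate.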

The sampling ratio $L/m$ affects the result through two factors. The first is the RIP constant, which in general decreases with increasing $L$ (proved in [18] under a Gaussian assumption on the sensing matrix). Since ${\mathcal{C}_{1}}(\delta)$ is a non-decreasing function of $\delta$ and a larger $L$ usually results in a smaller $\delta$ , a larger $L$ in general results in a smaller ${\mathcal{C}_{1}}(\delta)$ . The second factor is the multiplier of the noise power term, $\sqrt{L/m}$ , which favors a smaller $L$ .

Combining these two factors indicates that the best $L/m$ ratio lies somewhere between the two extremes. In the experimental results, we demonstrate that when $m$ is small, varying the bootstrap sampling ratio $L/m$ from $0$ to $1$ produces a performance peak at some $L/m<1$ . The first factor, which relates $L$ to the RIP constant, dominates in the stable case (when $m$ is sufficiently large), so that a larger $L$ leads to better performance.

V Simulations

In this section, we perform sparse recovery on simulated data to study the performance of our algorithm. In our experiment, all entries of ${\boldsymbol{A}}\in\mathbb{R}^{m\times n}$ are i.i.d. samples from the standard normal distribution $\mathcal{N}(0,1)$ . The signal dimension is $n=200$ , and various numbers of measurements $m$ from $50$ to $2000$ are explored. The ground truth signals all have sparsity level $s=50$ , with non-zero entries sampled from the standard Gaussian and locations generated uniformly at random. For the noise processes ${\boldsymbol{z}}$ , entries are sampled i.i.d. from $\mathcal{N}(0,\sigma^{2})$ , with variance $\sigma^{2}=10^{-\text{SNR}/10}\|{\boldsymbol{A}}{\boldsymbol{x}}\|_{2}^{2}$ , where SNR represents the Signal to Noise Ratio. We add white Gaussian noise to make the $\text{SNR}=0$ dB. All numerical realizations have finite values. We use the ADMM [12] implementation of Lasso to solve all sparse regression problems, in which the parameter $\lambda_{(L,K)}$ balances the least squares fit and the sparsity penalty for the case with $(L,K)$ as parameters.
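A data generator along these lines might look as follows. This is a sketch under one explicit assumption: the noise vector is rescaled so that the realized measurement SNR, $10\log_{10}(\|{\boldsymbol{A}}{\boldsymbol{x}}\|_{2}^{2}/\|{\boldsymbol{z}}\|_{2}^{2})$, matches the target exactly, which is one reading of "add white Gaussian noise to make the SNR $=0$ dB" and may differ from the authors' exact procedure. The function name `make_problem` is hypothetical.

```python
import numpy as np


def make_problem(m, n, s, snr_db, rng):
    """Gaussian sensing matrix, s-sparse Gaussian signal, and white noise
    rescaled so the realized measurement SNR equals snr_db (assumption)."""
    A = rng.standard_normal((m, n))
    x_true = np.zeros(n)
    support = rng.choice(n, size=s, replace=False)  # uniformly random locations
    x_true[support] = rng.standard_normal(s)
    signal = A @ x_true
    z = rng.standard_normal(m)
    z *= np.linalg.norm(signal) / np.linalg.norm(z) * 10.0 ** (-snr_db / 20.0)
    return A, x_true, signal + z, z


rng = np.random.default_rng(0)
A, x_true, y, z = make_problem(m=50, n=200, s=50, snr_db=0.0, rng=rng)
realized = 10.0 * np.log10(np.sum((A @ x_true) ** 2) / np.sum(z ** 2))
print(A.shape, np.count_nonzero(x_true), round(float(realized), 6))
```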

We study how the bootstrap sampling ratio $L/m$ and the number of estimates $K$ affect the result. In our experiment, we take $K=30,50,100$ and $L/m$ from $0.1$ to $1$ . We report the recovered Signal to Noise Ratio (SNR) as the quality measure for recovery: $\text{SNR}({\boldsymbol{x}},\boldsymbol{x^{\star}})=10\log_{10}\big(\|\boldsymbol{x^{\star}}\|_{2}^{2}/\|{\boldsymbol{x}}-\boldsymbol{x^{\star}}\|_{2}^{2}\big)$ , averaged over $20$ independent trials, so that larger values indicate better recovery. For all algorithms, we evaluate $\lambda_{(L,K)}$ at different values from $0.01$ to $200$ and then select the value that gives the maximum averaged SNR over all trials.
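The recovery metric is simple to compute. The sketch below assumes the convention of signal energy over error energy, so that larger values mean better recovery, which is consistent with the positive dB values in Table I and with selecting the parameter that maximizes SNR; the function names are hypothetical.

```python
import numpy as np


def recovered_snr_db(x_hat, x_true):
    """10 log10(||x_true||^2 / ||x_hat - x_true||^2); larger is better (assumed convention)."""
    error_energy = np.sum((x_hat - x_true) ** 2)
    signal_energy = np.sum(x_true ** 2)
    return float(10.0 * np.log10(signal_energy / error_energy))


def average_snr_db(estimates, truths):
    """Mean recovered SNR over paired (estimate, ground truth) trials."""
    return float(np.mean([recovered_snr_db(x, t) for x, t in zip(estimates, truths)]))


x_true = np.array([1.0, 0.0, -2.0])
print(recovered_snr_db(np.zeros(3), x_true))   # the all-zero estimate scores 0 dB
print(recovered_snr_db(0.9 * x_true, x_true))  # 10% amplitude error, about 20 dB
```

Under this convention the all-zero estimate scores exactly $0$ dB, so the sub-$1$ dB entries in Table I correspond to estimates only slightly better than returning zero.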

V-A Performance of Bagging, Bolasso and $\ell_{1}$ minimization

Bagging and Bolasso with various parameters $K,L$ are studied alongside $\ell_{1}$ minimization. The results are plotted in Figure 1. The colored curves show the cases of Bagging with various numbers of estimates $K$ . The intersections of the colored curves and the purple solid vertical lines at $L/m=1$ illustrate conventional Bagging with a full bootstrap rate. The grey circle highlights the best performance and the grey area highlights the optimal bootstrap ratio $L/m$ . The performance of $\ell_{1}$ minimization is depicted by the black dashed lines, while the best Bolasso performance is plotted using light green dashed lines. In those figures, for each condition with a choice of $L,K$ , the information available to the Bagging and Bolasso algorithms is identical, and $\ell_{1}$ minimization always has access to all $m$ measurements.

From Figure 1, we see that when $m$ is small, Bagging can outperform $\ell_{1}$ minimization. As $m$ decreases, the margin increases. The important observation is that when the number of measurements is low ( $m$ between $s$ and $2s$ , that is $50-100$ , where $s$ is the sparsity level), using a reduced bootstrap ratio $L/m$ ( $60\%-90\%$ ) lets Bagging beat the conventional choice of the full ratio $1$ for all choices of $K$ . Also, with a reduced ratio and a small $K$ , our algorithm is already quite robust and outperforms $\ell_{1}$ minimization by a large margin. When the number of measurements is moderate, $m=3s=150$ , Bagging still beats the baseline; however, the optimal parameters here are the bootstrap ratio $L/m=1$ and the number of estimates $K=100$ . In this case, the reduced bootstrap ratio does not bring any performance improvement. Increasing the number of measurements makes the base algorithm more stable, and the advantage of Bagging starts to decay.

We perform the same experiments with larger numbers of measurements $m$ , and Table I reports the best performance for the various schemes: $\ell_{1}$ minimization, the original Bagging scheme with a full bootstrap ratio, Bagging, and Bolasso, all with $\text{SNR}=0$ dB. For Bagging, the peak values are found among the different choices of parameters $K$ and $L$ that we explored. We see that when the number of measurements $m$ is small ( $50-100$ ), Bagging outperforms $\ell_{1}$ minimization. The reduced bootstrap rate also improves on conventional Bagging, and the improvement is significant: $24\%$ in SNR when $m=50$ . When $m$ is moderate ( $125-200$ ), choosing reduced rates does not improve the performance compared to conventional Bagging. Bagging still outperforms $\ell_{1}$ minimization, with smaller margins than in the cases with small $m$ . When $m$ is large ( $\geq 500$ ), Bagging starts losing its advantage over $\ell_{1}$ minimization. Bolasso only performs similarly to the other algorithms in the easiest case, for an extremely large $m$ ( $=2000$ ), where it slightly outperforms all other algorithms.
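The quoted percentages follow from the $m=50$ column of Table I. The short calculation below reproduces them as relative gains computed on the reported dB values, which is how the figures appear to be derived; the variable names are chosen for this sketch.

```python
def relative_gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Recovered SNR (dB) at m = 50, taken from Table I.
l1_min = 0.12
conventional_bagging = 0.45
reduced_ratio_bagging = 0.56

gain_over_conventional = relative_gain_percent(reduced_ratio_bagging, conventional_bagging)
gain_over_l1 = relative_gain_percent(reduced_ratio_bagging, l1_min)
print(round(gain_over_conventional), round(gain_over_l1))  # 24 367
```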

VI Conclusion

We extend the conventional Bagging scheme in sparse recovery with the bootstrap sampling ratio $L/m$ as an adjustable parameter and derive error bounds for the algorithm associated with $L/m$ and the number of estimates $K$ . Bagging is particularly powerful when the number of measurements $m$ is small. Although this condition is notoriously difficult, both in terms of improving sparse recovery results and obtaining tight bounds on theoretical properties, Bagging outperforms $\ell_{1}$ minimization by a large margin (up to $367\%$ ). Moreover, the reduced sampling rate improves the recovered SNR over the conventional Bagging algorithm by up to $24\%$ .

Our Bagging scheme achieves acceptable performance even with a small $L/m$ (around $0.6$ ) and a relatively small $K$ (around $30$ in our experimental study). The error bounds for Bagging predict that a smaller sampling rate $L/m$ can lead to performance improvement and that increasing $K$ improves the certainty of the bound. Both are validated in our numerical simulation. For a sequential system, a moderate $K$ (around $30$ ) is enough to obtain a fairly good solution. For a parallel system that allows a large number of processes to run at the same time, a large $K$ is preferred since it in general gives a better result.

VII Acknowledgement

We would like to thank Dr. Dror Baron for insightful comments and suggestions, Dr. Cindy Rush for thoughtful feedback, and Nicholas Huang for help in polishing the paper, all of which improved its overall quality.

Bibliography

[1] S. Chen, D. L. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
[2] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. of the Royal Stat. Society. Series B, pages 267–288, 1996.
[3] E. J. Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
[4] E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Info. Theory, 52(2):489–509, 2006.
[5] D. L. Donoho. Compressed sensing. IEEE Trans. on Info. Theory, 52(4):1289–1306, 2006.
[6] E. Candes and J. Romberg. Sparsity and incoherence in compressive sampling. Inverse Prob., 23(3):969, 2007.
[7] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[8] P. Hall and R. J. Samworth. Properties of bagged nearest neighbour classifiers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):363–379, 2005.
