Nonasymptotic estimation and support recovery for high dimensional sparse covariance matrices
Adam B Kashlak, Linglong Kong

TL;DR
This paper introduces a flexible, nonasymptotic framework for estimating high-dimensional sparse covariance matrices using concentration inequalities, improving support recovery and outperforming existing methods in simulations.
Contribution
It develops a general, distribution-agnostic approach for covariance estimation with confidence sets, extending thresholding techniques and optimizing support recovery.
Findings
Superior performance in simulations compared to existing methods
Effective support recovery with controlled false positive rate
Applicable to a wide range of estimators and distributional assumptions
Abstract
We propose a general framework for nonasymptotic covariance matrix estimation making use of concentration inequality-based confidence sets. We specify this framework for the estimation of large sparse covariance matrices through incorporation of past thresholding estimators with key emphasis on support recovery. This technique goes beyond past results for thresholding estimators by allowing for a wide range of distributional assumptions beyond merely sub-Gaussian tails. This methodology can furthermore be adapted to a wide range of other estimators and settings. The usage of nonasymptotic dimension-free confidence sets yields good theoretical performance. Through extensive simulations, it is demonstrated to have superior performance when compared with other such methods. In the context of support recovery, we are able to specify a false positive rate and optimize to maximize the true…
| Method | False Positive % (dimension 50) | False Positive % (dimension 100) | False Positive % (dimension 200) | False Positive % (dimension 500) | True Positive % (dimension 50) | True Positive % (dimension 100) | True Positive % (dimension 200) | True Positive % (dimension 500) |
|---|---|---|---|---|---|---|---|---|
| CoM 1% | 0.0 | 0.1 | 0.3 | 1.0 | 0.0 | 7.7 | 20.7 | 32.0 |
| CoM 5% | 1.0 | 2.2 | 3.5 | 4.7 | 33.1 | 42.9 | 51.5 | 56.0 |
| PDS | 3.4 | 3.4 | 3.4 | 3.4 | 50.0 | 50.0 | 51.5 | 50.6 |
| Hard | 0.0 | 0.0 | 0.0 | 0.0 | 0.3 | 0.0 | 0.0 | 0.0 |
| Soft | 2.0 | 0.7 | 0.2 | 0.0 | 38.5 | 25.4 | 16.2 | 7.5 |
| SCAD | 2.1 | 0.7 | 0.3 | 0.0 | 39.0 | 26.0 | 16.4 | 7.5 |
| Adpt | 0.3 | 0.1 | 0.0 | 0.0 | 17.4 | 10.0 | 5.8 | 2.0 |
| Method | False Positive % (dimension 50) | False Positive % (dimension 100) | False Positive % (dimension 200) | False Positive % (dimension 500) | True Positive % (dimension 50) | True Positive % (dimension 100) | True Positive % (dimension 200) | True Positive % (dimension 500) |
|---|---|---|---|---|---|---|---|---|
| CoM 1% | 0.2 | 0.4 | 0.7 | 1.1 | 4.5 | 9.2 | 13.0 | 17.2 |
| CoM 5% | 2.2 | 3.3 | 4.1 | 4.7 | 22.8 | 29.3 | 32.1 | 34.1 |
| PDS | 12.4 | 12.7 | 12.2 | 12.2 | 51.0 | 51.5 | 51.0 | 51.2 |
| Hard | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Soft | 1.2 | 0.4 | 0.2 | 0.0 | 11.3 | 1.8 | 0.0 | 0.0 |
| SCAD | 0.8 | 0.3 | 0.2 | 0.0 | 8.6 | 0.0 | 0.0 | 0.0 |
| Adpt | 0.2 | 0.1 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| non-zero (%) | CoM 10% | CoM 5% | CoM 1% | PDS | Hard | Soft | SCAD | Adpt |
|---|---|---|---|---|---|---|---|---|
| Informative | 30.3% | 25.6% | 8.5% | 47.3% | 6.0% | 24.7% | 21.3% | 9.9% |
| Uninformative | 5.4% | 2.7% | 0.4% | 15.6% | 0.3% | 2.3% | 1.8% | 0.7% |
| Method | False Positive % (dimension 50) | False Positive % (dimension 100) | False Positive % (dimension 200) | False Positive % (dimension 500) | True Positive % (dimension 50) | True Positive % (dimension 100) | True Positive % (dimension 200) | True Positive % (dimension 500) |
|---|---|---|---|---|---|---|---|---|
| CoM 1% | 0.0 | 0.1 | 0.3 | 0.9 | 0.0 | 7.2 | 17.5 | 30.7 |
| CoM 5% | 0.9 | 1.9 | 3.2 | 4.4 | 28.9 | 41.1 | 49.0 | 54.5 |
| PDS | 3.4 | 3.4 | 3.4 | 3.4 | 50.2 | 50.3 | 50.8 | 50.6 |
| Hard | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
| Soft | 2.4 | 0.8 | 0.2 | 0.0 | 44.2 | 29.3 | 18.1 | 9.5 |
| SCAD | 1.8 | 1.0 | 0.5 | 0.1 | 40.3 | 33.3 | 24.1 | 13.1 |
| Adpt | 0.2 | 0.1 | 0.1 | 0.0 | 16.5 | 9.4 | 5.5 | 2.5 |
MA Matrix

| Dimension | Empirical | Diagonal | Hard | Soft | SCAD | LASSO |
|---|---|---|---|---|---|---|
| 30 | 1.32 (0.14) | 0.72 (0.04) | 0.72 (0.09) | 0.70 (0.06) | 0.63 (0.07) | 0.64 (0.07) |
| 100 | 3.03 (0.19) | 0.77 (0.04) | 0.87 (0.10) | 0.85 (0.04) | 0.73 (0.06) | 0.77 (0.06) |
| 200 | 4.92 (0.21) | 0.79 (0.04) | 0.95 (0.11) | 0.91 (0.03) | 0.79 (0.06) | 0.82 (0.05) |
| 500 | 9.73 (0.25) | 0.83 (0.05) | 1.06 (0.11) | 0.98 (0.02) | 0.88 (0.06) | 0.88 (0.05) |

AR Matrix

| Dimension | Empirical | Diagonal | Hard | Soft | SCAD | LASSO |
|---|---|---|---|---|---|---|
| 30 | 1.33 (0.16) | 0.90 (0.04) | 0.79 (0.10) | 0.83 (0.07) | 0.74 (0.09) | 0.76 (0.09) |
| 100 | 3.06 (0.21) | 0.94 (0.03) | 0.95 (0.08) | 1.02 (0.04) | 0.86 (0.05) | 0.92 (0.05) |
| 200 | 4.99 (0.21) | 0.95 (0.02) | 1.00 (0.09) | 1.09 (0.03) | 0.92 (0.04) | 0.97 (0.03) |
| 500 | 9.80 (0.26) | 0.97 (0.02) | 1.04 (0.08) | 1.16 (0.02) | 0.99 (0.04) | 1.04 (0.03) |
MA Matrix

| Dimension | Empirical | Diagonal | Hard | Soft | SCAD | LASSO |
|---|---|---|---|---|---|---|
| 30 | 2.23 (0.55) | 0.86 (0.12) | 0.94 (0.23) | 0.93 (0.11) | 0.91 (0.22) | 0.90 (0.16) |
| 100 | 6.18 (1.50) | 0.93 (0.13) | 1.17 (0.33) | 1.17 (0.25) | 1.31 (0.35) | 1.18 (0.29) |
| 200 | 11.41 (2.67) | 0.98 (0.13) | 1.41 (0.39) | 1.33 (0.32) | 1.56 (0.43) | 1.36 (0.36) |
| 500 | 26.30 (5.68) | 1.07 (0.17) | 2.01 (0.81) | 1.82 (0.61) | 2.24 (0.73) | 1.89 (0.64) |

AR Matrix

| Dimension | Empirical | Diagonal | Hard | Soft | SCAD | LASSO |
|---|---|---|---|---|---|---|
| 30 | 2.34 (0.51) | 0.96 (0.09) | 1.03 (0.20) | 1.05 (0.09) | 1.00 (0.20) | 1.02 (0.15) |
| 100 | 6.22 (1.48) | 1.05 (0.09) | 1.27 (0.26) | 1.28 (0.17) | 1.34 (0.30) | 1.25 (0.19) |
| 200 | 11.44 (2.38) | 1.05 (0.12) | 1.44 (0.40) | 1.42 (0.25) | 1.59 (0.38) | 1.40 (0.31) |
| 500 | 26.69 (5.10) | 1.09 (0.10) | 1.95 (0.77) | 1.82 (0.65) | 2.19 (0.77) | 1.82 (0.65) |
Nonasymptotic Estimation and Support Recovery
for High Dimensional Sparse Covariance Matrices
Adam B Kashlak
Linglong Kong
Department of Mathematical and Statistical Sciences
University of Alberta
Edmonton, AB, Canada T6G 2G1
Abstract
We propose a general framework for nonasymptotic covariance matrix estimation making use of concentration inequality-based confidence sets. We specify this framework for the estimation of large sparse covariance matrices through incorporation of past thresholding estimators with key emphasis on support recovery. This technique goes beyond past results for thresholding estimators as we have distribution free control over the false positive rate being the number of entries incorrectly included in the estimator’s support. In the context of support recovery, we are able to specify a false positive rate and optimize to maximize the true recoveries. This methodology guarantees exact support recovery in the case of strongly log concave data and maintains good performance in more general distributional settings. The usage of nonasymptotic dimension-free confidence sets yields good theoretical performance. Through extensive simulations, it is demonstrated to have superior performance when compared with other such methods.
Key words and phrases: Concentration Inequality, Confidence Region, Log Concave Measure, Random Matrix, Schatten Norm, Sub-Exponential Measure
1 Introduction
Covariance matrices and accurate estimators of such objects are of critical importance in statistics. Various standard techniques including principal components analysis and linear and quadratic discriminant analysis rely on an accurate estimate of the covariance structure of the data. Applications can range from genetics and medical imaging data to climate and other types of data. Furthermore, in the era of high dimensional data, classical asymptotic estimators perform poorly in applications (Stein, 1975; Johnstone, 2001). Hence, we propose a general methodology for nonasymptotic covariance matrix estimation making use of confidence balls constructed from concentration inequalities. While this is a general framework with many potential applications, we specifically consider the use of thresholding estimators for sparse covariance matrices with a view towards support recovery—that is, determining which variable pairs are correlated.
Many estimators for the covariance matrix have been proposed working under the assumption of sparsity (Pourahmadi, 2011), which is, in a qualitative sense, the case when most of the off-diagonal entries are zero or negligible. Beyond mere theoretical interest, the assumption of sparsity is widely applicable to real data analysis as the practitioner may believe that many of the variable pairings will be uncorrelated. Thus, it is desirable to tailor covariance estimation procedures given this assumption of sparsity.
Sparsity in the simplest sense implies some bound on the number of non-zero entries in the columns of a covariance matrix. Thus, given a $\Sigma\in\mathbb{R}^{d\times d}$ with entries $\sigma_{i,j}$ for $i,j=1,\ldots,d$, there exists some constant $k>0$ such that $\max_{j}\sum_{i=1}^{d}\mathbf{1}[\sigma_{i,j}\neq 0]\le k$. This can be generalized to “approximate sparsity” as in Rothman et al. (2009) by $\max_{j}\sum_{i=1}^{d}|\sigma_{i,j}|^{q}\le k$ for some $q\in[0,1)$. Furthermore, Cai and Liu (2011) define a broader approximately sparse class by bounding weighted column sums of $\Sigma$. In El Karoui (2008), a similar notion referred to as “$\beta$-sparsity” is defined. Such classes of sparse covariance matrices allow for good theoretical performance of estimators.
One class of estimators consists of shrinkage estimators that follow a James-Stein approach by shrinking estimated eigenvalues, eigenvectors, or the matrix itself towards some desired target (Haff, 1980; Dey and Srinivasan, 1985; Daniels and Kass, 1999, 2001; Ledoit and Wolf, 2004; Hoff, 2009; Johnstone and Lu, 2012). Another class of sparse estimators are those that regularize the estimate with lasso-style penalties (Rothman, 2012; Bien and Tibshirani, 2011). Yet another class consists of thresholding estimators, which declare the covariance between two variables to be zero if the estimated value is smaller than some threshold (Bickel and Levina, 2008a, b; Rothman et al., 2009; Cai and Liu, 2011). Beyond these, there are other methods such as banding and tapering, which apply only when the variables are ordered or a notion of proximity exists, as for spatial, time series, or longitudinal data. As we will not assume such an ordering and strive to construct a methodology that is permutation invariant with respect to the variables, these approaches will not be considered. Lastly, there has also been substantial work on the estimation of the precision or inverse covariance matrix. While our approach could plausibly be adapted to this setting, it will not be considered in this article and is reserved for future research.
In this article, we propose a novel approach to the estimation of sparse covariance matrices making use of concentration inequality based confidence sets such as those constructed in Kashlak et al. (2018) for the functional data setting. In short, consider a sample of real vector valued data $X_1,\ldots,X_n\in\mathbb{R}^{d}$ with mean zero and unknown $d\times d$ covariance matrix $\Sigma$. Concentration inequalities are used to construct a non-asymptotic confidence set for $\Sigma$ about the empirical estimate $\hat\Sigma^{\mathrm{emp}}=n^{-1}\sum_{i=1}^{n}(X_i-\bar X)(X_i-\bar X)^{\mathrm T}$ of the unknown covariance matrix, where $\bar X$ is the sample mean. While it has been noted (for example, see Cai and Liu (2011)) that $\hat\Sigma^{\mathrm{emp}}$ may be a poor estimator when the dimension $d$ is large and $\Sigma$ is sparse, the confidence set is still valid given a desired coverage of $1-\alpha$. To construct a better estimator, we propose to search this confidence set for an estimator which optimizes some sparsity criterion to be concretely defined later. This estimation method adapts to the uncertainty of $\hat\Sigma^{\mathrm{emp}}$ in the high dimensional setting, $d\gg n$, by widening the confidence set and thus allowing our sparse estimator to lie far away from the empirical estimate. Furthermore, given some distributional assumptions, the concentration inequalities provide us with non-asymptotic dimension-free confidence sets allowing for very desirable convergence results.
Many established methods for sparse estimation make use of a regularization or penalization term incorporated to enforce sparsity (Rothman, 2012; Bien and Tibshirani, 2011). In some sense, our proposed method can be considered to be in this class of estimators. However, we do not enforce sparsity via some lasso-style penalization term, but enforce it by
i. choosing a desired false positive rate, $\rho$, for the support recovery,
ii. using that rate to construct a confidence ball about the empirical estimator, and
iii. searching that ball for a sparse estimator.
The larger our $(1-\alpha)$-confidence set is, the sparser our estimator is allowed to be. Thus, the radius of our confidence balls acts like a regularization parameter allowing for greater sparsity as it increases. A major contribution of this work is developing a method with the ability to avoid costly cross-validation of the tuning parameter and maintain strong finite sample performance. The specific focus as discussed below and in the supplementary material is accurate support recovery, which is the identification of the non-zero entries in the covariance matrix. Our methodology allows for fixing a false positive rate (the percentage of zero entries incorrectly said to be non-zero) and optimizing over the true positive rate (the percentage of correctly identified non-zero entries). Furthermore, our estimation technique implements a binary search procedure resulting in a highly efficient algorithm especially when compared to the more laborious optimization required by lasso penalization.
In Section 2, the general estimation procedure is outlined, and it is specified for tuning threshold estimators with concentration methods. Section 3 discusses our approach to fixing a certain false positive rate when attempting to recover the support of the covariance matrix. In Section 4, three different types of concentration are considered for specifically log concave measures, sub-exponential distributions, and bounded random variables. Lastly, Section 5 details comprehensive simulations comparing our concentration approach to sparse estimation to standard techniques such as thresholding and penalization. Beyond simulation experiments, a real data set of gene expressions for small round blue cell tumours from the study of Khan et al. (2001) is considered.
1.1 Notation and Definitions
We will make use of both a $(1-\alpha)$-confidence set and a false positive rate $\rho$. For the former, we have the usual definition that some data dependent set $C_{1-\alpha}$ is a $(1-\alpha)$-confidence set for $\Sigma$ if $P(\Sigma\in C_{1-\alpha})\ge 1-\alpha$. For an estimator $\hat\Sigma$ of $\Sigma$ in $\mathbb{R}^{d\times d}$, we have to decide which of the off-diagonal entries are non-zero. The false positive rate $\rho$ is the probability that we incorrectly decide that a given entry is non-zero.
When defining a Banach space of matrices, there are many matrix norms that can be considered. In the article, the main norms of interest are the -Schatten norms, which will be denoted and are defined as follows.
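Definition 1.1 ($p$-Schatten Norm).
For an arbitrary matrix $\Sigma\in\mathbb{R}^{k\times l}$ and $p\in[1,\infty)$, the $p$-Schatten norm is $\|\Sigma\|_{p}^{p}=\mathrm{tr}\big((\Sigma^{\mathrm T}\Sigma)^{p/2}\big)=\|\nu\|_{\ell^{p}}^{p}=\sum_{i=1}^{\min\{k,l\}}\nu_i^{p}$, where $\nu=(\nu_1,\ldots,\nu_{\min\{k,l\}})$ is the vector of singular values of $\Sigma$ and where $\|\cdot\|_{\ell^{p}}$ is the standard $\ell^{p}$ norm in $\mathbb{R}^{\min\{k,l\}}$. In the covariance matrix case where $\Sigma\in\mathbb{R}^{d\times d}$ is symmetric and positive semi-definite, $\|\Sigma\|_{p}^{p}=\mathrm{tr}(\Sigma^{p})=\|\lambda\|_{\ell^{p}}^{p}$, where $\lambda$ is the vector of eigenvalues of $\Sigma$. The $1$-Schatten norm is referred to as the trace norm and the $2$-Schatten norm as the Hilbert-Schmidt or Frobenius norm.
For $p=\infty$, we have the usual operator norm $\|\Sigma\|_{\infty}$ for $\Sigma$ with respect to the $\ell^{2}$ norm, which is similarly the maximal eigenvalue when $\Sigma$ is symmetric positive semi-definite.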
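As a small numerical illustration of the Schatten norms just defined, the following sketch computes them from the singular values with NumPy. The function name `schatten_norm` and the example matrix are illustrative choices, not taken from the paper.

```python
import numpy as np

def schatten_norm(mat, p):
    """p-Schatten norm: the l^p norm of the singular values.

    p = np.inf returns the largest singular value, i.e. the operator norm.
    """
    sing = np.linalg.svd(np.asarray(mat, dtype=float), compute_uv=False)
    if np.isinf(p):
        return float(sing.max())
    return float((sing ** p).sum() ** (1.0 / p))

# Symmetric positive definite example with eigenvalues 1 and 3.
sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
trace_norm = schatten_norm(sigma, 1)       # sum of eigenvalues, equals the trace
frobenius = schatten_norm(sigma, 2)        # Hilbert-Schmidt / Frobenius norm
operator = schatten_norm(sigma, np.inf)    # maximal eigenvalue
```

For a symmetric positive semi-definite matrix the singular values coincide with the eigenvalues, which is why the three values above reduce to the trace, the Frobenius norm, and the largest eigenvalue.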
The definition of the -Schatten norm involves taking the square root of a symmetric matrix. In general, a matrix square root is only unique up to unitary transformations. However, for symmetric positive semi-definite matrices, we will only require the unique symmetric positive semi-definite square root defined as follows.
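Definition 1.2 (Matrix Square Root).
Let $\Sigma\in\mathbb{R}^{d\times d}$ be a symmetric positive semi-definite matrix with eigen-decomposition $\Sigma=UDU^{\mathrm T}$, where $U$ is the orthonormal matrix of eigenvectors and $D$ is the diagonal matrix of eigenvalues, $(\lambda_1,\ldots,\lambda_d)$. Then, $\Sigma^{1/2}=UD^{1/2}U^{\mathrm T}$, where $D^{1/2}$ is the diagonal matrix with entries $(\lambda_1^{1/2},\ldots,\lambda_d^{1/2})$.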
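The square root in Definition 1.2 can be sketched directly from an eigen-decomposition. The helper name `psd_sqrt` is hypothetical, and the clipping of tiny negative eigenvalues is a numerical safeguard added here rather than part of the definition.

```python
import numpy as np

def psd_sqrt(sigma):
    """Symmetric positive semi-definite square root via the eigen-decomposition."""
    vals, vecs = np.linalg.eigh(sigma)
    vals = np.clip(vals, 0.0, None)        # guard against round-off below zero
    # Scale each eigenvector column by the root of its eigenvalue, then rotate back.
    return (vecs * np.sqrt(vals)) @ vecs.T

sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
root = psd_sqrt(sigma)
```

Multiplying `root` by itself should reproduce `sigma` up to floating point error, and `root` is itself symmetric.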
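Another family of norms that will be used is the collection of entrywise matrix norms denoted $\|\cdot\|_{p,q}$, which are written in terms of the $\ell^{p}$ norms of the entries.
Definition 1.3 ($(p,q)$-Entrywise norm).
For an arbitrary matrix $\Sigma\in\mathbb{R}^{k\times l}$ with entries $\sigma_{i,j}$ and $p,q\in[1,\infty)$, the $(p,q)$-entrywise norm is $\|\Sigma\|_{p,q}=\big(\sum_{j=1}^{l}\big(\sum_{i=1}^{k}|\sigma_{i,j}|^{p}\big)^{q/p}\big)^{1/q}$ with the usual modification in the case that $p=\infty$ and/or $q=\infty$. When $p=q$, these are the $\ell^{p}$ norms of a given matrix treated as a vector in $\mathbb{R}^{kl}$. Note that the 2-Schatten norm coincides with the $(2,2)$-entrywise norm.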
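A minimal sketch of an entrywise norm follows, assuming the common convention in which the inner $\ell^p$ norm is taken over each column and the outer $\ell^q$ norm over the resulting column norms; the function name `entrywise_norm` is illustrative. It also checks the stated coincidence of the $(2,2)$-entrywise norm with the 2-Schatten norm.

```python
import numpy as np

def entrywise_norm(mat, p, q):
    """(p, q)-entrywise norm for finite p and q (column-wise convention assumed)."""
    col_norms = (np.abs(mat) ** p).sum(axis=0) ** (1.0 / p)  # l^p norm of each column
    return float((col_norms ** q).sum() ** (1.0 / q))        # l^q norm across columns

mat = np.array([[1.0, -2.0], [3.0, 4.0]])
norm_22 = entrywise_norm(mat, 2, 2)
norm_11 = entrywise_norm(mat, 1, 1)

# 2-Schatten norm from the singular values, for comparison with norm_22.
schatten_2 = float(np.sqrt((np.linalg.svd(mat, compute_uv=False) ** 2).sum()))
```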
1.2 Main contributions and connections to past work
The main contribution of this work is the construction of a general framework for tuning threshold estimators for support recovery and estimation of sparse covariance matrices. It offers finite sample guarantees and a much faster compute time than computationally expensive optimization and cross validation methods.
Past work on thresholding estimators for sparse covariance estimation began with solely considering Gaussian data and then extending to sub-Gaussian tails (Bickel and Levina, 2008a, b; Rothman et al., 2009). The more recent work of Cai and Liu (2011) also provides theoretical results for sub-Gaussian data as well as certain polynomial-type tails. However, only Gaussian data is considered in their numerical simulations. In this article, we consider strongly log-concave, heavier tailed sub-exponential, and bounded data. While bounded data is, in fact, sub-Gaussian, the concentration behaviour of such data may be dependent on the dimension of the space compared to the much better behaved strongly log concave measures that also exhibit sub-Gaussian concentration.
The principal focus of this work is to use non-asymptotic concentration inequalities to guarantee finite sample performance. Past articles are focused on proximity of their estimator to truth in operator norm as the main metric of success due to convergence in operator norm implying convergence of the eigenvalues and eigenvectors. While asymptotically, such methods have elegant theoretical convergence properties, for finite samples one can achieve better performance in operator norm distance by simply choosing the empirical diagonal matrix as an estimator—that is, the empirical estimator with off-diagonal entries set to zero. In the supplementary material, we rerun some of the numerical simulations from Rothman et al. (2009) and demonstrate that for Gaussian data the empirical diagonal matrix achieves better performance than all of the universal threshold estimators for data in for a sample of size . For sub-exponential data—albeit outside of the scope of their—the empirical diagonal matrix dominated all threshold estimators in operator norm distance even when . We thus strongly argue that the main metric of success for such sparse estimators is support recovery of the true covariance matrix.
The main theoretical results of this work are Theorem 3.1, which establishes how to fix a false positive rate for threshold estimators devoid of any distributional assumptions, and Theorem 4.2, which demonstrates support recovery—both zero and non-zero entries—in the specific case that the data has a strongly log concave measure. In the case that the data is instead sub-exponential or bounded, we do not achieve a similar limit theorem, but are still able to achieve good performance in numerical simulations. Of independent interest is Lemma 3.4, which establishes a symmetrization result for sparse random matrices making use of the techniques in Latała (2005).
2 Sparse Estimation Procedure
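Let $X_1,\ldots,X_n\in\mathbb{R}^{d}$ be a sample of independent and identically distributed mean zero random vectors with unknown $d\times d$ covariance matrix $\Sigma$. Define the empirical estimate of $\Sigma$ to be $\hat\Sigma^{\mathrm{emp}}=n^{-1}\sum_{i=1}^{n}(X_i-\bar X)(X_i-\bar X)^{\mathrm T}$ where $\bar X=n^{-1}\sum_{i=1}^{n}X_i$. The goal of the following procedure is to construct a sparse estimator, $\hat\Sigma^{\mathrm{sp}}$, for $\Sigma$ by first constructing a non-asymptotic confidence set for $\Sigma$ centred on $\hat\Sigma^{\mathrm{emp}}$ and then searching this set for the sparsest member. A search method using threshold estimators is outlined in Section 2.2.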
The methodology is as follows:
i. Choose a suitable false positive rate $\rho$, which will typically be close to zero.
ii. Use Theorem 3.1 to determine the radius $r$ of a ball centred at $\hat\Sigma^{\mathrm{emp}}$ such that the sparsest matrices in that ball have false positive rate $\rho$.
iii. Use the binary search algorithm in Section 2.2 to identify the sparsest element in the above ball, denoted $\hat\Sigma^{\mathrm{sp}}$.
iv. Considering this ball as a $(1-\alpha)$-confidence set, use the concentration properties of the data to control the true positive rate.
Note that we will in practice normalize $\hat\Sigma^{\mathrm{emp}}$ to have unit diagonal in order to consistently recover the support.
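The starting point of the procedure, the empirical covariance estimate rescaled to unit diagonal, can be sketched as follows. The function names are hypothetical, the $1/n$ scaling is an assumption of this sketch, and the Gaussian sample is only a stand-in for real data.

```python
import numpy as np

def empirical_covariance(samples):
    """Empirical covariance of an (n, d) sample array, centred at the sample mean.

    The 1/n scaling is an assumption; 1/(n - 1) is the other common choice.
    """
    samples = np.asarray(samples, dtype=float)
    centred = samples - samples.mean(axis=0)
    return centred.T @ centred / samples.shape[0]

def to_unit_diagonal(cov):
    """Rescale rows and columns so that every diagonal entry equals one."""
    scale = 1.0 / np.sqrt(np.diag(cov))
    return cov * np.outer(scale, scale)

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))   # n = 50 observations in dimension d = 4
emp = empirical_covariance(data)
corr = to_unit_diagonal(emp)
```

After normalization every off-diagonal entry lies in $[-1,1]$, so a single threshold scale applies to all variable pairs.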
2.1 Concentration Confidence Set
The first step is to construct a confidence set for about . Theoretical justification of the following is provided in Section 3.
Given a false positive rate , we construct a ball centred on as follows.
i. Find for some .
ii. Compute , the -quantile of the magnitudes of the off-diagonal entries in . That is, is the smallest real number such that
[TABLE]
iii. Apply hard thresholding to with threshold to get whose entries are
[TABLE]
that is, set off-diagonal entries to zero if they were originally less than in magnitude.
iv. Construct the operator norm ball about of radius .
v. Use a suitable concentration inequality to determine a bound on the coverage of this ball as a confidence set.
What we have now is
[TABLE]
This set will be searched for its sparsest member using the algorithm in the following subsection.
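The quantile and hard thresholding steps of this construction can be sketched as below. The helper names and the example matrix are hypothetical, the quantile is taken over the upper-triangular off-diagonal magnitudes, and how the quantile level relates to the chosen false positive rate is left to the caller rather than assumed here.

```python
import numpy as np

def offdiag_quantile(cov, level):
    """Smallest t such that at least a `level` fraction of the off-diagonal
    magnitudes are less than or equal to t."""
    d = cov.shape[0]
    mags = np.sort(np.abs(cov[np.triu_indices(d, k=1)]))
    k = max(int(np.ceil(level * mags.size)) - 1, 0)
    return float(mags[k])

def hard_threshold(cov, t):
    """Zero every off-diagonal entry whose magnitude is below t; keep the diagonal."""
    out = np.where(np.abs(cov) >= t, cov, 0.0)
    np.fill_diagonal(out, np.diag(cov))
    return out

cov = np.array([
    [1.00, 0.05, -0.10, 0.20],
    [0.05, 1.00, 0.30, -0.40],
    [-0.10, 0.30, 1.00, 0.60],
    [0.20, -0.40, 0.60, 1.00],
])
t = offdiag_quantile(cov, 0.5)          # median-type quantile of the six magnitudes
thresholded = hard_threshold(cov, t)
kept = int(np.count_nonzero(thresholded[np.triu_indices(4, k=1)]))
```

In this toy example the two smallest correlations are removed and the four entries with magnitude at least `t` survive.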
2.2 Thresholding within confidence sets
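A generalized thresholding operator, as defined in Rothman et al. (2009), is a function $s_\lambda:\mathbb{R}\to\mathbb{R}$ such that
$$|s_\lambda(z)|\le|z|,\qquad s_\lambda(z)=0\ \text{ for }\ |z|\le\lambda,\qquad |s_\lambda(z)-z|\le\lambda,$$
which will apply element-wise to a matrix. In the past, such an operator is applied to the empirical estimate $\hat\Sigma^{\mathrm{emp}}$ for some $\lambda$ generally chosen via cross validation. Instead of directly choosing a threshold $\lambda$, our approach is to find the largest $\lambda$ such that $s_\lambda(\hat\Sigma^{\mathrm{emp}})$ remains inside the confidence ball constructed in Section 2.1.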
i. Set to be the empirical estimator normalized to have a diagonal of ones. Initialize the threshold to and write . Let be the number of steps of the recursion. Choose a false positive rate and compute as in the previous section.
ii. Increase , then update as follows.
   (a) If , set .
   (b) Otherwise, set .
iii. Repeat step ii until has reached the desired number of iterations. Generally, as few as will suffice.
iv. The resulting estimator is where is the final matrix resulting from this recursion.
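The recursion can be read as a bisection over the threshold. The sketch below is one such reading under stated assumptions rather than the authors' exact update rule: the initial threshold of 0.5, the halving step sizes, the use of hard thresholding, and the operator norm distance are all choices made here, and the names `sparsest_in_ball`, `radius`, and `steps` are hypothetical.

```python
import numpy as np

def hard_threshold(cov, t):
    """Zero every off-diagonal entry whose magnitude is below t; keep the diagonal."""
    out = np.where(np.abs(cov) >= t, cov, 0.0)
    np.fill_diagonal(out, np.diag(cov))
    return out

def sparsest_in_ball(corr, radius, steps=10):
    """Bisection on t in [0, 1] for the largest threshold whose hard-thresholded
    matrix stays within `radius` of `corr` in operator norm."""
    t, step = 0.5, 0.25
    best = 0.0  # t = 0 leaves corr unchanged, so it always lies inside the ball
    for _ in range(steps):
        dist = np.linalg.norm(hard_threshold(corr, t) - corr, ord=2)
        if dist <= radius:
            best = max(best, t)   # remember the largest threshold seen inside the ball
            t += step
        else:
            t -= step
        step /= 2.0
    return best, hard_threshold(corr, best)

corr = np.array([
    [1.00, 0.05, 0.10],
    [0.05, 1.00, 0.60],
    [0.10, 0.60, 1.00],
])
radius = 0.15
best, estimate = sparsest_in_ball(corr, radius)
```

Here removing the two small correlations moves the matrix by about 0.112 in operator norm, which fits inside the ball, while removing the 0.6 entry does not, so the search settles just below 0.6. Tracking `best` separately means the returned matrix is always inside the ball even when the distance is not monotone in the threshold.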
Remark 2.1 (Positive Definite Estimators).
If the estimator $\hat\Sigma^{\mathrm{sp}}$ is not positive semi-definite, then it can be projected onto the space of positive semi-definite matrices. A standard past approach is to map the negative eigenvalues to zero or to their absolute value, which maintains the eigen-structure. However, such a projection will have an adverse effect on the support recovery problem as the estimator will no longer be sparse. An alternative is to map $\hat\Sigma^{\mathrm{sp}}\mapsto\hat\Sigma^{\mathrm{sp}}+\varepsilon I_d$ for some $\varepsilon>0$ large enough to make the result positive definite. This will not affect the recovered support of the matrix. More clever projections may also be possible.
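The diagonal shift mentioned in this remark can be sketched in a few lines. The function name and the `margin` parameter are hypothetical; the shift is chosen as the smallest multiple of the identity that lifts the least eigenvalue to `margin`.

```python
import numpy as np

def shift_to_positive_definite(mat, margin=1e-6):
    """Add a multiple of the identity so the smallest eigenvalue is at least `margin`."""
    smallest = np.linalg.eigvalsh(mat).min()
    eps = max(0.0, margin - smallest)
    return mat + eps * np.eye(mat.shape[0])

# A sparse symmetric matrix that is not positive semi-definite.
mat = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.9],
    [0.0, 0.9, 1.0],
])
shifted = shift_to_positive_definite(mat)
same_support = bool(np.array_equal(shifted != 0, mat != 0))
```

Only the diagonal changes, so the pattern of zero and non-zero off-diagonal entries is exactly the one recovered before the shift.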
In the case that the metric is a monotonically increasing function of the Hilbert-Schmidt / Frobenius norm or another entrywise norm, then the sequence will be increasing in .
Proposition 2.2.
In the context of the above algorithm, if , then for any , we have
[TABLE]
Proof.
As , the entries of the matrix are equal to or larger in absolute value than the entries of . Hence by definition 1.3. ∎
This property guarantees that the above algorithm will find the sparsest in the confidence set in the sense of having the largest threshold possible. However, for an arbitrary metric or specifically other -Schatten norms, this sequence may not necessarily be strictly increasing in . Another commonly used norm, which will be shown in Section 5 to give superior performance in simulation, is the operator norm , which does not yield a monotonically increasing sequence. Though, this sequence is roughly increasing in the sense that it is lower bounded by definition by the maximum norm of the columns of , which is an increasing sequence. Furthermore, it is upper bounded by the norm of the columns of , which follows from the Gershgorin circle theorem (Iserles, 2009), and which is also an increasing sequence.
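The sandwich bound on the operator norm described above (largest column $\ell^2$ norm below, largest column $\ell^1$ norm above) can be checked numerically for any symmetric matrix. The random matrix below is only a stand-in for a difference between the thresholded and empirical estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(6, 6))
sym = (raw + raw.T) / 2.0                        # symmetric test matrix

lower = float(np.linalg.norm(sym, axis=0).max())  # largest column l^2 norm
upper = float(np.abs(sym).sum(axis=0).max())      # largest column l^1 norm (Gershgorin-type bound)
op_norm = float(np.linalg.norm(sym, ord=2))       # operator norm (largest singular value)
```

Both bounds are cheap to evaluate, which makes them a convenient sanity check on the operator norm distance during the search.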
3 Fixing a false positive rate
For many sparse matrix estimation methods, theorems demonstrating sparsistency are proved. These indicate that, in some asymptotic sense, the correct support of the true matrix will eventually be recovered, generally as $n$ and $d$ grow together at some rate. However, none provide a method for fixing a false positive rate and finding an estimator that satisfies such a rate, which is certainly of interest to any practitioner with a finite fixed sample size. Hence, we present a method for tuning our parameter to a desired false positive rate for the covariance estimator.
Before proceeding, we will require a class of sparse matrices similar to those from Bickel and Levina (2008a, b); Rothman et al. (2009); Cai and Liu (2011). Specifically, let
[TABLE]
For the results regarding the false positive rate, we are not concerned with the lower bound and only with , the maximum number of non-zero entries per column or row. As long as increases more slowly than the dimension , which is made specific below, we can achieve a desired false positive rate without interference.
For an estimator $\hat\Sigma$, the false positive rate is
$$\frac{\#\{(i,j):\hat\sigma_{i,j}\neq 0,\ \sigma_{i,j}=0\}}{\#\{(i,j):\sigma_{i,j}=0\}},$$
where $\sigma_{i,j}$ is the $(i,j)$th entry of the true covariance matrix $\Sigma$ and $\hat\sigma_{i,j}$ is the $(i,j)$th entry of the estimator $\hat\Sigma$. Hence, we are counting the number of non-zero entries in our estimator that should have been zero. For notation, let $\hat\Sigma^{\mathrm{emp}}$ be the usual empirical estimate of the covariance matrix. Let $\hat\Sigma_{0}$ be the empirical estimator with all off diagonal entries set to zero, thus guaranteeing a false positive rate of zero. For $\rho\in(0,1)$, let $\hat\Sigma_{\rho}$ be the empirical estimator after application of the hard threshold operator with threshold $t_\rho$, which removes $100(1-\rho)$% of the off diagonal entries, achieving a false positive rate of approximately $\rho$ due to the following lemma.
Lemma 3.1.
Let from Equation 3 with . Let the threshold, , be the quantile of with , and let the corresponding thresholded estimator be with th entry denoted . Then, denoting , we have that for
[TABLE]
for some constant .
Remark 3.2.
For this lemma, we want the -quantile of the mean zero entries, but have to work with the -quantile of the entire collection, which is contaminated by a small number of elements with non-zero mean. For , the error is hence for , thresholding based on the -quantile suffices for large enough . For small , say , we have to work harder motivating Theorem 3.1 below.
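In a simulation where the true covariance matrix is known, the false positive rate discussed in this section, together with the true positive rate, can be computed as in the following sketch. Counting only the off-diagonal pairs above the diagonal is an assumption made here, and the function name `support_rates` is hypothetical.

```python
import numpy as np

def support_rates(truth, estimate):
    """False and true positive rates of the estimated support over the
    off-diagonal pairs i < j."""
    d = truth.shape[0]
    upper = np.triu_indices(d, k=1)
    true_nonzero = truth[upper] != 0
    est_nonzero = estimate[upper] != 0
    n_zero = int((~true_nonzero).sum())
    n_nonzero = int(true_nonzero.sum())
    fp = (est_nonzero & ~true_nonzero).sum() / n_zero if n_zero else 0.0
    tp = (est_nonzero & true_nonzero).sum() / n_nonzero if n_nonzero else 0.0
    return float(fp), float(tp)

truth = np.eye(4)
truth[0, 1] = truth[1, 0] = 0.5
truth[2, 3] = truth[3, 2] = 0.4

estimate = np.eye(4)
estimate[0, 1] = estimate[1, 0] = 0.45   # correctly recovered entry
estimate[0, 2] = estimate[2, 0] = 0.10   # spurious entry

fp_rate, tp_rate = support_rates(truth, estimate)
```

Of the four truly zero pairs one is wrongly kept, and of the two truly non-zero pairs one is recovered.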
As noted in the remark, we cannot continue to threshold based on the sample quantiles for very small false positive rates. However, using the matrices, and , as reference points, we can interpolate via the following theorem to achieve any desired false positive rate.
Theorem 3.1.
Let from Equation 3 with for . Given a desired false positive rate, , and for some , let be the hard thresholded empirical estimator that achieves a false positive rate of . Then,
[TABLE]
where are universal constants.
Remark 3.3.
The above Theorem 3.1 is wholly uninteresting for large values of . However, its power arises in the non-asymptotic realm of interest—namely when —and also from highlighting the interplay between the dimension, sample size, and , the sparseness of the estimator. Furthermore, this result does not require any distributional assumption. It also does not require any assumption on the lower bound on the non-zero as it is only concerned with the that are zero.
The proof of the above theorem relies on the following lemma involving symmetrization of random covariance matrices, which may be of independent interest.
Lemma 3.4.
Let be a real valued symmetric random matrix with zero diagonal and mean zero entries bounded by 1 and not necessarily iid, and let be an iid symmetric Bernoulli random matrix with entries for . Denoting the entrywise or Hadamard product by , let . Let be a symmetric random matrix with iid Rademacher entries for and . Then,
[TABLE]
where are universal constants.
4 Concentration Confidence Sets
The following three subsections detail different assumptions on the data under scrutiny and the specific concentration results that apply in these cases. We consider sub-Gaussian concentration for log concave measures and for bounded random variables. We also consider sub-exponential concentration. However, this collection is by no means exhaustive. Given the wide variety of concentration inequalities being developed, our approach can be applied much more widely than to merely these three settings.
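Let $d(\cdot,\cdot)$ be some metric measuring the distance between two covariance matrices, and let $\psi:\mathbb{R}\to\mathbb{R}$ be monotonically increasing. Then, the general form of the concentration inequalities is
$$P\big(d(\hat\Sigma^{\mathrm{emp}},\Sigma)\ge \mathrm{E}\,d(\hat\Sigma^{\mathrm{emp}},\Sigma)+r\big)\le e^{-\psi(r)},$$
which is a bound on the tail of the distribution of $d(\hat\Sigma^{\mathrm{emp}},\Sigma)$ as it deviates above its mean. Thus, to construct a $(1-\alpha)$-confidence set, the variable $r=r_\alpha$ is chosen such that $e^{-\psi(r_\alpha)}=\alpha$.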
Now, let be our sparse estimator for . We want these two to be close in the sense of the above confidence set and therefore choose a such that . Consequently, we have that
[TABLE]
Hence, we choose close enough to to share its elegant concentration properties, but far enough away to result in a better estimator for .
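When the tail bound takes the form $\exp(-\psi(r))$ for a monotonically increasing $\psi$ with $\psi(0)=0$, the radius matching a coverage level $1-\alpha$ can be found by numerically inverting $\psi$. The sketch below uses bisection; the function name `radius_for_level` and the sub-Gaussian-type choice $\psi(r)=nr^2/2$ are purely illustrative assumptions, not the rate from any result in this article.

```python
import math

def radius_for_level(psi, alpha, hi=1.0, tol=1e-10):
    """Solve exp(-psi(r)) = alpha for r by bisection, assuming psi is increasing
    and psi(0) = 0."""
    target = math.log(1.0 / alpha)
    while psi(hi) < target:      # expand the bracket until it contains the root
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if psi(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

n = 100

def psi(r):
    # Hypothetical sub-Gaussian-type rate used only for illustration.
    return n * r ** 2 / 2.0

r_alpha = radius_for_level(psi, alpha=0.05)
```

For this illustrative $\psi$ the radius has the closed form $\sqrt{2\log(1/\alpha)/n}$, which gives a direct check on the numerical inversion.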
4.1 Log Concave Measures
In this section, the general methods from Section 2 are specialized for an iid sample whose common measure is strongly log-concave. This property implies dimension-free sub-Gaussian concentration and includes such common distributions as the multivariate Gaussian, Chi, and Dirichlet distributions.
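Definition 4.1 (Strongly log-concave measure).
A measure $\mu$ on $\mathbb{R}^{d}$ is strongly log-concave if there exists a $c>0$ such that $d\mu=e^{-U(x)}\,dx$ and $\mathrm{Hess}(U)-cI_d\ge 0$ (i.e. is non-negative definite), where $\mathrm{Hess}(U)$ is the $d\times d$ matrix of second derivatives.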
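For a centred multivariate Gaussian with covariance $\Sigma$, the potential is $U(x)=x^{\mathrm T}\Sigma^{-1}x/2$ up to an additive constant, so its Hessian is $\Sigma^{-1}$ and the largest admissible coefficient is the reciprocal of the largest eigenvalue of $\Sigma$. The sketch below checks this numerically for an arbitrary example covariance chosen here for illustration.

```python
import numpy as np

sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # example Gaussian covariance
hessian = np.linalg.inv(sigma)               # Hessian of U(x) = x' Sigma^{-1} x / 2

# Largest c with Hess(U) - c I non-negative definite: smallest Hessian eigenvalue.
c = float(np.linalg.eigvalsh(hessian).min())
gap = float(np.linalg.eigvalsh(hessian - c * np.eye(2)).min())
```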
From Corollary S4.5, let have measures , which are all strongly log-concave with coefficients . Let be the product measure on . Then, for any -Lipschitz and for any ,
[TABLE]
This follows from Theorem S4.4 and the other results contained within the supplementary material. For a detailed exposition of how sub-Gaussian concentration is established for log concave measures, see Chapter 5 of Ledoux (2001). Examples include the multivariate Gaussian and the Dirichlet distributions.
To make use of the above result, we must choose a suitable Lipschitz function . Let be independent and identically distributed random variables with covariance and with a common strongly log-concave measure with coefficient . Let be the eigenvalues of and . For some , let be the -Schatten norm, which in this case is . Note that for any . Define the function to be
For , we have that is Lipschitz with coefficient with respect to the Frobenius or Hilbert-Schmidt metric, which is established in Proposition S3.5 for and and in Proposition S3.2 for . That is, let , and denote and , then From here, the procedure outlined in Section 2 can be considered with the given and .
In many cases, including the two examples above, the constructed confidence set is completely dimension-free. Thus, even mild assumptions on the relationship between the sample size and the dimension , such as from the adaptive soft thresholding estimator of Cai and Liu (2011), are not needed to prove consistency in our setting. Furthermore, the concentration inequalities immediately give us a fast rate of convergence as long as with a proof provided in the supplementary material.
Theorem 4.1.
Let be iid with common measure . Let be strictly log concave with some fixed constant from Definition 4.1. Then, for , , and ,
[TABLE]
Remark 4.2.
This theorem effectively says that choosing an estimator in the ball centred around cannot be too bad assuming the niceness of log-concave measures. It also tells us how fast we can shrink the ball as increases.
A second and arguably more important issue (see the supplementary material) in the setting of sparse covariance estimation is that of support recovery or “sparsistency” (Lam and Fan, 2009; Rothman et al., 2009). To recover the support of a covariance matrix, that is, to determine which entries $\sigma_{i,j}\neq 0$, we will require a class of sparse matrices from Equation 3. In past work, a notion of “approximate sparsity” is considered where the first condition in is replaced with for . However, once we bound the non-zero entries away from zero by some , such “approximate sparsity” implies standard sparsity with . It is worth noting that the above Theorem 4.1 does not require such a sparsity class, because our estimator is forced to remain close enough to $\hat\Sigma^{\mathrm{emp}}$ to follow its convergence to $\Sigma$.
Theorem 4.2.
Let be iid with common measure . Let be strictly log concave with some fixed constant from Definition 4.1. Furthermore, let and let for any . Then, for denoting the concentration estimator using the hard thresholding estimation from Section 2.2 with the operator norm metric,
[TABLE]
where .
Remark 4.3.
Note that the condition that allows for a much quicker decay of the non-zero entries of than in El Karoui (2008) where the lower bound is of the form with . It is also much quicker than the similar rate achieved in Rothman et al. (2009) where the lower bound is any such that increases faster than with the enforced asymptotic condition that resulting in a rate no faster than . Though, it is worth noting that if decays to zero at a faster rate, then the above convergence rate for support recovery slows as can be seen in the proof.
5 Numerical simulations
In the following subsections, we apply the methods from the previous sections to three multivariate distributions of interest: the Gaussian, Laplace, and Rademacher distributions. In doing so, we apply Theorem 3.1 to analytically determine the ideal confidence ball radius in order to construct a sparse estimator of . We also compare the support recovery of our approach against penalized estimators and standard application of universal threshold estimators.
As mentioned before, our proposed concentration confidence set based method has a similar feel to regularized / penalized estimators as the larger the constructed confidence set is, the sparser the returned estimator will be. Thus, we compare our approach with the following lasso style estimator from the R package PDSCE (Rothman, 2013), which optimizes
[TABLE]
with . Here, the term is used to enforce positive definiteness of the final solution, and is the lasso style penalty, which enforces sparsity.
The similar method from the R package spcov (Bien and Tibshirani, 2012), which uses a majorize-minimize algorithm to determine
[TABLE]
for some penalization , was also considered but proved to run too slowly on high dimensional matrices—that is, —to be included in the numerical experiments.
We also compare our method against the four universal thresholding estimators of Rothman et al. (2009), applied to the empirical covariance matrix, namely Hard, Soft, SCAD, and Adaptive LASSO:
[TABLE]
where is the th entry of the empirical covariance estimate, , and . The parameter is the threshold, which is chosen in practice via cross validation with respect to the Hilbert-Schmidt norm. Briefly, the data is split in half, two empirical estimators are formed, one is thresholded, and is selected to minimize the Hilbert-Schmidt distance between the one empirical estimate and the other thresholded estimate.
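The four thresholding rules and the split-sample choice of the threshold can be sketched as follows. This is a minimal illustration, not the authors' code: the rule definitions are the standard ones from the thresholding literature, the SCAD constant `a=3.7` and adaptive LASSO exponent `eta=1.0` are assumed defaults, leaving the diagonal unthresholded is an assumption, and the helper names `threshold_offdiag` and `cv_threshold` are hypothetical.

```python
import numpy as np

def hard(z, lam):
    # Keep entries whose magnitude exceeds the threshold, zero the rest.
    return z * (np.abs(z) > lam)

def soft(z, lam):
    # Shrink every entry towards zero by lam.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad(z, lam, a=3.7):
    # Soft thresholding near zero, no shrinkage beyond a * lam,
    # and a linear interpolation in between (a = 3.7 is an assumed default).
    absz = np.abs(z)
    mid = ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return np.where(absz <= 2.0 * lam, soft(z, lam),
                    np.where(absz <= a * lam, mid, z))

def adaptive_lasso(z, lam, eta=1.0):
    # Shrinkage that vanishes for large entries (eta = 1 is an assumed default).
    absz = np.abs(z)
    safe = np.where(absz > 0, absz, 1.0)  # avoid division by zero
    shrunk = np.maximum(absz - lam ** (eta + 1.0) / safe ** eta, 0.0)
    return np.where(absz > 0, np.sign(z) * shrunk, 0.0)

def threshold_offdiag(S, rule, lam):
    # Assumption: only off-diagonal entries are thresholded.
    T = rule(S, lam)
    np.fill_diagonal(T, np.diag(S))
    return T

def cv_threshold(X, rule, lams, rng):
    # Single random split: threshold one empirical estimate and compare it
    # with the other, untouched estimate in Hilbert-Schmidt (Frobenius) norm.
    n = X.shape[0]
    idx = rng.permutation(n)
    S_a = np.cov(X[idx[: n // 2]], rowvar=False)
    S_b = np.cov(X[idx[n // 2:]], rowvar=False)
    losses = [np.linalg.norm(threshold_offdiag(S_a, rule, lam) - S_b, "fro")
              for lam in lams]
    return lams[int(np.argmin(losses))]
```

A call such as `cv_threshold(X, soft, np.linspace(0.0, 1.0, 21), np.random.default_rng(0))` returns the grid value with the smallest split-sample loss; in practice the split would typically be repeated and the losses averaged.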
5.1 Multivariate Gaussian Data
Let be independent and identically distributed mean zero random vectors with a strictly log concave measure and covariance matrix . By Corollary S4.5, there exists a constant such that where is the empirical estimate of the covariance matrix. This results in the size confidence set for
[TABLE]
for . In the notation of Section 4.1, .
In the multivariate Gaussian case, is the maximal eigenvalue of the covariance matrix . As mentioned before, we avoid any issues of estimating in practice. Regardless of our choice for , tuning the regularization parameter to a specific false positive rate negates the need for an accurate estimate of .
Table 1 displays false positive and true positive percentages for seven sparse estimators computed over 100 replications of a random sample of size of dimensional multivariate Gaussian data with a tri-diagonal covariance matrix whose diagonal entries are 1 and whose off-diagonal entries are 0.3. We can clearly see that the concentration-based estimator approaches the desired false positive rate—either 1% or 5%—as the dimension increases. In contrast, the thresholding estimators with threshold chosen via cross validation generally start with higher false positive percentages, which tend to zero as the dimension increases. As noted in previous work, hard thresholding is overly aggressive. The PDS method is very stable across changes in the dimension and maintains a constant 3.4% false positive rate and 50% true positive rate.
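The false positive and true positive percentages reported in such a table can be computed along the following lines. This is only a sketch of the bookkeeping for one replication: the sample size `n = 50`, the fixed hard threshold `0.25`, and the helper names `tridiagonal_cov` and `support_rates` are illustrative assumptions, and the estimator used here is a plain hard threshold rather than the concentration-based estimator.

```python
import numpy as np

def tridiagonal_cov(d, off=0.3):
    # Ones on the diagonal, `off` on the first off-diagonals, zero elsewhere.
    return np.eye(d) + off * (np.eye(d, k=1) + np.eye(d, k=-1))

def support_rates(est, truth):
    # Percentages are taken over off-diagonal entries only (an assumption).
    d = truth.shape[0]
    offdiag = ~np.eye(d, dtype=bool)
    true_nonzero = (truth != 0) & offdiag
    true_zero = (truth == 0) & offdiag
    est_nonzero = est != 0
    fp = 100.0 * np.sum(est_nonzero & true_zero) / np.sum(true_zero)
    tp = 100.0 * np.sum(est_nonzero & true_nonzero) / np.sum(true_nonzero)
    return fp, tp

rng = np.random.default_rng(0)
d, n = 50, 50  # n is an assumed sample size for illustration
Sigma = tridiagonal_cov(d)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
S = np.cov(X, rowvar=False)

# Illustrative estimator: hard thresholding at a fixed, arbitrary level.
est = S * (np.abs(S) > 0.25)
fp, tp = support_rates(est, Sigma)
print(f"false positive %: {fp:.1f}, true positive %: {tp:.1f}")
```

Averaging `fp` and `tp` over repeated draws of `X` gives table entries of the kind shown; raising the threshold trades true positives for fewer false positives.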
5.2 Multivariate Laplace Data
There are many possible ways to extend the univariate Laplace distribution, also referred to as the double exponential distribution, onto . For the following simulation study, we choose the extension detailed in Eltoft et al. (2006). Namely, let and let . Then, , which has pdf and variance . For the multivariate setting, now let be multivariate Gaussian with zero mean and covariance and, once again, let . Then, we declare to have a multivariate Laplace distribution with zero mean and covariance .
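A sampler for this scale-mixture construction might look like the following sketch. It assumes the mixing variable is exponential with mean one, so that the resulting vectors have covariance equal to the Gaussian covariance; the function name `sample_multivariate_laplace` is hypothetical.

```python
import numpy as np

def sample_multivariate_laplace(n, Sigma, rng):
    # Scale mixture of Gaussians: X = sqrt(W) * Z with Z ~ N(0, Sigma) and
    # W ~ Exponential with mean 1 (assumed), drawn independently of Z.
    d = Sigma.shape[0]
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    W = rng.exponential(scale=1.0, size=n)
    return np.sqrt(W)[:, None] * Z
```

Because the mixing variable has mean one under this assumption, the covariance of each sampled vector is E[W] times the Gaussian covariance, which is the Gaussian covariance itself.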
Table 2 displays false positive and true positive percentages for seven sparse estimators computed over 100 replications of a random sample of size of dimensional multivariate Laplace data with a tri-diagonal covariance matrix whose diagonal entries are 1 and whose off-diagonal entries are 0.3. Similarly to the previous setting, the concentration-based estimator approaches the desired false positive rate (either 1% or 5%) as the dimension increases. All universal thresholding estimators set most of the entries in the matrix to zero when the threshold is chosen via cross validation. The PDS method is still stable across changes in the dimension but fixates on a much higher false positive rate of around 12.5% and a similar true positive rate of 51%.
5.3 Small Round Blue-Cell Tumour Data
Following the same analysis performed in Rothman et al. (2009) and subsequently in Cai and Liu (2011), we will consider the data set resulting from the small round blue-cell tumour (SRBCT) microarray experiment (Khan et al., 2001). The data set consists of a training set of 64 vectors containing 2308 gene expressions. The data contains four types of tumours denoted EWS, BL-NHL, NB, and RMS. As performed in the two previous papers, the genes are ranked by their respective amount of discriminative information according to their -statistic
[TABLE]
where is the sample mean, is the number of classes, is the sample size, is the sample size of class , and likewise, and are, respectively, the sample mean and variance of class . The top 40 and bottom 160 scoring genes were selected to provide a mix of the most and least informative genes.
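The gene ranking step can be sketched as below, assuming the statistic is the usual one-way ANOVA F-statistic computed gene by gene, which is what the listed ingredients (class means, class variances, class sizes, number of classes) suggest. The function name `f_statistics` and the array layout (rows are samples, columns are genes) are assumptions.

```python
import numpy as np

def f_statistics(X, labels):
    # One-way ANOVA F-statistic for every column (gene) of X.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[labels == c]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - grand_mean) ** 2
        within += (n_c - 1) * Xc.var(axis=0, ddof=1)
    return (between / (k - 1)) / (within / (n - k))

# Selecting the most and least informative genes:
# order = np.argsort(f_statistics(X, labels))[::-1]
# selected = np.concatenate([order[:40], order[-160:]])
```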
Table 3 displays the results of applying the four threshold estimators with cross validation, the PDS method, and our concentration-based thresholding with the sub-Gaussian formula and with false positive rates of 10, 5, and 1 percent. The percentage of matrix entries that are retained for the most informative block and the least informative block are tabulated. Depending on the chosen false positive rate, our concentration-based estimators give similar results to Soft and SCAD thresholding. PDS is the least conservative of the methods as it keeps the most entries. Hard and Adaptive LASSO thresholding are the most aggressive methods.
It is also worth noting that our method is computationally efficient enough to consider the entire matrix at once. In fact, it took only 131.3 seconds to compute on an Intel i7-7567U CPU, 3.50GHz. In contrast, the PDS method, which still has significantly faster run times than cross validating the threshold estimators, took over 101 minutes to finish. False positive rates of 5%, 1%, and 0.1% were tested. The fraction of non-zero entries in was 8.6%, 2.0%, and 0.22%, respectively. For comparison, the fraction of non-zero entries retained by PDS was 17.7%. If such an analysis is meant to lead to follow-up research on specific gene pairings, then culling as many false positives as possible is of critical importance. The sparse covariance estimator was partitioned into blocks and the number of non-zero entries was tabulated for each. The results are displayed in Figure 2.
6 Supplementary Material
The supplementary material consists of five sections. The first parallels Section 4.1 and considers sub-exponential measures and bounded random variables as well as some additional simulations for multivariate Rademacher random variables. The second contains proofs of the lemmas and theorems presented in the main article. The third contains additional simulations motivating why our support recovery approach is better than past approaches. The fourth contains derivations of Lipschitz coefficients for the functions used in Section 4. The fifth is expository and contains past results from the concentration of measure literature that were directly used in this work.
Appendix A Sub-Exponential and Bounded Data
In line with our discussion of log concave measures in the main article, we include some information on sub-exponential measures and data that is bounded.
A.1 Sub-Exponential Distributions
Compared with the previously discussed measures with sub-Gaussian concentration, there exists a larger class of measures with sub-exponential concentration. Such measures can be specified as those that satisfy the Poincaré or spectral gap inequality [Bobkov and Ledoux, 1997, Ledoux, 2001, Gozlan, 2010]. For a random variable on with measure , this is
[TABLE]
for some and for all locally Lipschitz functions .
If satisfies such an inequality, then (see Theorem S4.6 or Chapter 5 of Ledoux [2001]) for iid copies of and for some Lipschitz function ,
[TABLE]
where is a constant depending only on and
[TABLE]
As in the log concave setting discussed in the main paper, is chosen to be
[TABLE]
which is Lipschitz with constant . This results in values of and for the above coefficients. Hence, the radius in this setting is computed to be . While an optimal (or reasonable) value for may not be known, it makes little difference given the proposed procedure for choosing detailed in the main paper for a desired false positive rate. This is because the term will be equivalently tuned to determine the optimal size of the constructed confidence set.
As in this setting is bounded below by a constant , we do not achieve the nice convergence results as in the log concave setting. However, the dimension-free concentration still allows for good performance in simulation settings as was seen in Section 6.
A.2 Bounded Random Variables
In this section, we consider random variables that are bounded in some norm. Consider a Banach space and a collection of iid random variables such that for all . Given only this assumption, the bounded differences inequality, detailed in the supplementary material and in Section 3.3.4 of Giné and Nickl [2016], can be applied in this specific setting. It provides sub-Gaussian concentration for such random variables.
Specifically, let be iid with for . Then, for any , , and
[TABLE]
This follows immediately from Theorem S4.8.
Hence, for any collection of real valued random vectors bounded in Euclidean norm, the bounded differences inequality can be applied to the empirical estimate for any of the -Schatten norms. The radius is . However, unlike in the previous setting, the bounds may not necessarily be dimension free.
Example A.1 (Distributions on the Hypercube).
If the components such as for multivariate uniform or Rademacher random variables, then . Consequently, is not dimension free. While this makes estimation with respect to operator norm distance challenging, we can still use Theorem 1 to fix the false positive rate.
A.3 Simulations on High Dimensional Binary Vectors
Random binary vectors fall into the category of bounded random variables, which have sub-Gaussian concentration as a consequence of the bounded differences inequality (an extension of Hoeffding’s inequality), as discussed in Section A.2. The result is a slightly different form for the confidence balls compared with the log concave setting. While the concentration behaviour in this setting relies on the dimension and is poor for producing an estimator that is close in operator or Hilbert-Schmidt norm, our support recovery methodology is still able to perform well.
Table 4 displays false positive and true positive percentages for seven sparse estimators computed over 100 replications of a random sample of size of dimensional multivariate Rademacher data with a tri-diagonal covariance matrix whose diagonal entries are 1 and whose off-diagonal entries are 0.3. As a consequence of the bounded differences inequality, this case also exhibits sub-Gaussian behaviour, and hence Table 4 is similar to Table 1 from the main article. Concentration estimators perform better as increases; threshold estimators are overly aggressive as increases; and the PDS method’s support recovery is unaffected by the change in .
Appendix B Proofs
Proof of Lemma 1.
We begin with the collection of random variables , which we will denote . Without loss of generality, assume that have mean zero and have nonzero mean and . To achieve false positives, we would find the index corresponding to the order statistic of the , and set all entries to zero. Instead, we find the index corresponding to the order statistic of all the .
Given that , we have
[TABLE]
Thus, when considering the achieved false positive rate to the target rate , we have
[TABLE]
∎
Proof of Lemma 2.
This proof follows from the result of Latała [2005] Theorem 2—also found in Theorem 2.3.8 of Tao [2012]—without the assumption of iid entries in the random matrix but with many entries equal to zero.
We first apply the expectation with respect to and use the result from Latała [2005].
[TABLE]
with universal constants. For the second term in the above equation, we have via Jensen’s inequality and the fact that that
[TABLE]
For the first term in the above equation, we make use of the fact that and that only are non-zero resulting in
[TABLE]
Combining the above results and updating the constants as necessary gives the desired result
[TABLE]
∎
Proof of Theorem 1.
Without loss of generality, we can normalize such that the diagonal entries are 1. Thus , the dimensional identity matrix, and the off-diagonal entries of all matrices considered will be bounded in absolute value by one.
For the empirical covariance estimator, . We can decompose into three parts: the diagonal of ones; the off-diagonal terms corresponding to ; and the off-diagonal terms corresponding to . The number of non-zero off-diagonal terms is bounded in each row/column by . Hence,
[TABLE]
where has entries such that .
Let the entrywise or Hadamard product of two similar matrices and be with entry th entry . For ease of notation, we denote . Let be the result of randomly removing half of the entries from , which is where is a symmetric random matrix with iid entries. Considering the corresponding symmetric Rademacher random matrix, , we then have that
[TABLE]
where the comes from the symmetry of . Thus,
[TABLE]
This idea can be iterated. Let with the iid copies of from before. Then, similarly,
[TABLE]
Moreover,
[TABLE]
Applying Lemma 3.5 times and updating universal constants as necessary results in
[TABLE]
Thus, for , we have
[TABLE]
We want to replace the with and similarly for . The off-diagonal entries such that can contribute at most , , to the operator norm. Hence,
[TABLE]
We lastly apply the crude—but effective in the non-asymptotic setting—bound almost surely. Dividing by results in
[TABLE]
Thus, we require to make the final term negligible for large with respect to the others.
We can extend this result to arbitrary by using the simple observation that given such a , there exists an such that . Therefore, setting and replacing with the corresponding matrix from Lemma 3.1 allows us to proceed as above. ∎
Proof of Theorem 2.
From the derivation in Section 2, we have that
[TABLE]
for any such that . Writing and and squaring and rearranging the terms gives,
[TABLE]
Given the standard convergence result for the empirical covariance matrix that and our definition of , we now have that
[TABLE]
which holds for any such that . ∎
Proof of Theorem 3.
Let be the result of a perfect thresholding of the empirical covariance estimator. That is, has support identical to the true and non-zero entries that coincide with . Furthermore, let be some other overly-sparse covariance estimator resulting from zeroing entries in , but with more zeros than . For a radius , is the sparsest element in the corresponding confidence ball.
[TABLE]
which, assuming a large enough sample size , are the two mutually exclusive events that the estimator with correct support is not in the ball of radius and that a sparser estimator is in the ball.
For the first term in Equation B.1, we show that the probability that a matrix with the correct support lying outside of the confidence set will tend to zero.
[TABLE]
For , we have that and that . Let for simplicity of notation. Then, using the concentration result already established for Lipschitz functions of log concave measures,
[TABLE]
for some positive .
For , applying the Gershgorin circle theorem [Iserles, 2009] to the operator norm gives
[TABLE]
where is the maximal number of non-zero entries in any given column. From Proposition D.5, we have that is Lipschitz with constant . As the squared Frobenius norm is equal to the sum of the squares of the entries of the matrix, we in turn have that the entries are also Lipschitz with constant . As the maximum of Lipschitz functions is also still Lipschitz, we get similarly to case that for some .
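The Gershgorin step can be checked numerically. The sketch below builds a random symmetric banded matrix (the dimension, bandwidth, and entry distribution are arbitrary illustrative choices) and compares its operator norm with the maximal absolute row sum and with the cruder bound given by the number of non-zero entries per column times the largest absolute entry.

```python
import numpy as np

def max_abs_row_sum(A):
    # For a symmetric matrix, Gershgorin's theorem places every eigenvalue
    # within this distance of zero, so it bounds the operator norm.
    return np.max(np.sum(np.abs(A), axis=1))

rng = np.random.default_rng(1)
d, b = 30, 2  # dimension and bandwidth (illustrative values)
A = np.zeros((d, d))
for i in range(d):
    for j in range(i, min(i + b + 1, d)):
        A[i, j] = A[j, i] = rng.uniform(-1.0, 1.0)

k = int(np.max(np.sum(A != 0, axis=0)))  # max non-zero entries in a column
op_norm = np.linalg.norm(A, 2)           # largest singular value
row_bound = max_abs_row_sum(A)
entry_bound = k * np.max(np.abs(A))

print(op_norm, row_bound, entry_bound)
```

The three printed numbers are non-decreasing from left to right, which is the chain of inequalities used in this part of the proof.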
For the second term in Equation B.1, we show that the probability of any sparser matrix than existing in the confidence ball goes to zero. Let . Then, there exists a pair of indices such that .
[TABLE]
We have that if then . Hence, Meanwhile, . Thus, as as long as . ∎
Appendix C Estimation with the Empirical Diagonal
In this section, we demonstrate that the distance in operator norm is an insufficient metric to use for the comparison of estimators for large sparse covariance matrices in the non-asymptotic setting. The operator norm’s usage in past research [Bickel and Levina, 2008a, b, El Karoui, 2008, Rothman et al., 2009] stems from the result that “convergence in operator norm implies convergence of the eigenvalues and eigenvectors.” However, this does not imply strong performance for finite samples. We demonstrate this by showing that the naive empirical diagonal covariance matrix—that is, the estimator with if and otherwise—performs better in operator norm for finite samples.
The simulation study from Rothman et al. [2009] was reproduced where four threshold estimators—hard, soft, SCAD, and adaptive LASSO—were applied to estimating the covariance matrix for a sample of random normal vectors in dimensions for three different models. We consider models 1 and 2, which respectively are autoregressive covariance matrices with entries and moving average covariance matrices with entries . In both cases, we set . The simulations were replicated 100 times and averaged. The results are displayed in Table 5 for multivariate Gaussian data and in Table 6 for multivariate Laplace data.
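The comparison against the empirical diagonal can be sketched as follows. The correlation parameter `rho = 0.3`, the sample size, and the dimension are placeholder assumptions for illustration, and the helper names are hypothetical; the AR(1) and MA(1) forms are the standard ones for such models.

```python
import numpy as np

def ar1_cov(d, rho):
    # Autoregressive model: entry (i, j) equals rho ** |i - j|.
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ma1_cov(d, rho):
    # Moving average model: 1 on the diagonal, rho on the first
    # off-diagonals, and zero elsewhere.
    idx = np.arange(d)
    lag = np.abs(idx[:, None] - idx[None, :])
    return np.where(lag == 0, 1.0, np.where(lag == 1, rho, 0.0))

def empirical_diagonal(X):
    # Keep only the sample variances and set every covariance to zero.
    return np.diag(np.var(X, axis=0, ddof=1))

def operator_distance(A, B):
    return np.linalg.norm(A - B, 2)

rng = np.random.default_rng(0)
n, d, rho = 100, 200, 0.3  # assumed values for illustration
Sigma = ar1_cov(d, rho)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
print("empirical covariance:", operator_distance(np.cov(X, rowvar=False), Sigma))
print("empirical diagonal:  ", operator_distance(empirical_diagonal(X), Sigma))
```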
For multivariate Gaussian data, we see that SCAD thresholding gives superior performance in operator norm distance until where it gives comparable performance to the empirical diagonal matrix. At , the empirical diagonal now gives the best performance. In the case of multivariate Laplace data, the empirical diagonal outperforms all of the thresholding methods in all of the dimensions considered with respect to operator norm distance. It is worth noting that theoretical results for these threshold estimators were only demonstrated for sub-Gaussian data.
We understand that the performance of the threshold estimators improves asymptotically with increasing whereas the empirical diagonal will perform worse in the limit. The main point to make is that for fixed finite samples, as generally occur in practice, it is unwise to claim an estimator’s superiority based solely on the operator norm distance. Hence, we argue instead for support recovery of the true covariance matrix as the critical problem to solve in the context of high dimensional sparse covariance estimation.
Appendix D Derivations of Lipschitz constants
The following lemmas and propositions establish that specific functions used in the construction of confidence sets are, in fact, Lipschitz functions.
Lemma D.1.
Let and be two real valued symmetric non-negative definite matrices. Then,
[TABLE]
where is the trace class norm.
Proof.
By definition, . If is symmetric and non-negative definite, then . Hence, if and are symmetric and positive definite, then so is . Therefore,
[TABLE]
∎
Proposition D.2 (Lipschitz for ).
Assume that and that for . The function defined as
[TABLE]
is Lipschitz with constant with respect to the metric
Proof.
Let with for all and denote and . Making use of Lemma D.1, we have
[TABLE]
∎
The next two lemmas are used to prove the Lipschitz constant for the -Schatten norms with and , respectively. The first lemma is reminiscent of the Cauchy-Schwarz inequality in the setting of the -Schatten norm.
Lemma D.3.
Let . Then, for the Frobenius norm,
[TABLE]
Proof.
For any matrix , we have that . Hence, starting from the left hand side of the desired inequality and applying the Cauchy-Schwarz inequality gives us
[TABLE]
∎
Lemma D.4.
Let . Then, for the operator norm,
[TABLE]
Proof.
Using the definition of the operator norm and the Cauchy-Schwarz inequality, we have that
[TABLE]
∎
Proposition D.5 (Lipschitz for or ).
Assume that and that for . Let . The function defined as
[TABLE]
is Lipschitz with constant with respect to the metric
Proof.
To establish that is Lipschitz with the desired constant, we proceed by bounding the Gâteaux derivative. Let . For and any such that and ,
[TABLE]
where we used the facts that, for , , that
[TABLE]
and that
[TABLE]
Applying Lemma D.3 in the case and Lemma D.4 in the case shows that for all with . With application of the Mean Value Theorem, we have the desired Lipschitz constant.
In the case that , we also achieve the same Lipschitz constant. Indeed, as is positive semi-definite, the norm can only be zero if all . Hence, for any ,
[TABLE]
∎
It is conjectured that the function is 1-Lipschitz for all , which follows immediately if Lemmas D.3 and D.4 can be expanded to similar results for all .
Appendix E Concentration Results
The following is a brief expository section detailing results used and the associated references for the various concentration of measure tools used throughout this work. More details on these topics can be found in Ledoux [2001], Boucheron et al. [2013], Giné and Nickl [2016].
E.1 Concentration results for log concave measures
Gaussian concentration for log concave measures is established via the following theorems. In short, Theorem E.2 states that log concave measures satisfy a logarithmic Sobolev inequality, which bounds the entropy of the measure; see Definition E.1. Logarithmic Sobolev inequalities were first introduced in Gross [1975], and this result is due to Bakry and Émery [1984]. Following that, Theorem E.3 links the logarithmic Sobolev inequality with Gaussian concentration. Finally, Theorem E.4 extends this to product measures whose individual components satisfy logarithmic Sobolev inequalities in a dimension-free way due to the subadditivity of the entropy.
Definition E.1 (Entropy).
For a probability measure on a measurable space and for any non-negative measurable function on , the entropy is
[TABLE]
Theorem E.2 (Ledoux [2001], Theorem 5.2).
Let be strongly log-concave on for some . Then, satisfies the logarithmic Sobolev inequality. That is, for all smooth ,
[TABLE]
Theorem E.3 (Ledoux [2001], Theorem 5.3).
If is a probability measure on such that then has Gaussian concentration. That is, let be a random variable with law . Then, for all -Lipschitz functions and for all ,
[TABLE]
Theorem E.4 (Ledoux [2001], Corollary 5.7).
Let be random variables with measures , which are all strongly log-concave with coefficients . Let be the product measure on . Then,
[TABLE]
Combining Theorems E.4 and E.3 immediately gives the following corollary.
Corollary E.5.
Let have measures , which are all strongly log-concave with coefficients . Let be the product measure on . Then, for any -Lipschitz and for any ,
[TABLE]
E.2 Concentration results for sub-exponential measures
If the log Sobolev inequality from above is replaced with the weaker spectral gap or Poincaré inequality, then we have the sub-exponential measures.
Theorem E.6 (Ledoux [2001], Corollary 5.15).
Let , a random variable on with measure , satisfy the Poincaré inequality
[TABLE]
for some and for all locally Lipschitz functions . Then, for iid copies of and for some Lipschitz function ,
[TABLE]
where is a constant depending only on and
[TABLE]
E.3 Concentration results for bounded random variables
The following results can be found in more depth in Giné and Nickl [2016] Section 3.3.4 and specifically in Example 3.3.13 (a). Theorem E.8 below is effectively a more general version of Hoeffding’s Inequality. To establish it, we begin with the definition of functions of bounded differences.
Definition E.7 (Functions of Bounded Differences).
A function is of bounded differences if
[TABLE]
Then, Gaussian concentration can be established for functions of bounded differences by the following theorem.
Theorem E.8.
Let and where has bounded differences with . Then, for all ,
[TABLE]
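As a sanity check of this kind of bound, the following sketch compares a simulated tail probability with the standard two-sided bounded differences (McDiarmid) bound, 2 exp(-2 t^2 / sum of c_i^2). The exact constants in the theorem above may differ from this textbook form, so it is used here as an assumption. The example takes the sample mean of independent uniform variables on [0, 1], for which each bounded difference constant is 1/n.

```python
import numpy as np

def mcdiarmid_bound(t, c):
    # Standard two-sided bounded differences bound (assumed form).
    c = np.asarray(c, dtype=float)
    return min(1.0, 2.0 * np.exp(-2.0 * t ** 2 / np.sum(c ** 2)))

rng = np.random.default_rng(2)
n, reps, t = 50, 20000, 0.1  # illustrative values

# Z is the sample mean of n Uniform[0, 1] variables; changing one
# coordinate moves Z by at most 1 / n, and E[Z] = 0.5.
means = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)
empirical_tail = np.mean(np.abs(means - 0.5) >= t)
bound = mcdiarmid_bound(t, np.full(n, 1.0 / n))

print(empirical_tail, bound)
```

The simulated tail probability sits well below the bound, as expected, since the bound uses only the range of the variables and ignores their variance.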
References
- Bakry and Émery [1984] Dominique Bakry and Michel Émery. Hypercontractivité de semi-groupes de diffusion. Comptes rendus des séances de l’Académie des sciences. Série 1, Mathématique, 299(15):775–778, 1984.
- Bickel and Levina [2008a] Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577–2604, 2008a.
- Bickel and Levina [2008b] Peter J Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, pages 199–227, 2008b.
- Bien and Tibshirani [2012] Jacob Bien and Rob Tibshirani. spcov: Sparse Estimation of a Covariance Matrix, 2012. URL https://CRAN.R-project.org/package=spcov. R package version 1.01.
- Bien and Tibshirani [2011] Jacob Bien and Robert J Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820, 2011.
- Bobkov and Ledoux [1997] Sergey Bobkov and Michel Ledoux. Poincaré’s inequalities and Talagrand’s concentration phenomenon for the exponential distribution. Probability Theory and Related Fields, 107(3):383–400, 1997.
- Boucheron et al. [2013] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
- Cai and Liu [2011] Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011.
