Nonasymptotic estimation and support recovery for high dimensional sparse covariance matrices

Adam B Kashlak; Linglong Kong

arXiv:1705.02679·stat.ME·December 17, 2020

TL;DR

This paper introduces a flexible, nonasymptotic framework for estimating high-dimensional sparse covariance matrices using concentration inequalities, improving support recovery and outperforming existing methods in simulations.

Contribution

It develops a general, distribution-agnostic approach for covariance estimation with confidence sets, extending thresholding techniques and optimizing support recovery.

Findings

01 Superior performance in simulations compared to existing methods

02 Effective support recovery with controlled false positive rate

03 Applicable to a wide range of estimators and distributional assumptions

Abstract

We propose a general framework for nonasymptotic covariance matrix estimation making use of concentration inequality-based confidence sets. We specify this framework for the estimation of large sparse covariance matrices through incorporation of past thresholding estimators with key emphasis on support recovery. This technique goes beyond past results for thresholding estimators by allowing for a wide range of distributional assumptions beyond merely sub-Gaussian tails. This methodology can furthermore be adapted to a wide range of other estimators and settings. The usage of nonasymptotic dimension-free confidence sets yields good theoretical performance. Through extensive simulations, it is demonstrated to have superior performance when compared with other such methods. In the context of support recovery, we are able to specify a false positive rate and optimize to maximize the true…

Tables

Table 1: Percentage of false and true positives for multivariate Gaussian data and $\Sigma$ tri-diagonal with diagonal entries 1 and off-diagonal entries 0.3.

	False Positive %				True Positive %
Dimension	50	100	200	500	50	100	200	500
CoM 1%	0.0	0.1	0.3	1.0	0.0	7.7	20.7	32.0
CoM 5%	1.0	2.2	3.5	4.7	33.1	42.9	51.5	56.0
PDS	3.4	3.4	3.4	3.4	50.0	50.0	51.5	50.6
Hard	0.0	0.0	0.0	0.0	0.3	0.0	0.0	0.0
Soft	2.0	0.7	0.2	0.0	38.5	25.4	16.2	7.5
SCAD	2.1	0.7	0.3	0.0	39.0	26.0	16.4	7.5
Adpt	0.3	0.1	0.0	0.0	17.4	10.0	5.8	2.0

Table 2: Percentage of false and true positives for multivariate Laplace data and $\Sigma$ tri-diagonal with diagonal entries 1 and off-diagonal entries 0.3.

	False Positive %				True Positive %
Dimension	50	100	200	500	50	100	200	500
CoM 1%	0.2	0.4	0.7	1.1	4.5	9.2	13.0	17.2
CoM 5%	2.2	3.3	4.1	4.7	22.8	29.3	32.1	34.1
PDS	12.4	12.7	12.2	12.2	51.0	51.5	51.0	51.2
Hard	0.1	0.0	0.0	0.0	0.0	0.0	0.0	0.0
Soft	1.2	0.4	0.2	0.0	11.3	1.8	0.0	0.0
SCAD	0.8	0.3	0.2	0.0	8.6	0.0	0.0	0.0
Adpt	0.2	0.1	0.1	0.0	0.0	0.0	0.0	0.0

Table 3: The percentages of non-zero off-diagonal entries in the six covariance estimates partitioned into two parts: the informative $40\times 40$ block of the highest scoring genes; the uninformative remaining matrix entries.

non-zero (%)	CoM 10%	CoM 5%	CoM 1%	PDS
Informative	30.3%	25.6%	8.5%	47.3%
Uninformative	5.4%	2.7%	0.4%	15.6%
	Hard	Soft	SCAD	Adpt
Informative	6.0%	24.7%	21.3%	9.9%
Uninformative	0.3%	2.3%	1.8%	0.7%

Table 4: Percentage of false and true positives for multivariate Rademacher data and $\Sigma$ tri-diagonal with diagonal entries 1 and off-diagonal entries 0.3.

	False Positive %					True Positive %
Dimension	50	100	200	500		50	100	200	500
CoM 1%	0.0	0.1	0.3	0.9		0.0	7.2	17.5	30.7
CoM 5%	0.9	1.9	3.2	4.4		28.9	41.1	49.0	54.5
PDS	3.4	3.4	3.4	3.4		50.2	50.3	50.8	50.6
Hard	0.0	0.0	0.0	0.0		0.1	0.0	0.0	0.0
Soft	2.4	0.8	0.2	0.0		44.2	29.3	18.1	9.5
SCAD	1.8	1.0	0.5	0.1		40.3	33.3	24.1	13.1
Adpt	0.2	0.1	0.1	0.0		16.5	9.4	5.5	2.5

Table 5: Distances from six different covariance estimators to truth in operator norm for sample size $n=100$, dimensions $d=30,100,200,500$, and observations drawn from a multivariate Gaussian distribution. Standard deviations computed over the 100 replications are in brackets.

	MA Matrix
$d$	Empirical	Diagonal	Hard	Soft	SCAD	LASSO
30	1.32 (0.14)	0.72 (0.04)	0.72 (0.09)	0.70 (0.06)	0.63 (0.07)	0.64 (0.07)
100	3.03 (0.19)	0.77 (0.04)	0.87 (0.10)	0.85 (0.04)	0.73 (0.06)	0.77 (0.06)
200	4.92 (0.21)	0.79 (0.04)	0.95 (0.11)	0.91 (0.03)	0.79 (0.06)	0.82 (0.05)
500	9.73 (0.25)	0.83 (0.05)	1.06 (0.11)	0.98 (0.02)	0.88 (0.06)	0.88 (0.05)
	AR Matrix
$d$	Empirical	Diagonal	Hard	Soft	SCAD	LASSO
30	1.33 (0.16)	0.90 (0.04)	0.79 (0.10)	0.83 (0.07)	0.74 (0.09)	0.76 (0.09)
100	3.06 (0.21)	0.94 (0.03)	0.95 (0.08)	1.02 (0.04)	0.86 (0.05)	0.92 (0.05)
200	4.99 (0.21)	0.95 (0.02)	1.00 (0.09)	1.09 (0.03)	0.92 (0.04)	0.97 (0.03)
500	9.80 (0.26)	0.97 (0.02)	1.04 (0.08)	1.16 (0.02)	0.99 (0.04)	1.04 (0.03)

Table 6: Distances from six different covariance estimators to truth in operator norm for sample size $n=100$, dimensions $d=30,100,200,500$, and observations drawn from a multivariate Laplace distribution. Standard deviations computed over the 100 replications are in brackets.

	MA Matrix
$d$	Empirical	Diagonal	Hard	Soft	SCAD	LASSO
30	2.23 (0.55)	0.86 (0.12)	0.94 (0.23)	0.93 (0.11)	0.91 (0.22)	0.90 (0.16)
100	6.18 (1.50)	0.93 (0.13)	1.17 (0.33)	1.17 (0.25)	1.31 (0.35)	1.18 (0.29)
200	11.41 (2.67)	0.98 (0.13)	1.41 (0.39)	1.33 (0.32)	1.56 (0.43)	1.36 (0.36)
500	26.30 (5.68)	1.07 (0.17)	2.01 (0.81)	1.82 (0.61)	2.24 (0.73)	1.89 (0.64)
	AR Matrix
$d$	Empirical	Diagonal	Hard	Soft	SCAD	LASSO
30	2.34 (0.51)	0.96 (0.09)	1.03 (0.20)	1.05 (0.09)	1.00 (0.20)	1.02 (0.15)
100	6.22 (1.48)	1.05 (0.09)	1.27 (0.26)	1.28 (0.17)	1.34 (0.30)	1.25 (0.19)
200	11.44 (2.38)	1.05 (0.12)	1.44 (0.40)	1.42 (0.25)	1.59 (0.38)	1.40 (0.31)
500	26.69 (5.10)	1.09 (0.10)	1.95 (0.77)	1.82 (0.65)	2.19 (0.77)	1.82 (0.65)

Equations

\frac{\#\{\lvert\hat{\sigma}_{i,j}\rvert>\lambda \mid i<j\}}{d(d-1)/2}\leq\eta.

(\hat{\Sigma}^{\mathrm{emp}}_{\lambda})_{i,j}=\left\{\begin{array}{ll}\hat{\sigma}_{i,j}&i=j\text{ or }\lvert\hat{\sigma}_{i,j}\rvert\geq\lambda\\ 0&\text{otherwise}\end{array}\right.

B_{\rho}=\left\{\Pi\in\mathbb{R}^{d\times d}:\lVert\Pi-\hat{\Sigma}^{\mathrm{emp}}\rVert_{\infty}\leq r\right\}.

\lvert s_{\lambda}(z)\rvert\leq\lvert z\rvert,\quad s_{\lambda}(z)=0\text{ for }\lvert z\rvert\leq\lambda,\quad\text{and}\quad\lvert s_{\lambda}(z)-z\rvert\leq\lambda,

\lVert\hat{\Sigma}^{\mathrm{sp}}_{\lambda_{1}}-\hat{\Sigma}^{\mathrm{emp}}\rVert_{p,q}\geq\lVert\hat{\Sigma}^{\mathrm{sp}}_{\lambda_{2}}-\hat{\Sigma}^{\mathrm{emp}}\rVert_{p,q}.

U(\kappa,\delta)=\left\{\Sigma\in\mathbb{R}^{d\times d}:\max_{i=1,\ldots,d}\sum_{j=1}^{d}\bm{1}\!\left[\sigma_{i,j}\neq 0\right]\leq\kappa,\ \text{if }\sigma_{i,j}\neq 0\text{ then }\lvert\sigma_{i,j}\rvert\geq\delta>0\right\}.

\rho(\tilde{\Sigma})=\frac{\#\{\tilde{\sigma}_{i,j}\neq 0\mid\sigma_{i,j}=0,\ i>j\}}{d(d-1)/2}

\lvert\hat{\eta}-\eta\rvert\leq Cd^{\nu-1}.

\eta\,\frac{\mathrm{E}\lVert\hat{\Sigma}^{\mathrm{emp}}_{\rho}-\hat{\Sigma}^{\mathrm{emp}}_{0}\rVert_{\infty}}{\mathrm{E}\lVert\hat{\Sigma}^{\mathrm{emp}}_{\eta}-\hat{\Sigma}^{\mathrm{emp}}_{0}\rVert_{\infty}}-\rho\leq K_{1}n\rho^{1/2}d^{-1/4}+K_{2}n\rho^{1/4}d^{-1/2}+o(nd^{-1/2})

\mathrm{E}\lVert A\circ E\rVert_{\infty}\leq K_{1}d^{1/2}\rho^{1/4}+K_{2}d^{3/4}\rho^{1/2}

\mathrm{P}\left(d(\Sigma,\hat{\Sigma}^{\mathrm{emp}})\geq\mathrm{E}d(\Sigma,\hat{\Sigma}^{\mathrm{emp}})+r\right)\leq e^{-\psi(r)},

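\begin{aligned}
&\mathrm{P}\left(d(\hat{\Sigma}^{\mathrm{sp}},\Sigma)\geq\mathrm{E}d(\hat{\Sigma}^{\mathrm{emp}},\Sigma)+2r_{\alpha}\right)\\
&\quad\leq\mathrm{P}\left(d(\hat{\Sigma}^{\mathrm{sp}},\hat{\Sigma}^{\mathrm{emp}})+d(\hat{\Sigma}^{\mathrm{emp}},\Sigma)\geq\mathrm{E}d(\hat{\Sigma}^{\mathrm{emp}},\Sigma)+2r_{\alpha}\right)\\
&\quad\leq\mathrm{P}\left(d(\hat{\Sigma}^{\mathrm{emp}},\Sigma)\geq\mathrm{E}d(\hat{\Sigma}^{\mathrm{emp}},\Sigma)+r_{\alpha}\right)\leq\exp(-\psi(r_{\alpha}))=\alpha.
\end{aligned}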

\mathrm{P}\left(\phi(X_{1},\ldots,X_{n})\geq\mathrm{E}\phi(X_{1},\ldots,X_{n})+r\right)\leq e^{-\min_{i}c_{i}r^{2}/2}.

\sup_{\hat{\Sigma}^{\mathrm{sp}}:\lVert\hat{\Sigma}^{\mathrm{sp}}-\hat{\Sigma}^{\mathrm{emp}}\rVert_{p}\leq r_{\alpha}}\mathrm{P}\left(\lVert\hat{\Sigma}^{\mathrm{sp}}-\Sigma\rVert_{p}\geq O\left(n^{-1/2}(1+n^{-1/4}-\log\alpha)^{2}\right)\right)\leq\alpha.

\lim_{n\to\infty}\mathrm{P}\left(\mathrm{supp}(\hat{\Sigma}^{\mathrm{sp}})\neq\mathrm{supp}(\Sigma)\right)=0

\hat{\Sigma}^{\mathrm{PDS}}=\operatorname*{arg\,min}_{\Sigma\geq 0}\left\{\lVert\Sigma-\hat{\Sigma}^{\mathrm{emp}}\rVert_{2}-\tau\log\det(\Sigma)+\lambda\lVert\Sigma\rVert_{\ell^{1}}\right\}

\hat{\Sigma}^{\mathrm{MMA}}=\operatorname*{arg\,min}_{\Sigma\geq 0}\left\{\mathrm{tr}(\hat{\Sigma}^{\mathrm{emp}}\Sigma^{-1})-\log\det(\Sigma^{-1})+\lambda\lVert\Sigma\rVert_{\ell^{1}}\right\}

\hat{\Sigma}^{\mathrm{Hard}}_{\lambda}

\hat{\Sigma}^{\mathrm{SCAD}}_{\lambda}

\hat{\Sigma}^{\mathrm{Soft}}_{\lambda}

\hat{\Sigma}^{\mathrm{Adpt}}_{\lambda}

\mathcal{C}_{1-\alpha}=\left\{\Sigma:\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\rVert_{p}^{1/2}\leq\mathrm{E}\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\rVert_{p}^{1/2}+(-2c_{0}/n)\log\alpha\right\}

F=\frac{\frac{1}{k-1}\sum_{m=1}^{k}n_{m}(\bar{x}_{m}-\bar{x})^{2}}{\frac{1}{n-k}\sum_{m=1}^{k}(n_{m}-1)\hat{\sigma}_{m}^{2}}

\mathrm{Var}(f(X))\leq C\int\lvert\nabla f\rvert^{2}\,d\mu

\mathrm{P}\left(\phi(X_{1},\ldots,X_{n})\geq\mathrm{E}\phi(X_{1},\ldots,X_{n})+r\right)\leq\exp\left(-\frac{1}{K}\min\left\{\frac{r}{b},\frac{r^{2}}{a^{2}}\right\}\right)

a^{2}\geq\sum_{i=1}^{n}\lvert\nabla_{i}\phi\rvert^{2},\quad b\geq\max_{i=1,\ldots,n}\lvert\nabla_{i}\phi\rvert.

\phi(X_{1},\ldots,X_{n})=\left\lVert\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\mathrm{E}X){(X_{i}-\mathrm{E}X)}^{\mathrm{T}}\right\rVert_{p}^{1/2},

\mathrm{P}\left(\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\rVert_{p}\geq\mathrm{E}\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\rVert_{p}+r\right)\leq e^{-2nr^{2}/U^{2}}.

\lvert k_{0}-\hat{k}\rvert\leq\kappa d.

\lvert\eta-\hat{\eta}\rvert\leq\frac{\kappa d}{N_{0}}=\frac{2\kappa}{d-1-2\kappa}=O(d^{\nu-1}).

\mathrm{E}\lVert A\circ E\rVert_{\infty}

Full text

Nonasymptotic Estimation and Support Recovery for High Dimensional Sparse Covariance Matrices

Adam B Kashlak

Linglong Kong

Department of Mathematical and Statistical Sciences

University of Alberta

Edmonton, AB, Canada T6G 2G1

Abstract

We propose a general framework for nonasymptotic covariance matrix estimation making use of concentration inequality-based confidence sets. We specify this framework for the estimation of large sparse covariance matrices through incorporation of past thresholding estimators with key emphasis on support recovery. This technique goes beyond past results for thresholding estimators as we have distribution-free control over the false positive rate, which measures the entries incorrectly included in the estimator’s support. In the context of support recovery, we are able to specify a false positive rate and optimize to maximize the true recoveries. This methodology guarantees exact support recovery in the case of strongly log concave data and maintains good performance in more general distributional settings. The usage of nonasymptotic dimension-free confidence sets yields good theoretical performance. Through extensive simulations, it is demonstrated to have superior performance when compared with other such methods.

Key words and phrases: Concentration Inequality; Confidence Region; Log Concave Measure; Random Matrix; Schatten Norm; Sub-Exponential Measure

1 Introduction

Covariance matrices and accurate estimators of such objects are of critical importance in statistics. Various standard techniques including principal components analysis and linear and quadratic discriminant analysis rely on an accurate estimate of the covariance structure of the data. Applications can range from genetics and medical imaging data to climate and other types of data. Furthermore, in the era of high dimensional data, classical asymptotic estimators perform poorly in applications (Stein, 1975; Johnstone, 2001). Hence, we propose a general methodology for nonasymptotic covariance matrix estimation making use of confidence balls constructed from concentration inequalities. While this is a general framework with many potential applications, we specifically consider the use of thresholding estimators for sparse covariance matrices with a view towards support recovery—that is, determining which variable pairs are correlated.

Many estimators for the covariance matrix have been proposed working under the assumption of sparsity (Pourahmadi, 2011), which is, in a qualitative sense, the case when most of the off-diagonal entries are zero or negligible. Beyond mere theoretical interest, the assumption of sparsity is widely applicable to real data analysis as the practitioner may believe that many of the variable pairings will be uncorrelated. Thus, it is desirable to tailor covariance estimation procedures given this assumption of sparsity.

Sparsity in the simplest sense implies some bound on the number of non-zero entries in the columns of a covariance matrix. Thus, given a $\Sigma\in\mathbb{R}^{d\times d}$ with entries $\sigma_{i,j}$ for $i,j=1,\ldots,d$ , there exists some constant $k>0$ such that $\max_{j=1,\ldots,d}\sum_{i=1}^{d}\bm{1}\!\left[\sigma_{i,j}\neq 0\right]\leq k$ . This can be generalized to “approximate sparsity” as in Rothman et al. (2009) by $\max_{j=1,\ldots,d}\sum_{i=1}^{d}\lvert\sigma_{i,j}\rvert^{q}\leq k$ for some $q\in[0,1)$ . Furthermore, Cai and Liu (2011) define a broader approximately sparse class by bounding weighted column sums of $\Sigma$ . In El Karoui (2008), a similar notion referred to as “ $\beta$ -sparsity” is defined. Such classes of sparse covariance matrices allow for good theoretical performance of estimators.
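The two column-wise sparsity measures above can be computed directly. The following is a minimal sketch, assuming NumPy and two hypothetical helper names (`max_column_support`, `max_column_q_sum`) that do not come from the article; the test matrix is the tri-diagonal covariance used in the simulation tables.

```python
import numpy as np

def max_column_support(sigma: np.ndarray) -> int:
    """Largest number of non-zero entries in any column of sigma (exact sparsity)."""
    return int(np.max(np.count_nonzero(sigma, axis=0)))

def max_column_q_sum(sigma: np.ndarray, q: float) -> float:
    """Largest column sum of |sigma_ij|^q for q in (0, 1) (approximate sparsity).

    The case q = 0 is handled by max_column_support, since 0 ** 0 evaluates to 1.
    """
    return float(np.max(np.sum(np.abs(sigma) ** q, axis=0)))

# Tri-diagonal example: ones on the diagonal, 0.3 on the first off-diagonals.
d = 6
sigma = np.eye(d) + 0.3 * (np.eye(d, k=1) + np.eye(d, k=-1))

k_exact = max_column_support(sigma)      # interior columns hold three non-zeros
k_approx = max_column_q_sum(sigma, 0.5)  # 1 + 2 * sqrt(0.3) for interior columns
print(k_exact, k_approx)
```

Any constant $k$ at least as large as the returned value certifies membership in the corresponding sparsity class.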

One class consists of shrinkage estimators that follow a James-Stein approach by shrinking estimated eigenvalues, eigenvectors, or the matrix itself towards some desired target (Haff, 1980; Dey and Srinivasan, 1985; Daniels and Kass, 1999, 2001; Ledoit and Wolf, 2004; Hoff, 2009; Johnstone and Lu, 2012). Another class consists of estimators that regularize the estimate with lasso-style penalties (Rothman, 2012; Bien and Tibshirani, 2011). Yet another class consists of thresholding estimators, which declare the covariance between two variables to be zero if the estimated value is smaller than some threshold (Bickel and Levina, 2008a, b; Rothman et al., 2009; Cai and Liu, 2011). Beyond these, there are other methods such as banding and tapering, which apply only when the variables are ordered or a notion of proximity exists, as with spatial, time series, or longitudinal data. As we will not assume such an ordering and strive to construct a methodology that is permutation invariant with respect to the variables, these approaches will not be considered. Lastly, there has also been substantial work on the estimation of the precision or inverse covariance matrix. While our approach could plausibly be adapted to this setting, it will not be considered in this article and is reserved for future research.

In this article, we propose a novel approach to the estimation of sparse covariance matrices making use of concentration inequality-based confidence sets such as those constructed in Kashlak et al. (2018) for the functional data setting. In short, consider a sample of real vector valued data $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ with mean zero and unknown covariance matrix $\Sigma$. Concentration inequalities are used to construct a non-asymptotic confidence set for $\Sigma$ about the empirical estimate of the unknown covariance matrix, $\hat{\Sigma}^{\mathrm{emp}}=n^{-1}\sum_{i=1}^{n}(X_{i}-\bar{X}){(X_{i}-\bar{X})}^{\mathrm{T}}$ where $\bar{X}=n^{-1}\sum_{i=1}^{n}X_{i}$ is the sample mean. While it has been noted (see, for example, Cai and Liu (2011)) that $\hat{\Sigma}^{\mathrm{emp}}$ may be a poor estimator when the dimension $d$ is large and $\Sigma$ is sparse, the confidence set is still valid given a desired coverage of $(1-\alpha)$. To construct a better estimator, we propose to search this confidence set for an estimator $\hat{\Sigma}^{\mathrm{sp}}$ which optimizes some sparsity criterion to be concretely defined later. This estimation method adapts to the uncertainty of $\hat{\Sigma}^{\mathrm{emp}}$ in the high dimensional setting, $d\gg n$, by widening the confidence set and thus allowing our sparse estimator to lie far away from the empirical estimate. Furthermore, given some distributional assumptions, the concentration inequalities provide us with non-asymptotic dimension-free confidence sets allowing for very desirable convergence results.

Many established methods for sparse estimation make use of a regularization or penalization term incorporated to enforce sparsity (Rothman, 2012; Bien and Tibshirani, 2011). In some sense, our proposed method can be considered to be in this class of estimators. However, we do not enforce sparsity via some lasso-style penalization term, but enforce it by

i. choosing a desired false positive rate, $0<\rho\ll 1$, for the support recovery,

ii. using that rate to construct a $(1-\alpha)$ confidence ball about the empirical estimator, and

iii. searching that ball for a sparse estimator.

The larger our $(1-\alpha)$ -confidence set is, the sparser our estimator is allowed to be. Thus, the radius of our confidence balls acts like a regularization parameter allowing for greater sparsity as it increases. A major contribution of this work is developing a method with the ability to avoid costly cross-validation of the tuning parameter and maintain strong finite sample performance. The specific focus as discussed below and in the supplementary material is accurate support recovery, which is the identification of the non-zero entries in the covariance matrix. Our methodology allows for fixing a false positive rate—percentage of zero entries incorrectly said to be non-zero—and optimizing over the true positive rate—percentage of correctly identified non-zero entries. Furthermore, our estimation technique implements a binary search procedure resulting in a highly efficient algorithm especially when compared to the more laborious optimization required by lasso penalization.

In Section 2, the general estimation procedure is outlined, and it is specified for tuning threshold estimators with concentration methods. Section 3 discusses our approach to fixing a certain false positive rate when attempting to recover the support of the covariance matrix. In Section 4, three different types of concentration are considered for specifically log concave measures, sub-exponential distributions, and bounded random variables. Lastly, Section 5 details comprehensive simulations comparing our concentration approach to sparse estimation to standard techniques such as thresholding and penalization. Beyond simulation experiments, a real data set of gene expressions for small round blue cell tumours from the study of Khan et al. (2001) is considered.

1.1 Notation and Definitions

We will make use of both a $(1-\alpha)$ -confidence set and a false positive rate $\rho$ . For the former, we have the usual definition that some data dependent set $\mathcal{C}_{1-\alpha}$ is a $(1-\alpha)$ -confidence set for $\Sigma$ if $\mathrm{P}\left(\Sigma\notin\mathcal{C}_{1-\alpha}\right)\leq\alpha.$ For an estimator of $\Sigma$ in $\mathbb{R}^{d\times d}$ , we have to decide which of the $d(d-1)/2$ off-diagonal entries are non-zero. The false positive rate $\rho$ is the probability that we incorrectly decide that a given entry is non-zero.
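When the true covariance matrix is known, as in a simulation, the false and true positive rates of a recovered support can be tallied over the $d(d-1)/2$ off-diagonal pairs. The sketch below is illustrative only: the function name `support_rates` is hypothetical, and it assumes the false positive rate is normalized by the number of truly zero pairs (normalizing by $d(d-1)/2$ instead is a one-line change).

```python
import numpy as np

def support_rates(sigma_hat: np.ndarray, sigma_true: np.ndarray):
    """Return (false positive rate, true positive rate) over the pairs i < j."""
    d = sigma_true.shape[0]
    upper = np.triu_indices(d, k=1)           # indices of the d(d-1)/2 pairs
    est_nonzero = sigma_hat[upper] != 0
    true_nonzero = sigma_true[upper] != 0
    false_pos = np.sum(est_nonzero & ~true_nonzero)
    true_pos = np.sum(est_nonzero & true_nonzero)
    # Assumption: rates are relative to the truly zero / truly non-zero pairs.
    fpr = false_pos / max(int(np.sum(~true_nonzero)), 1)
    tpr = true_pos / max(int(np.sum(true_nonzero)), 1)
    return float(fpr), float(tpr)

# Truth: tri-diagonal with 0.3 off the diagonal, so 3 non-zero and 3 zero pairs.
d = 4
sigma_true = np.eye(d) + 0.3 * (np.eye(d, k=1) + np.eye(d, k=-1))

# A made-up estimate that finds two true pairs and one spurious pair.
sigma_hat = np.eye(d)
for i, j, v in [(0, 1, 0.25), (1, 2, 0.31), (0, 3, 0.10)]:
    sigma_hat[i, j] = sigma_hat[j, i] = v

fpr, tpr = support_rates(sigma_hat, sigma_true)
print(fpr, tpr)
```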

When defining a Banach space of matrices, there are many matrix norms that can be considered. In this article, the main norms of interest are the $p$-Schatten norms, which will be denoted $\lVert\cdot\rVert_{p}$ and are defined as follows.

Definition 1.1 ( $p$ -Schatten Norm).

For an arbitrary matrix $\Sigma\in\mathbb{R}^{k\times l}$ and $p\in[1,\infty)$ , the $p$ -Schatten norm is $\left\lVert\Sigma\right\rVert_{p}=\mathrm{tr}\left(({\Sigma}^{\mathrm{T}}\Sigma)^{p/2}\right)^{1/p}=\lVert{\bm{\nu}}\rVert_{\ell^{p}}=\left(\sum_{i=1}^{\min\{k,l\}}\nu_{i}^{p}\right)^{1/p}$ where ${\bm{\nu}}=(\nu_{1},\ldots,\nu_{\min\{k,l\}})$ is the vector of singular values of $\Sigma$ and where $\lVert\cdot\rVert_{\ell^{p}}$ is the standard $\ell^{p}$ norm in $\mathbb{R}^{d}$ . In the covariance matrix case where $\Sigma\in\mathbb{R}^{d\times d}$ is symmetric and positive semi-definite, $\left\lVert\Sigma\right\rVert_{p}=\mathrm{tr}\left(\Sigma^{p}\right)^{1/p}=\lVert{\bm{\lambda}}\rVert_{\ell^{p}}$ where ${\bm{\lambda}}$ is the vector of eigenvalues of $\Sigma$ . The $1$ -Schatten norm is referred to as the trace norm and the $2$ -Schatten norm as the Hilbert-Schmidt or Frobenius norm.

For $p=\infty$ , we have the usual operator norm for $\Sigma:\mathbb{R}^{l}\rightarrow\mathbb{R}^{k}$ with respect to the $\ell^{2}$ norm, $\left\lVert\Sigma\right\rVert_{\infty}=\sup_{\lVert u\rVert_{\ell^{2}}=1}\lVert\Sigma u\rVert_{\ell^{2}}=\lVert{\bm{\nu}}\rVert_{\ell^{\infty}}=\max_{i=1,\ldots,\min\{k,l\}}\lvert\nu_{i}\rvert,$ which is similarly the maximal eigenvalue when $\Sigma$ is symmetric positive semi-definite.
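As a numerical illustration of Definition 1.1, the $p$-Schatten norm can be computed as the $\ell^{p}$ norm of the singular values. This is a minimal sketch assuming NumPy; `schatten_norm` is a hypothetical helper name, not something defined in the article.

```python
import numpy as np

def schatten_norm(mat: np.ndarray, p: float) -> float:
    """p-Schatten norm: the l^p norm of the vector of singular values."""
    nu = np.linalg.svd(mat, compute_uv=False)   # singular values, all >= 0
    if np.isinf(p):
        return float(nu.max())                  # operator norm
    return float(np.sum(nu ** p) ** (1.0 / p))

a = np.diag([3.0, 4.0])
trace_norm = schatten_norm(a, 1)          # sum of singular values
frobenius = schatten_norm(a, 2)           # Hilbert-Schmidt / Frobenius norm
operator = schatten_norm(a, np.inf)       # largest singular value
print(trace_norm, frobenius, operator)
```

The three special cases agree with `np.linalg.norm(a, 'nuc')`, `np.linalg.norm(a, 'fro')`, and `np.linalg.norm(a, 2)` respectively.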

The definition of the $p$ -Schatten norm involves taking the square root of a symmetric matrix. In general, a matrix square root is only unique up to unitary transformations. However, for symmetric positive semi-definite matrices, we will only require the unique symmetric positive semi-definite square root defined as follows.

Definition 1.2 (Matrix Square Root).

Let $A\in\mathbb{R}^{d\times d}$ be a symmetric positive semi-definite matrix with eigen-decomposition $A=UD{U}^{\mathrm{T}}$ where $U$ is the orthonormal matrix of eigenvectors and $D$ is the diagonal matrix of eigenvalues, $(\lambda_{1},\ldots,\lambda_{d})$. Then, $A^{1/2}=UD^{1/2}{U}^{\mathrm{T}}$ where $D^{1/2}$ is the diagonal matrix with entries $(\lambda_{1}^{1/2},\ldots,\lambda_{d}^{1/2})$.
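Definition 1.2 translates almost line by line into code. The sketch below assumes NumPy and uses a hypothetical function name `psd_sqrt`; clipping tiny negative eigenvalues to zero is a numerical safeguard added here, not part of the definition.

```python
import numpy as np

def psd_sqrt(a: np.ndarray) -> np.ndarray:
    """Symmetric positive semi-definite square root U D^{1/2} U^T."""
    eigvals, eigvecs = np.linalg.eigh(a)      # eigen-decomposition of symmetric a
    eigvals = np.clip(eigvals, 0.0, None)     # guard against rounding below zero
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

a = np.array([[2.0, 1.0],
              [1.0, 2.0]])
root = psd_sqrt(a)
print(np.allclose(root @ root, a))            # the root squares back to a
```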

Another family of norms that will be used is the collection of entrywise matrix norms, denoted $\lVert\cdot\rVert_{p,q}$, which are written in terms of $\ell^{p}$ norms of the entries.

Definition 1.3 ( $(p,q)$ -Entrywise norm).

For an arbitrary matrix $\Sigma\in\mathbb{R}^{k\times l}$ with entries $\sigma_{i,j}$ and $p,q\in[1,\infty]$, the $(p,q)$-entrywise norm is $\lVert\Sigma\rVert_{p,q}=\left[\sum_{i=1}^{k}(\sum_{j=1}^{l}\lvert\sigma_{i,j}\rvert^{q})^{p/q}\right]^{1/p}$ with the usual modification in the case that $p=\infty$ and/or $q=\infty$. When $p=q$, these are the $\ell^{p}$ norms of a given matrix treated as a vector in $\mathbb{R}^{kl}$. Note that the 2-Schatten norm coincides with the $(2,2)$-entrywise norm.
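The nested structure of Definition 1.3 (an inner $\ell^{q}$ norm over each row, then an outer $\ell^{p}$ norm over the resulting vector) can be sketched as follows, assuming NumPy. The name `entrywise_norm` is hypothetical.

```python
import numpy as np

def entrywise_norm(mat: np.ndarray, p: float, q: float) -> float:
    """(p, q)-entrywise norm: l^p norm of the vector of row-wise l^q norms."""
    row_norms = np.linalg.norm(mat, ord=q, axis=1)   # inner sum over j
    return float(np.linalg.norm(row_norms, ord=p))   # outer sum over i

mat = np.array([[1.0, -2.0],
                [3.0, 4.0]])
norm_22 = entrywise_norm(mat, 2, 2)              # equals the Frobenius norm
norm_11 = entrywise_norm(mat, 1, 1)              # sum of absolute entries
norm_max = entrywise_norm(mat, np.inf, np.inf)   # largest absolute entry
print(norm_22, norm_11, norm_max)
```

Passing `np.inf` for either argument gives the usual supremum modification mentioned in the definition.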

1.2 Main contributions and connections to past work

The main contribution of this work is the construction of a general framework for tuning threshold estimators for support recovery and estimation of sparse covariance matrices. It offers finite sample guarantees and a much faster compute time than computationally expensive optimization and cross validation methods.

Past work on thresholding estimators for sparse covariance estimation began with solely considering Gaussian data and then extending to sub-Gaussian tails (Bickel and Levina, 2008a, b; Rothman et al., 2009). The more recent work of Cai and Liu (2011) also provides theoretical results for sub-Gaussian data as well as certain polynomial-type tails. However, only Gaussian data is considered in their numerical simulations. In this article, we consider strongly log-concave, heavier tailed sub-exponential, and bounded data. While bounded data is, in fact, sub-Gaussian, the concentration behaviour of such data may be dependent on the dimension of the space compared to the much better behaved strongly log concave measures that also exhibit sub-Gaussian concentration.

The principal focus of this work is to use non-asymptotic concentration inequalities to guarantee finite sample performance. Past articles are focused on proximity of their estimator to truth in operator norm as the main metric of success due to convergence in operator norm implying convergence of the eigenvalues and eigenvectors. While asymptotically, such methods have elegant theoretical convergence properties, for finite samples one can achieve better performance in operator norm distance by simply choosing the empirical diagonal matrix as an estimator, that is, the empirical estimator with off-diagonal entries set to zero. In the supplementary material, we rerun some of the numerical simulations from Rothman et al. (2009) and demonstrate that for Gaussian data the empirical diagonal matrix achieves better performance than all of the universal threshold estimators for data in $\mathbb{R}^{500}$ for a sample of size $n=100$. For sub-exponential data, albeit outside the scope of their work, the empirical diagonal matrix dominated all threshold estimators in operator norm distance even when $d<n$. We thus strongly argue that the main metric of success for such sparse estimators is support recovery of the true covariance matrix.

The main theoretical results of this work are Theorem 3.1, which establishes how to fix a false positive rate for threshold estimators devoid of any distributional assumptions, and Theorem 4.2, which demonstrates support recovery—both zero and non-zero entries—in the specific case that the data has a strongly log concave measure. In the case that the data is instead sub-exponential or bounded, we do not achieve a similar limit theorem, but are still able to achieve good performance in numerical simulations. Of independent interest is Lemma 3.4, which establishes a symmetrization result for sparse random matrices making use of the techniques in Latała (2005).

2 Sparse Estimation Procedure

Let $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ be a sample of $n$ independent and identically distributed mean zero random vectors with unknown $d\times d$ covariance matrix $\Sigma$ . Define the empirical estimate of $\Sigma$ to be $\hat{\Sigma}^{\mathrm{emp}}=n^{-1}\sum_{i=1}^{n}(X_{i}-\bar{X}){(X_{i}-\bar{X})}^{\mathrm{T}}$ where $\bar{X}=n^{-1}\sum_{i=1}^{n}X_{i}$ . The goal of the following procedure is to construct a sparse estimator, $\hat{\Sigma}^{\mathrm{sp}}$ , for $\Sigma$ by first constructing a non-asymptotic confidence set for $\Sigma$ centred on $\hat{\Sigma}^{\mathrm{emp}}$ and then searching this set for the sparsest member. A search method using threshold estimators is outlined in Section 2.2.

The methodology is as follows:

i. Choose a suitable false positive rate $\rho\in(0,1)$, which will typically be close to zero.

ii. Use Theorem 3.1 to determine the radius of a ball centred at $\hat{\Sigma}^{\mathrm{emp}}$ such that the sparsest matrices in that ball have false positive rate $\rho$.

iii. Use the binary search algorithm in Section 2.2 to identify the sparsest element in the above ball denoted $\hat{\Sigma}^{\mathrm{sp}}$.

iv. Considering this ball as a $(1-\alpha)$-confidence set, use the concentration properties of the data to control the true positive rate.

Note that we will in practice normalize $\hat{\Sigma}^{\mathrm{emp}}$ to have unit diagonal in order to consistently recover the support.
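The empirical estimate and its unit-diagonal normalization can be sketched as below, assuming NumPy and an $n\times d$ data array whose rows are the observations. The function names are hypothetical; the divisor is $n$, matching the definition of $\hat{\Sigma}^{\mathrm{emp}}$ above rather than the unbiased $n-1$ version.

```python
import numpy as np

def empirical_covariance(x: np.ndarray) -> np.ndarray:
    """Empirical covariance n^{-1} sum (X_i - Xbar)(X_i - Xbar)^T for rows X_i."""
    n = x.shape[0]
    centred = x - x.mean(axis=0)
    return centred.T @ centred / n

def to_unit_diagonal(sigma: np.ndarray) -> np.ndarray:
    """Rescale to D^{-1/2} sigma D^{-1/2}, where D is the diagonal of sigma."""
    scale = 1.0 / np.sqrt(np.diag(sigma))
    return sigma * np.outer(scale, scale)

rng = np.random.default_rng(0)             # fixed seed, purely for illustration
x = rng.standard_normal((100, 5))          # n = 100 observations in dimension d = 5
sigma_emp = empirical_covariance(x)
corr_emp = to_unit_diagonal(sigma_emp)
print(np.diag(corr_emp))                   # all ones after normalization
```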

2.1 Concentration Confidence Set

The first step is to construct a confidence set for $\Sigma$ about $\hat{\Sigma}^{\mathrm{emp}}$ . Theoretical justification of the following is provided in Section 3.

Given a false positive rate $0<\rho\leq 0.5$ , we construct a ball $B_{\rho}$ centred on $\hat{\Sigma}^{\mathrm{emp}}$ as follows.

i. Find $\eta=2^{a}\rho\in(0.5,1]$ for some $a\in\mathbb{Z}^{+}$.

ii. Compute $\lambda$, the $\eta$-quantile of the magnitudes of the off-diagonal entries in $\hat{\Sigma}^{\mathrm{emp}}$. That is, $\lambda>0$ is the smallest real number such that

$$\frac{\#\{\lvert\hat{\sigma}_{i,j}\rvert>\lambda \mid i<j\}}{d(d-1)/2}\leq\eta.$$

iii. Apply hard thresholding to $\hat{\Sigma}^{\mathrm{emp}}$ with threshold $\lambda$ to get $\hat{\Sigma}^{\mathrm{emp}}_{\lambda}$ whose entries are

$$(\hat{\Sigma}^{\mathrm{emp}}_{\lambda})_{i,j}=\left\{\begin{array}{ll}\hat{\sigma}_{i,j}&i=j\text{ or }\lvert\hat{\sigma}_{i,j}\rvert\geq\lambda\\ 0&\text{otherwise}\end{array}\right.$$

that is, off-diagonal entries are set to zero if they were originally less than $\lambda$ in magnitude.

iv. Construct the operator norm ball about $\hat{\Sigma}^{\mathrm{emp}}$ of radius $r=2^{a}\lVert\hat{\Sigma}^{\mathrm{emp}}-\hat{\Sigma}^{\mathrm{emp}}_{\lambda}\rVert_{\infty}$.

v. Use a suitable concentration inequality to determine a bound on the coverage of this ball as a confidence set.

What we have now is

$$B_{\rho}=\left\{\Pi\in\mathbb{R}^{d\times d}:\lVert\Pi-\hat{\Sigma}^{\mathrm{emp}}\rVert_{\infty}\leq r\right\}.$$

This set will be searched for its sparsest member using the algorithm in the following subsection.
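Steps i to iv of this construction can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the function name `confidence_ball_radius` is hypothetical, $\lambda$ is taken to be the order statistic of the off-diagonal magnitudes that leaves at most a fraction $\eta$ of them strictly above it, and $\lambda=0$ is used when $\eta=1$ so that nothing is thresholded.

```python
import math
import numpy as np

def confidence_ball_radius(sigma_emp: np.ndarray, rho: float):
    """Return (r, lam, eta, a) for the operator norm ball centred at sigma_emp."""
    if not 0.0 < rho <= 0.5:
        raise ValueError("rho must lie in (0, 0.5]")
    d = sigma_emp.shape[0]

    # Step i: double rho until eta = 2^a * rho lands in (0.5, 1].
    a, eta = 0, rho
    while eta <= 0.5:
        eta *= 2.0
        a += 1

    # Step ii: smallest lam with at most a fraction eta of magnitudes above it.
    mags = np.sort(np.abs(sigma_emp[np.triu_indices(d, k=1)]))
    n_pairs = mags.size
    n_above = math.floor(eta * n_pairs + 1e-12)   # allowed count above lam
    lam = float(mags[n_pairs - n_above - 1]) if n_above < n_pairs else 0.0

    # Step iii: hard thresholding, keeping the diagonal untouched.
    keep = np.abs(sigma_emp) >= lam
    np.fill_diagonal(keep, True)
    sigma_lam = np.where(keep, sigma_emp, 0.0)

    # Step iv: radius is 2^a times the operator norm distance.
    r = (2.0 ** a) * np.linalg.norm(sigma_emp - sigma_lam, ord=2)
    return float(r), lam, eta, a

# Small hand-built example with off-diagonal magnitudes 0.1, 0.2, ..., 0.6.
sigma_emp = np.eye(4)
pairs = [(0, 1, 0.1), (0, 2, -0.2), (0, 3, 0.3), (1, 2, 0.4), (1, 3, -0.5), (2, 3, 0.6)]
for i, j, v in pairs:
    sigma_emp[i, j] = sigma_emp[j, i] = v

r, lam, eta, a = confidence_ball_radius(sigma_emp, rho=0.2)
print(r, lam, eta, a)
```

With $\rho=0.2$ the doubling gives $a=2$ and $\eta=0.8$, only the entry of magnitude 0.1 is thresholded, and the radius is four times the operator norm of that single removed pair.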

2.2 Thresholding within confidence sets

A generalized thresholding operator, as defined in Rothman et al. (2009), is $s_{\lambda}(\cdot):\mathbb{R}\rightarrow\mathbb{R}$ such that

$$\lvert s_{\lambda}(z)\rvert\leq\lvert z\rvert,\quad s_{\lambda}(z)=0\text{ for }\lvert z\rvert\leq\lambda,\quad\text{and}\quad\lvert s_{\lambda}(z)-z\rvert\leq\lambda,$$

which will apply element-wise to a matrix. In the past, such an operator has been applied to the empirical estimate $\hat{\Sigma}^{\mathrm{emp}}$ for some $\lambda$ generally chosen via cross validation. Instead of directly choosing a threshold $\lambda$, our approach is to find the largest $\lambda$ such that $d(s_{\lambda}(\hat{\Sigma}^{\mathrm{emp}}),\hat{\Sigma}^{\mathrm{emp}})\leq r$.

i. Set $\hat{\Sigma}^{\mathrm{sp}}_{0}=(\hat{\Sigma}^{\mathrm{diag}})^{-1/2}(\hat{\Sigma}^{\mathrm{emp}})(\hat{\Sigma}^{\mathrm{diag}})^{-1/2}$ to be the empirical estimator normalized to have a diagonal of ones. Initialize the threshold to $\lambda=0.5$ and write $\hat{\Sigma}^{\mathrm{sp}}_{\lambda}=s_{\lambda}(\hat{\Sigma}^{\mathrm{emp}})$. Let $k=1$ be the number of steps of the recursion. Choose a false positive rate $\rho$ and compute $r$ as in the previous section.

ii. Increase $k\leftarrow k+1$, then update $\lambda$ as follows.

(a) If $d(\hat{\Sigma}^{\mathrm{sp}}_{\lambda},\hat{\Sigma}^{\mathrm{emp}})\leq r$, set $\lambda\leftarrow\lambda+2^{-k-1}$.

(b) Otherwise, set $\lambda\leftarrow\lambda-2^{-k-1}$.

iii. Repeat step ii until $k$ has reached the desired number of iterations. Generally, as few as $k=10$ will suffice.

iv. The resulting estimator is $\hat{\Sigma}^{\mathrm{sp}}=(\hat{\Sigma}^{\mathrm{diag}})^{1/2}(\hat{\Sigma}^{\mathrm{sp}}_{\lambda})(\hat{\Sigma}^{\mathrm{diag}})^{1/2}$ where $\hat{\Sigma}^{\mathrm{sp}}_{\lambda}$ is the final matrix resulting from this recursion.
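A minimal sketch of this binary search is given below, assuming NumPy, the operator norm as the metric $d(\cdot,\cdot)$, hard thresholding as $s_{\lambda}$, and an input that has already been normalized to unit diagonal. Two choices here are assumptions rather than statements from the text: the step at iteration $k=1,2,\ldots$ is taken to be $2^{-(k+1)}$ so that the search can reach any threshold in $(0,1)$, and the largest threshold actually verified to lie in the ball is returned, so that the output never falls outside it. The names `hard_threshold` and `search_threshold` are hypothetical.

```python
import numpy as np

def hard_threshold(mat: np.ndarray, lam: float) -> np.ndarray:
    """Zero the off-diagonal entries with magnitude at most lam."""
    out = np.where(np.abs(mat) > lam, mat, 0.0)
    np.fill_diagonal(out, np.diag(mat))       # the diagonal is never thresholded
    return out

def search_threshold(corr_emp, r, n_iter=10, s=hard_threshold):
    """Binary search for the largest lam with ||s(corr_emp, lam) - corr_emp|| <= r."""
    lam, best = 0.5, 0.0                      # lam = 0 is always inside the ball
    for k in range(1, n_iter + 1):
        dist = np.linalg.norm(s(corr_emp, lam) - corr_emp, ord=2)
        step = 2.0 ** (-(k + 1))
        if dist <= r:
            best = max(best, lam)             # remember the verified threshold
            lam += step                       # still in the ball: threshold more
        else:
            lam -= step                       # outside the ball: threshold less
    return best, s(corr_emp, best)

# Toy correlation matrix: one strong pair (0.8) and two weak pairs (0.1, 0.05).
corr_emp = np.eye(3)
for i, j, v in [(0, 1, 0.8), (0, 2, 0.1), (1, 2, 0.05)]:
    corr_emp[i, j] = corr_emp[j, i] = v

lam_hat, corr_sp = search_threshold(corr_emp, r=0.12)
print(lam_hat)
print(corr_sp)
```

In this toy case removing both weak pairs costs about 0.112 in operator norm, which fits inside the radius 0.12, whereas removing the strong pair does not, so the search settles just below 0.8 and keeps only the strong pair. Mapping back to the covariance scale as in step iv is a final rescaling by $(\hat{\Sigma}^{\mathrm{diag}})^{1/2}$ on both sides.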

Remark 2.1 (Positive Definite Estimators).

If $\hat{\Sigma}^{\mathrm{sp}}$ is not positive semi-definite, then it can be projected onto the space of positive semi-definite matrices. A standard past approach is to map the negative eigenvalues to zero or to their absolute value, which maintains the eigen-structure. However, such a projection will have an adverse effect on the support recovery problem as the estimator will no longer be sparse. An alternative is to map $\hat{\Sigma}^{\mathrm{sp}}\rightarrow\hat{\Sigma}^{\mathrm{sp}}+\gamma I_{d}$ for some $\gamma>0$ large enough to make the result positive definite. This will not affect the recovered support of the matrix. More clever projections may also be possible.
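The diagonal shift in this remark can be illustrated with a short Python sketch. The function name `shift_to_positive_definite` and the tolerance `eps` are hypothetical choices made here for illustration; the remark itself only asks for some $\gamma>0$ that is large enough.

```python
import numpy as np


def shift_to_positive_definite(S_sp, eps=1e-6):
    """Add gamma * I so that the smallest eigenvalue is at least eps.

    Only the diagonal changes, so the off-diagonal support is preserved.
    """
    lam_min = np.linalg.eigvalsh(S_sp).min()   # S_sp is assumed symmetric
    gamma = max(0.0, eps - lam_min)
    return S_sp + gamma * np.eye(S_sp.shape[0]), gamma
```

For example, the symmetric matrix with unit diagonal and off-diagonal entries equal to 2 has eigenvalues $-1$ and $3$, so the shift is slightly larger than 1 and the off-diagonal entries are left untouched.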

In the case that the metric $d(\cdot,\cdot)$ is a monotonically increasing function of the Hilbert-Schmidt / Frobenius norm $\lVert\hat{\Sigma}^{\mathrm{sp}}_{\lambda}-\hat{\Sigma}^{\mathrm{emp}}\rVert_{2}$ or another entrywise norm, then the sequence $d(\hat{\Sigma}^{\mathrm{sp}}_{\lambda},\hat{\Sigma}^{\mathrm{emp}})$ will be increasing in $\lambda$ .

Proposition 2.2.

In the context of the above algorithm, if $\lambda_{1}>\lambda_{2}$ , then for any $p,q$ , we have

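$$\lVert\hat{\Sigma}^{\mathrm{sp}}_{\lambda_{1}}-\hat{\Sigma}^{\mathrm{emp}}\rVert_{p,q}\geq\lVert\hat{\Sigma}^{\mathrm{sp}}_{\lambda_{2}}-\hat{\Sigma}^{\mathrm{emp}}\rVert_{p,q}.$$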

Proof.

As $\lambda_{1}>\lambda_{2}$ , the entries of the matrix $\hat{\Sigma}^{\mathrm{sp}}_{\lambda_{1}}-\hat{\Sigma}^{\mathrm{emp}}$ are equal to or larger in absolute value than the corresponding entries of $\hat{\Sigma}^{\mathrm{sp}}_{\lambda_{2}}-\hat{\Sigma}^{\mathrm{emp}}$ . Hence $\lVert\hat{\Sigma}^{\mathrm{sp}}_{\lambda_{1}}-\hat{\Sigma}^{\mathrm{emp}}\rVert_{p,q}\geq\lVert\hat{\Sigma}^{\mathrm{sp}}_{\lambda_{2}}-\hat{\Sigma}^{\mathrm{emp}}\rVert_{p,q}$ by Definition 1.3. ∎

This property guarantees that the above algorithm will find the sparsest $\hat{\Sigma}^{\mathrm{sp}}$ in the confidence set in the sense of having the largest threshold possible. However, for an arbitrary metric or specifically other $p$ -Schatten norms, this sequence may not necessarily be increasing in $\lambda$ . Another commonly used norm, which will be shown in Section 5 to give superior performance in simulation, is the operator norm $\lVert\hat{\Sigma}^{\mathrm{sp}}_{\lambda}-\hat{\Sigma}^{\mathrm{emp}}\rVert_{\infty}$ , which does not yield a monotonically increasing sequence. This sequence is nevertheless roughly increasing in the sense that it is lower bounded by definition by the maximum $\ell^{2}$ norm of the columns of $\hat{\Sigma}^{\mathrm{sp}}_{\lambda}-\hat{\Sigma}^{\mathrm{emp}}$ , which is an increasing sequence. Furthermore, it is upper bounded by the maximum $\ell^{1}$ norm of the columns of $\hat{\Sigma}^{\mathrm{sp}}_{\lambda}-\hat{\Sigma}^{\mathrm{emp}}$ , which follows from the Gershgorin circle theorem (Iserles, 2009), and which is also an increasing sequence.
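The sandwich described above can be checked numerically. The sketch below is an illustration only: it assumes hard thresholding of a sample correlation matrix, and the helper `norm_paths` is a hypothetical name. It records the Frobenius norm, the operator norm, and the two column-norm bounds along a grid of thresholds.

```python
import numpy as np


def hard_threshold(M, lam):
    """Entrywise hard thresholding: keep entries with |m_ij| > lam."""
    return np.where(np.abs(M) > lam, M, 0.0)


def norm_paths(R, lams):
    """Norms of s_lam(R) - R along a grid of thresholds."""
    frob, oper, col_l2, col_l1 = [], [], [], []
    for lam in lams:
        D = hard_threshold(R, lam) - R
        frob.append(np.linalg.norm(D, ord="fro"))
        oper.append(np.linalg.norm(D, ord=2))            # largest singular value
        col_l2.append(np.linalg.norm(D, axis=0).max())   # max column l2 norm
        col_l1.append(np.abs(D).sum(axis=0).max())       # max column l1 norm
    return tuple(np.array(v) for v in (frob, oper, col_l2, col_l1))


rng = np.random.default_rng(1)
R = np.corrcoef(rng.standard_normal((50, 30)), rowvar=False)
frob, oper, col_l2, col_l1 = norm_paths(R, np.linspace(0.0, 0.9, 19))
```

On such a grid the Frobenius path and both column-norm paths are non-decreasing, while the operator norm stays between the two column-norm bounds without being forced to be monotone itself.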

3 Fixing a false positive rate

For many sparse matrix estimation methods, theorems demonstrating sparsistency are proved. These indicate that in some asymptotic sense, the correct support of the true matrix will eventually be recovered generally as $n$ and $d$ grow together at some rate. However, none provide a method for fixing a false positive rate and finding an estimator that satisfies such a rate, which is certainly of interest to any practitioner with a finite fixed sample size. Hence, we present a method for tuning our parameter $\alpha$ to a desired false positive rate for the covariance estimator.

Before proceeding, we will require a class of sparse matrices similar to those from Bickel and Levina (2008a, b); Rothman et al. (2009); Cai and Liu (2011). Specifically, let

[TABLE]

For the results regarding the false positive rate, we are not concerned with the lower bound $\delta$ and only with $\kappa$ , the maximum number of non-zero entries per column or row. As long as $\kappa$ increases more slowly than the dimension $d$ , which is made specific below, we can achieve a desired false positive rate without interference.

For an estimator $\tilde{\Sigma}\in\mathbb{R}^{d\times d}$ , the false positive rate is

[TABLE]

where $\sigma_{i,j}$ is the $ij$ th entry of the true covariance matrix and $\tilde{\sigma}_{i,j}$ is the $ij$ th entry of the estimator $\tilde{\Sigma}$ . Hence, we are counting the number of non-zero entries in our estimator that should have been zero. For notation, let $\hat{\Sigma}^{\mathrm{emp}}$ be the usual empirical estimate of the covariance matrix. Let $\hat{\Sigma}^{\mathrm{emp}}_{0}$ be the empirical estimator with all off diagonal entries set to zero thus guaranteeing a false positive rate of zero. For $\eta\geq 0.5$ , let $\hat{\Sigma}^{\mathrm{emp}}_{\eta}$ be the empirical estimator after application of the strong threshold operator with threshold $M_{\eta}=\text{quantile}(\lvert\hat{\sigma}_{i,j}\rvert,\eta\,:\,i>j)$ , which removes $100(1-\eta)$ % of the off diagonal entries achieving a false positive rate of approximately $(1-\eta)$ due to the following lemma.

Lemma 3.1.

Let $\Sigma\in\mathcal{U}(\kappa,\delta)$ from Equation 3 with $\kappa=o(d^{\nu})$ . Let the $\eta\in[0.5,1)$ threshold, $M_{\eta}$ , be the $\eta$ quantile of $\lvert\hat{\sigma}_{i,j}\rvert$ with $i>j$ , and let the corresponding thresholded estimator be $\hat{\Sigma}^{\mathrm{emp}}_{\eta}=s_{M_{\eta}}(\hat{\Sigma}^{\mathrm{emp}})$ with $ij$ th entry denoted $\hat{\sigma}^{(\eta)}_{i,j}$ . Then, denoting $\hat{\eta}=\#\{(i,j)\,|\,i>j,\lvert\hat{\sigma}^{(\eta)}_{i,j}\rvert>0,\sigma_{i,j}=0\}[d(d-1)/2]^{-1}$ , we have that for $\varepsilon>0$

[TABLE]

for some constant $C>0$ .

Remark 3.2.

For this lemma, we want the $\eta$ -quantile of the mean zero entries, but have to work with the $\eta$ -quantile of the entire collection, which is contaminated by a small number of elements with non-zero mean. For $\nu<1$ , the error is $O(d^{\nu-1})$ ; hence, for $\eta\approx 0.5$ , thresholding based on the $\eta$ -quantile suffices for large enough $d$ . For small $\eta$ , say $\eta\approx d^{-1}$ , we have to work harder, which motivates Theorem 3.1 below.
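The quantile-based threshold $M_{\eta}$ can be computed directly from the off-diagonal entries. The following Python sketch is illustrative only: `quantile_threshold` is a hypothetical name, the diagonal is always retained, and an entry is kept when its magnitude strictly exceeds $M_{\eta}$, so roughly a fraction $1-\eta$ of the off-diagonal entries survives.

```python
import numpy as np


def quantile_threshold(S_emp, eta):
    """Hard-threshold the off-diagonal entries at their eta-quantile."""
    d = S_emp.shape[0]
    lower = np.tril_indices(d, k=-1)                  # entries with i > j
    M_eta = np.quantile(np.abs(S_emp[lower]), eta)    # sample quantile
    on_diag = np.eye(d, dtype=bool)
    keep = (np.abs(S_emp) > M_eta) | on_diag
    return np.where(keep, S_emp, 0.0), M_eta
```

With $\eta=0.5$ the threshold is the median absolute off-diagonal entry, so about half of the off-diagonal entries are set to zero.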

As noted in the remark, we cannot continue to threshold based on the sample quantiles for very small false positive rates. However, using the matrices, $\hat{\Sigma}^{\mathrm{emp}}_{\eta}$ and $\hat{\Sigma}^{\mathrm{emp}}_{0}$ , as reference points, we can interpolate via the following theorem to achieve any desired false positive rate.

Theorem 3.1.

Let $\Sigma\in\mathcal{U}(\kappa,\delta)$ from Equation 3 with $\kappa=O(d^{\nu})$ for $\nu<1/2$ . Given a desired false positive rate, $\rho\in(0,0.5]$ , and $\eta=\rho 2^{a}\in(0.5,1]$ for some $a\in\mathbb{Z}^{+}$ , let $\hat{\Sigma}^{\mathrm{emp}}_{\rho}$ be the hard thresholded empirical estimator that achieves a false positive rate of $\rho$ . Then,

[TABLE]

where $K_{1},K_{2}$ are universal constants.

Remark 3.3.

The above Theorem 3.1 is wholly uninteresting for large values of $n$ . However, its power arises in the non-asymptotic realm of interest—namely when $d\gg n$ —and also from highlighting the interplay between the dimension, sample size, and $\rho$ , the sparseness of the estimator. Furthermore, this result does not require any distributional assumption. It also does not require any assumption on the lower bound $\delta$ on the non-zero $\lvert\sigma_{i,j}\rvert$ as it is only concerned with the $\sigma_{i,j}$ that are zero.

The proof of the above theorem relies on the following lemma involving symmetrization of random covariance matrices, which may be of independent interest.

Lemma 3.4.

Let $R\in\mathbb{R}^{d\times d}$ be a real valued symmetric random matrix with zero diagonal and mean zero entries bounded by 1 and not necessarily iid, and let $B\in\{0,1\}^{d\times d}$ be an iid symmetric Bernoulli random matrix with entries $b_{i,j}=b_{j,i}\sim\mathrm{Bernoulli}\left(\rho\right)$ for $\rho\in(0,1)$ . Denoting the entrywise or Hadamard product by $\circ$ , let $A=R\circ B$ . Let $\mathcal{E}\in\{-1,1\}^{d\times d}$ be a symmetric random matrix with iid Rademacher entries $\varepsilon_{i,j}$ for $j<i$ and $\varepsilon_{i,j}=\varepsilon_{j,i}$ . Then,

[TABLE]

where $K_{1},K_{2}$ are universal constants.

4 Concentration Confidence Sets

The following three subsections detail different assumptions on the data under scrutiny and the specific concentration results that apply in these cases. We consider sub-Gaussian concentration for log concave measures and for bounded random variables. We also consider sub-exponential concentration. However, this collection is by no means exhaustive. Given the wide variety of concentration inequalities being developed, our approach can be applied much more widely than to merely these three settings.

Let $d(\cdot,\cdot)$ be some metric measuring the distance between two covariance matrices, and let $\psi:\mathbb{R}\rightarrow\mathbb{R}$ be monotonically increasing. Then, the general form of the concentration inequalities is

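$$\mathrm{P}\left(d(\Sigma,\hat{\Sigma}^{\mathrm{emp}})\geq\mathrm{E}d(\Sigma,\hat{\Sigma}^{\mathrm{emp}})+r\right)\leq\mathrm{e}^{-\psi(r)},$$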

which is a bound on the tail of the distribution of $d(\Sigma,\hat{\Sigma}^{\mathrm{emp}})$ as it deviates above its mean. Thus, to construct a $(1-\alpha)$ -confidence set, the variable $r=r_{\alpha}$ is chosen such that $\exp(-\psi(r_{\alpha}))=\alpha$ .
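As a worked instance of choosing $r_{\alpha}$, suppose $\psi$ has the sub-Gaussian form $\psi(r)=nc_{0}r^{2}/2$. This particular $\psi$ is an assumption made here for illustration; it is the form that reproduces the radius quoted in Section 4.1.

```latex
\exp\bigl(-\psi(r_{\alpha})\bigr)=\alpha ,
\qquad \psi(r)=\tfrac{1}{2}\,n c_{0}\, r^{2}
\;\Longrightarrow\;
\tfrac{1}{2}\,n c_{0}\, r_{\alpha}^{2}=-\log\alpha
\;\Longrightarrow\;
r_{\alpha}=\sqrt{(-2/nc_{0})\log\alpha}.
```

For instance, with $n=50$, $c_{0}=1$, and $\alpha=0.05$, this gives $r_{\alpha}\approx 0.346$.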

Now, let $\hat{\Sigma}^{\mathrm{sp}}$ be our sparse estimator for $\Sigma$ . We want these two to be close in the sense of the above confidence set and therefore choose a $\hat{\Sigma}^{\mathrm{sp}}$ such that $d(\hat{\Sigma}^{\mathrm{sp}},\hat{\Sigma}^{\mathrm{emp}})\leq r_{\alpha}$ . Consequently, we have that

[TABLE]

Hence, we choose $\hat{\Sigma}^{\mathrm{sp}}$ close enough to $\hat{\Sigma}^{\mathrm{emp}}$ to share its elegant concentration properties, but far enough away to result in a better estimator for $\Sigma$ .

4.1 Log Concave Measures

In this section, the general methods from Section 2 are specialized for an iid sample $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ whose common measure $\mu$ is strongly log-concave. This property implies dimension-free sub-Gaussian concentration and includes such common distributions as the multivariate Gaussian, Chi, and Dirichlet distributions.

Definition 4.1 (Strongly log-concave measure).

A measure $\mu$ on $\mathbb{R}^{d}$ is strongly log-concave if there exists a $c>0$ such that $d\mu=\mathrm{e}^{-U(x)}dx$ and $\text{Hess}(U)-cI_{d}\geq 0$ (i.e. is non-negative definite) where $\text{Hess}(U)$ is the $d\times d$ matrix of second derivatives.

From Corollary S4.5, let $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ have measures $\mu_{1},\ldots,\mu_{n}$ , which are all strongly log-concave with coefficients $c_{1},\ldots,c_{n}$ . Let $\nu=\mu_{1}\otimes\ldots\otimes\mu_{n}$ be the product measure on $\mathbb{R}^{d\times n}$ . Then, for any $1$ -Lipschitz $\phi:(\mathbb{R}^{d})^{n}\rightarrow\mathbb{R}$ and for any $r>0$ ,

[TABLE]

This follows from Theorem S4.4 and the other results contained within the supplementary material. For a detailed exposition of how sub-Gaussian concentration is established for log concave measures, see Chapter 5 of Ledoux (2001). Examples include the multivariate Gaussian and the Dirichlet distributions.

To make use of the above result, we must choose a suitable Lipschitz function $\phi(\cdot)$ . Let $X_{1},\ldots,X_{n},X\in\mathbb{R}^{d}$ be independent and identically distributed random variables with covariance $\Sigma$ and with a common strongly log-concave measure $\mu$ with coefficient $c>0$ . Let $\lambda_{1}\geq\ldots\geq\lambda_{d}$ be the eigenvalues of $\Sigma$ and $\Lambda=(\lambda_{1},\ldots,\lambda_{d})$ . For some $p\in[1,\infty]$ , let $\lVert\cdot\rVert_{p}$ be the $p$ -Schatten norm, which in this case is $\lVert\Sigma\rVert_{p}=\lVert\Lambda\rVert_{\ell^{p}}$ . Note that $\lVert X{X}^{\mathrm{T}}\rVert_{p}=\lVert X\rVert_{\ell^{2}}^{2}$ for any $p\in[1,\infty]$ . Define the function $\phi$ to be $\phi(X_{1},\ldots,X_{n})=\left\lVert\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\mathrm{E}X){(X_{i}-\mathrm{E}X)}^{\mathrm{T}}\right\rVert_{p}^{1/2}.$

For $p\in\{1,2,\infty\}$ , we have that $\phi$ is Lipschitz with coefficient $\lVert\phi\rVert_{\mathrm{Lip}}=n^{-1/2}$ with respect to the Frobenius or Hilbert-Schmidt metric, which is established in Proposition S3.5 for $p=2$ and $p=\infty$ and in Proposition S3.2 for $p=1$ . That is, let $X_{1},\ldots,X_{n},Y_{1},\ldots,Y_{n}\in\mathbb{R}^{d}$ , and denote ${\bf X}=(X_{1},\ldots,X_{n})$ and ${\bf Y}=(Y_{1},\ldots,Y_{n})$ , then $\lvert\phi({\bf X})-\phi({\bf Y})\rvert\leq n^{-1/2}d_{2,2}({\bf X},{\bf Y})=\left(\frac{1}{n}\sum_{i=1}^{n}\lVert X_{i}-Y_{i}\rVert_{\ell^{2}}^{2}\right)^{1/2}.$ From here, the procedure outlined in Section 2 can be considered with the given $\phi$ and $r_{\alpha}=\sqrt{(-2/nc_{0})\log\alpha}$ .
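The Lipschitz bound can be checked numerically on simulated data. The sketch below is illustrative and rests on assumptions made here: the mean is taken to be known and equal to zero, observations are stored as the rows of an $n\times d$ array, and the helper names `schatten_norm` and `phi` are hypothetical.

```python
import numpy as np


def schatten_norm(S, p):
    """p-Schatten norm of a symmetric matrix via its eigenvalues."""
    eigs = np.abs(np.linalg.eigvalsh(S))
    return np.linalg.norm(eigs, ord=p)


def phi(X, p):
    """Square root of the p-Schatten norm of (1/n) sum_i X_i X_i^T."""
    n = X.shape[0]
    return schatten_norm(X.T @ X / n, p) ** 0.5


rng = np.random.default_rng(2)
n, d = 40, 15
X = rng.standard_normal((n, d))
Y = X + 0.1 * rng.standard_normal((n, d))

# Right-hand side of the Lipschitz inequality: n^{-1/2} * Frobenius distance
bound = np.linalg.norm(X - Y, ord="fro") / np.sqrt(n)
gaps = {p: abs(phi(X, p) - phi(Y, p)) for p in (1, 2, np.inf)}
```

Each entry of `gaps` should not exceed `bound`. The same helper also confirms that a rank-one matrix $X{X}^{\mathrm{T}}$ has the same $p$ -Schatten norm, $\lVert X\rVert_{\ell^{2}}^{2}$, for every $p$.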

In many cases, including the two examples above, the constructed confidence set is completely dimension-free. Thus, even mild assumptions on the relationship between the sample size $n$ and the dimension $d$ , such as $\log d=o(n^{1/3})$ from the adaptive soft thresholding estimator of Cai and Liu (2011), are not needed to prove consistency in our setting. Furthermore, the concentration inequalities immediately give us a fast rate of convergence as long as $-\log\alpha=o(n)$ with a proof provided in the supplementary material.

Theorem 4.1.

Let $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ be iid with common measure $\mu$ . Let $\mu$ be strongly log-concave with some fixed constant $c_{0}$ from Definition 4.1. Then, for $\alpha\in(0,1)$ , $p\in[1,\infty]$ , and $r_{\alpha}=\sqrt{(-2/nc_{0})\log\alpha}$ ,

[TABLE]

Remark 4.2.

This theorem effectively says that choosing an estimator in the ball centred around $\hat{\Sigma}^{\mathrm{emp}}$ cannot be too bad assuming the niceness of log-concave measures. It also tells us how fast we can shrink the ball as $n$ increases.

A second and arguably more important issue (see the supplementary material) in the setting of sparse covariance estimation is that of support recovery or “sparsistency” (Lam and Fan, 2009; Rothman et al., 2009). To recover the support of a covariance matrix, that is, to determine which entries $\sigma_{i,j}\neq 0$ , we will require the class of sparse matrices from Equation 3. In past work, a notion of “approximate sparsity” is considered where the first condition in $\mathcal{U}(\kappa,\delta)$ is replaced with $\max_{i=1,\ldots,d}\sum_{j=1}^{d}\lvert\sigma_{i,j}\rvert^{q}<\kappa$ for $q\in[0,1)$ . However, once we bound the non-zero entries away from zero by some $\delta$ , such “approximate sparsity” implies standard sparsity with $q=0$ . It is worth noting that the above Theorem 4.1 does not require such a sparsity class, because our estimator is forced to remain close enough to $\hat{\Sigma}^{\mathrm{emp}}$ to follow $\hat{\Sigma}^{\mathrm{emp}}$ ’s convergence to $\Sigma$ .

Theorem 4.2.

Let $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ be iid with common measure $\mu$ . Let $\mu$ be strongly log-concave with some fixed constant $c_{0}$ from Definition 4.1. Furthermore, let $\Sigma\in\mathcal{U}(\kappa,\delta)$ and let $\delta=O(n^{-1+\varepsilon})$ for any $\varepsilon>0$ . Then, for $\hat{\Sigma}^{\mathrm{sp}}$ denoting the concentration estimator using the hard thresholding estimation from Section 2.2 with the operator norm metric,

[TABLE]

where $\mathrm{supp}(\Sigma)=\{(i,j):\sigma_{i,j}\neq 0\}$ .

Remark 4.3.

Note that the condition that $\delta=O(n^{-1+\varepsilon})$ allows for a much quicker decay of the non-zero entries of $\Sigma$ than in El Karoui (2008) where the lower bound is of the form $Cn^{-\alpha_{0}}$ with $0<\alpha_{0}<1/2$ . It is also much quicker than the similar rate achieved in Rothman et al. (2009) where the lower bound is any $\tau$ such that $\sqrt{n}\tau$ increases faster than $\sqrt{\log(d)}$ with the enforced asymptotic condition that $\log(d)/n=o(1)$ resulting in a rate no faster than $n^{-1/2}$ . Though, it is worth noting that if $\delta$ decays to zero at a faster rate, then the above convergence rate for support recovery slows as can be seen in the proof.

5 Numerical simulations

In the following subsections, we apply the methods from the previous sections to three multivariate distributions of interest: the Gaussian, Laplace, and Rademacher distributions. In doing so, we apply Theorem 3.1 to analytically determine the ideal confidence ball radius in order to construct a sparse estimator of $\Sigma$ . We also compare the support recovery of our approach against penalized estimators and standard application of universal threshold estimators.

As mentioned before, our proposed concentration confidence set based method has a similar feel to regularized / penalized estimators as the larger the constructed confidence set is, the sparser the returned estimator will be. Thus, we compare our approach with the following lasso style estimator from the R package PDSCE (Rothman, 2013), which optimizes

[TABLE]

with $\tau,\lambda>0$ . Here, the $\log\det$ term is used to enforce positive definiteness of the final solution, and $\lVert\cdot\rVert_{\ell^{1}}$ is the lasso style penalty, which enforces sparsity.

The similar method from the R package spcov (Bien and Tibshirani, 2012), which uses a majorize-minimize algorithm to determine

[TABLE]

for some penalization $\lambda>0$ , was also considered but proved to run too slowly on high dimensional matrices—that is, $d\geq 200$ —to be included in the numerical experiments.

Of course, we also compare our method against the four universal thresholding estimators applied to the empirical covariance matrix from (Rothman et al., 2009), Hard, Soft, SCAD, and Adaptive LASSO:

[TABLE]

where $\hat{\sigma}_{i,j}$ is the $(i,j)$ th entry of the empirical covariance estimate, $a=3.7$ , and $\eta=1$ . The parameter $\lambda>0$ is the threshold, which is chosen in practice via cross validation with respect to the Hilbert-Schmidt norm. Briefly, the data is split in half, two empirical estimators are formed, one is thresholded, and $\lambda$ is selected to minimize the Hilbert-Schmidt distance between the one empirical estimate and the other thresholded estimate.
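The four rules can be written as entrywise functions. Since the display defining them is not reproduced in this text, the Python sketch below uses the forms in which these rules are commonly stated in the generalized thresholding literature; treat the exact expressions, in particular the SCAD and adaptive LASSO branches, as assumptions rather than as a transcription of the lost display. The function names are hypothetical, and the defaults `a=3.7` and `eta=1.0` follow the values quoted above.

```python
import numpy as np


def hard(z, lam):
    """Keep z when |z| > lam, otherwise set it to zero."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) > lam, z, 0.0)


def soft(z, lam):
    """Shrink |z| toward zero by lam."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def scad(z, lam, a=3.7):
    """Soft rule near zero, identity for large |z|, linear in between."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    middle = ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return np.where(absz <= 2.0 * lam, soft(z, lam),
                    np.where(absz <= a * lam, middle, z))


def adaptive_lasso(z, lam, eta=1.0):
    """Shrink |z| by lam^(eta+1) / |z|^eta, so large entries shrink less."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    safe = np.where(absz > 0, absz, 1.0)   # avoid 0/0; sign(0) = 0 handles z = 0
    return np.sign(z) * np.maximum(absz - lam ** (eta + 1.0) / safe ** eta, 0.0)
```

All four rules return zero whenever $\lvert z\rvert\leq\lambda$ and never increase the magnitude of an entry, which is what makes them interchangeable inside the cross validation scheme described above.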

5.1 Multivariate Gaussian Data

Let $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ be independent and identically distributed mean zero random vectors with a strongly log-concave measure and covariance matrix $\Sigma$ . By Corollary S4.5, there exists a constant $c_{0}>0$ such that $\mathrm{P}\left(\phi({\bf X})\geq\mathrm{E}\phi({\bf X})+r\right)\leq\mathrm{e}^{-nr^{2}/2c_{0}}$ where $\phi({\bf X})=\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\rVert_{p}^{1/2}$ and $\hat{\Sigma}^{\mathrm{emp}}=n^{-1}\sum_{i=1}^{n}(X_{i}-\bar{X}){(X_{i}-\bar{X})}^{\mathrm{T}}$ is the empirical estimate of the covariance matrix. This results in the size $1-\alpha$ confidence set for $\Sigma$

[TABLE]

for $\alpha\in(0,1)$ . In the notation of Section 4.1, $r_{\alpha}=\sqrt{(-2c_{0}/n)\log\alpha}$ .

In the multivariate Gaussian case, $c_{0}$ is the maximal eigenvalue of the covariance matrix $\Sigma$ . As mentioned before, we avoid any issues of estimating $c_{0}$ in practice. Regardless of our choice for $c_{0}$ , tuning the regularization parameter $\alpha$ to a specific false positive rate negates the need for an accurate estimate of $c_{0}$ .

Table 1 displays false positive and true positive percentages for seven sparse estimators computed over 100 replications of a random sample of size $n=50$ of $d=50,100,200,500$ dimensional multivariate Gaussian data with a tri-diagonal covariance matrix $\Sigma$ whose diagonal entries are 1 and whose off-diagonal entries are 0.3. We can clearly see that the concentration-based estimator approaches the desired false positive rate—either 1% or 5%—as the dimension increases. In contrast, the thresholding estimators with threshold $\lambda$ chosen via cross validation generally start with higher false positive percentages, which tend to zero as the dimension increases. As noted in previous work, hard thresholding is overly aggressive. The PDS method is very stable across changes in the dimension and maintains a constant 3.4% false positive rate and 50% true positive rate.
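The simulation design and the two reported percentages can be sketched as follows. This is an illustration under assumptions made here: `tridiagonal_cov` and `fp_tp_percent` are hypothetical names, and the percentages are computed over the off-diagonal entries with $i>j$, with false positives taken relative to the true zeros and true positives relative to the true non-zeros.

```python
import numpy as np


def tridiagonal_cov(d, off=0.3):
    """Covariance with unit diagonal and `off` on the first off-diagonals."""
    return np.eye(d) + off * (np.eye(d, k=1) + np.eye(d, k=-1))


def fp_tp_percent(estimate, Sigma):
    """False and true positive percentages over the entries with i > j."""
    lower = np.tril_indices(Sigma.shape[0], k=-1)
    truth = Sigma[lower] != 0
    found = estimate[lower] != 0
    fp = 100.0 * np.sum(found & ~truth) / np.sum(~truth)
    tp = 100.0 * np.sum(found & truth) / np.sum(truth)
    return fp, tp


# One replication of the Gaussian design with n = 50 and d = 50
rng = np.random.default_rng(0)
Sigma = tridiagonal_cov(50)
X = rng.multivariate_normal(np.zeros(50), Sigma, size=50)
S_emp = np.cov(X, rowvar=False)
```

Passing any sparse estimate built from `S_emp` together with `Sigma` to `fp_tp_percent` yields one replication's pair of percentages; averaging over replications gives figures of the kind tabulated.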

5.2 Multivariate Laplace Data

There are many possible ways to extend the univariate Laplace distribution, also referred to as the double exponential distribution, onto $\mathbb{R}^{d}$ . For the following simulation study, we choose the extension detailed in Eltoft et al. (2006). Namely, let $Z\sim\mathcal{N}\left(0,\sigma^{2}\right)$ and let $V\sim\mathrm{Exponential}\left(1\right)$ . Then, $X=\sqrt{V}Z\sim\mathrm{Laplace}\left(\sigma/\sqrt{2}\right)$ , which has pdf $f(x)=\sqrt{2}\sigma^{-1}\exp(-\sqrt{2}\lvert x\rvert/\sigma)$ and variance $\mathrm{Var}\left(X\right)=\sigma^{2}$ . For the multivariate setting, now let $Z\in\mathbb{R}^{d}$ be multivariate Gaussian with zero mean and covariance $\Sigma$ and, once again, let $V\sim\mathrm{Exponential}\left(1\right)$ . Then, we declare $X=\sqrt{V}Z$ to have a multivariate Laplace distribution with zero mean and covariance $\Sigma$ .
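The scale-mixture construction above translates directly into a sampler. The sketch below is a minimal illustration; the function name `sample_multivariate_laplace` is hypothetical. Since $\mathrm{E}V=1$ and $V$ is independent of $Z$, the covariance of $X=\sqrt{V}Z$ is $\Sigma$.

```python
import numpy as np


def sample_multivariate_laplace(n, Sigma, rng):
    """Draw n vectors X = sqrt(V) * Z with Z ~ N(0, Sigma), V ~ Exponential(1)."""
    d = Sigma.shape[0]
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    V = rng.exponential(scale=1.0, size=n)   # one mixing variable per vector
    return np.sqrt(V)[:, None] * Z


rng = np.random.default_rng(5)
Sigma = np.eye(3) + 0.3 * (np.eye(3, k=1) + np.eye(3, k=-1))
X = sample_multivariate_laplace(200_000, Sigma, rng)
```

With a sample this large, the empirical covariance of `X` should agree with `Sigma` to within a few hundredths.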

Table 2 displays false positive and true positive percentages for seven sparse estimators computed over 100 replications of a random sample of size $n=50$ of $d=50,100,200,500$ dimensional multivariate Laplace data with a tri-diagonal covariance matrix $\Sigma$ whose diagonal entries are 1 and whose off-diagonal entries are 0.3. Similarly to the previous setting, the concentration-based estimator approaches the desired false positive rate—either 1% or 5%—as the dimension increases. All universal thresholding estimators set most of the entries in the matrix to zero when the threshold $\lambda$ is chosen via cross validation. The PDS method is still stable across changes in the dimension but fixates on a much higher false positive rate around 12.5% and a similar true positive rate of 51%.

5.3 Small Round Blue-Cell Tumour Data

Following the same analysis performed in Rothman et al. (2009) and subsequently in Cai and Liu (2011), we will consider the data set resulting from the small round blue-cell tumour (SRBCT) microarray experiment (Khan et al., 2001). The data set consists of a training set of 64 vectors containing 2308 gene expressions. The data contains four types of tumours denoted EWS, BL-NHL, NB, and RMS. As performed in the two previous papers, the genes are ranked by their respective amount of discriminative information according to their $F$ -statistic

[TABLE]

where $\bar{x}$ is the sample mean, $k=4$ is the number of classes, $n=64$ is the sample size, $n_{m}$ is the sample size of class $m$ , and likewise, $\bar{x}_{m}$ and $\hat{\sigma}_{m}^{2}$ are, respectively, the sample mean and variance of class $m$ . The top 40 and bottom 160 scoring genes were selected to provide a mix of the most and least informative genes.
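The gene ranking can be sketched in Python. Because the display defining the statistic is not reproduced in this text, the code assumes the standard one-way analysis of variance form, which uses exactly the quantities listed above: between-class variation divided by $k-1$ over pooled within-class variation divided by $n-k$. The names `f_statistic` and `select_genes` are hypothetical.

```python
import numpy as np


def f_statistic(x, labels):
    """One-way ANOVA F-statistic of a single gene across the classes."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, k = x.size, classes.size
    grand_mean = x.mean()
    between, within = 0.0, 0.0
    for m in classes:
        xm = x[labels == m]
        between += xm.size * (xm.mean() - grand_mean) ** 2
        within += (xm.size - 1) * xm.var(ddof=1)
    return (between / (k - 1)) / (within / (n - k))


def select_genes(expr, labels, n_top=40, n_bottom=160):
    """Indices of the highest and lowest scoring genes (columns of expr)."""
    scores = np.array([f_statistic(expr[:, j], labels)
                       for j in range(expr.shape[1])])
    order = np.argsort(scores)[::-1]          # most informative first
    return np.concatenate([order[:n_top], order[-n_bottom:]])
```

Here `expr` is assumed to hold one sample per row and one gene per column, so that the SRBCT training set would be a $64\times 2308$ array.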

Table 3 displays the results of applying the four threshold estimators with cross validation, the PDS method, and our concentration-based thresholding with the sub-Gaussian formula and with false positive rates of 10, 5, and 1 percent. The percentage of matrix entries that are retained for the most informative $40\times 40$ block and the least informative block are tabulated. Depending on the chosen false positive rate, our concentration-based estimators give similar results to Soft and SCAD thresholding. PDS is the least conservative of the methods as it keeps the most entries. Hard and Adaptive LASSO thresholding are the most aggressive methods.

It is also worth noting that our method is computationally efficient enough to consider the entire $2308\times 2308$ matrix at once. In fact, it took only 131.3 seconds to compute $\hat{\Sigma}^{\mathrm{sp}}$ on an Intel i7-7567U CPU, 3.50GHz. In contrast, the PDS method, which still has significantly faster run times than cross validating the threshold estimators, took over 101 minutes to finish. False positive rates of 5%, 1%, and 0.1% were tested. The fraction of non-zero entries in $\hat{\Sigma}^{\mathrm{sp}}$ was 8.6%, 2.0%, and 0.22%, respectively. For comparison, the fraction of non-zero entries retained by PDS was 17.7%. If such an analysis is meant to lead to follow-up research on specific gene pairings, then culling as many false positives as possible is of critical importance. The sparse covariance estimator was partitioned into $12\times 12$ blocks and the number of non-zero entries was tabulated for each. The results are displayed in Figure 2.

6 Supplementary Material

The supplementary material consists of five sections. The first parallels Section 4.1 and considers sub-exponential measures and bounded random variables as well as some additional simulations for multivariate Rademacher random variables. The second contains proofs of the lemmas and theorems presented in the main article. The third contains additional simulations motivating why our support recovery approach is better than past approaches. The fourth contains derivations of Lipschitz coefficients for the functions used in Section 4. The fifth is expository and contains past results from the concentration of measure literature that were directly used in this work.

Appendix A Sub-Exponential and Bounded Data

In line with our discussion of log concave measures in the main article, we include some information on sub-exponential measures and data that is bounded.

A.1 Sub-Exponential Distributions

Compared with the previously discussed measures with sub-Gaussian concentration, there exists a larger class of measures with sub-exponential concentration. Such measures can be specified as those that satisfy the Poincaré or spectral gap inequality [Bobkov and Ledoux, 1997, Ledoux, 2001, Gozlan, 2010]. For a random variable $X$ on $\mathbb{R}^{d}$ with measure $\mu$ , this is

[TABLE]

for some $C>0$ and for all locally Lipschitz functions $f$ .

If $X$ satisfies such an inequality, then (see Theorem S4.6 or Chapter 5 of Ledoux [2001]) for $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ iid copies of $X$ and for some Lipschitz function $\phi:\mathbb{R}^{d\times n}\rightarrow\mathbb{R}$ ,

[TABLE]

where $K>0$ is a constant depending only on $C$ and

[TABLE]

As in the log concave setting discussed in the main paper, $\phi$ is chosen to be

[TABLE]

which is Lipschitz with constant $n^{-1/2}$ . This results in values of $a^{2}=1$ and $b=n^{-1/2}$ for the above coefficients. Hence, the radius in this setting is computed to be $r_{\alpha}=\max\{-K\log\alpha/\sqrt{n},\sqrt{-K\log\alpha}\}$ . While an optimal (or reasonable) value for $K$ may not be known, it makes little difference given the proposed procedure for choosing $\alpha$ detailed in the main paper for a desired false positive rate. This is because the term $-K\log\alpha$ will be equivalently tuned to determine the optimal size of the constructed confidence set.

As $r_{\alpha}$ in this setting is bounded below by a constant $\sqrt{-K\log\alpha}$ , we do not achieve the nice convergence results as in the log concave setting. However, the dimension-free concentration still allows for good performance in simulation settings as was seen in Section 5.
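The behaviour of this radius is easy to see numerically. The sketch below is illustrative: the function name `radius_sub_exponential` is hypothetical and the default `K=1.0` is a placeholder, since the text notes that a good value of $K$ may not be known.

```python
import math


def radius_sub_exponential(alpha, n, K=1.0):
    """r_alpha = max{-K log(alpha) / sqrt(n), sqrt(-K log(alpha))}."""
    t = -K * math.log(alpha)
    return max(t / math.sqrt(n), math.sqrt(t))
```

For $\alpha=0.05$ and $K=1$, the first term dominates only for very small $n$; once $n$ is moderately large, the radius equals $\sqrt{-K\log\alpha}$ and no longer shrinks as $n$ grows.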

A.2 Bounded Random Variables

In this section, we consider random variables that are bounded in some norm. Consider a Banach space $(B,\lVert\cdot\rVert)$ and a collection of iid random variables $X_{1},\ldots,X_{n}\in B$ such that $\lVert X_{i}\rVert\leq U$ for all $i=1,\ldots,n$ . Given only this assumption, the bounded differences inequality, detailed in the supplementary material and in Section 3.3.4 of Giné and Nickl [2016], can be applied in this specific setting. It provides sub-Gaussian concentration for such random variables.

Specifically, let $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ be iid with $\lVert X_{i}\rVert_{\ell^{2}}\leq U$ for $i=1,\ldots,n$ . Then, for any $p\in[1,\infty]$ , $\lVert X_{i}{X_{i}}^{\mathrm{T}}\rVert_{p}\leq U^{2}$ , and

[TABLE]

This follows immediately from Theorem S4.8.

Hence, for any collection of real valued random vectors bounded in Euclidean norm, the bounded differences inequality can be applied to the empirical estimate for any of the $p$ -Schatten norms. The radius is $r_{\alpha}=U\sqrt{-(1/2n)\log\alpha}$ . However, unlike in the previous setting, the bounds may not necessarily be dimension free.

Example A.1 (Distributions on the Hypercube).

If the components satisfy $\lvert X_{i,j}\rvert\leq 1$ , such as for multivariate uniform or Rademacher random variables, then $U={d}^{1/2}$ . Consequently, $r_{\alpha}=O(\sqrt{d/n})$ is not dimension free. While this makes estimation with respect to operator norm distance challenging, we can still use Theorem 3.1 to fix the false positive rate.
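The dependence on the dimension in this example can be made concrete. The sketch below is illustrative, with hypothetical function names; it writes the logarithmic term as $\log(1/\alpha)$ so that the radius is real for $\alpha\in(0,1)$.

```python
import math


def radius_bounded(alpha, n, U):
    """Radius U * sqrt(log(1/alpha) / (2n)) for vectors with norm at most U."""
    return U * math.sqrt(math.log(1.0 / alpha) / (2.0 * n))


def radius_hypercube(alpha, n, d):
    """Special case |X_ij| <= 1, for which U = sqrt(d)."""
    return radius_bounded(alpha, n, math.sqrt(d))
```

Quadrupling $d$ at fixed $n$ doubles the radius, which is the $\sqrt{d/n}$ scaling noted above.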

A.3 Simulations on High Dimensional Binary Vectors

Random binary vectors fall into the category of bounded random variables, which have sub-Gaussian concentration as a consequence of the bounded differences inequality, an extension of Hoeffding’s inequality, as discussed in Section A.2. The result is a slightly different form for the confidence balls compared with the log concave setting. While the concentration behaviour in this setting relies on the dimension and is poor for producing an estimator that is close in operator or Hilbert-Schmidt norm, our support recovery methodology is still able to perform well in this setting.

Table 4 displays false positive and true positive percentages for seven sparse estimators computed over 100 replications of a random sample of size $n=50$ of $d=50,100,200,500$ dimensional multivariate Rademacher data with a tri-diagonal covariance matrix $\Sigma$ whose diagonal entries are 1 and whose off-diagonal entries are 0.3. As a consequence of the bounded differences inequality, this case also exhibits sub-Gaussian behaviour, and Table 4 is therefore similar to Table 1 from the main article: concentration estimators perform better as $d$ increases, threshold estimators are overly aggressive as $d$ increases, and the PDS method’s support recovery is unaffected by the change in $d$ .

Appendix B Proofs

Proof of Lemma 3.1.

We begin with the collection of $N=d(d-1)/2$ random variables $\hat{\sigma}_{i,j}=n^{-1}\sum_{k=1}^{n}X_{k,i}X_{k,j}$ , which we will denote $Z_{1},\ldots,Z_{N}$ . Without loss of generality, assume that $Z_{1},\ldots,Z_{N_{0}}$ have mean zero and $Z_{N_{0}+1},\ldots,Z_{N_{1}+N_{0}}$ have nonzero mean and $N=N_{0}+N_{1}$ . To achieve $\eta$ false positives, we would find the index $k_{0}$ corresponding to the $\lfloor(1-\eta)N_{0}\rfloor$ order statistic of the $Z_{1},\ldots,Z_{N_{0}}$ , and set all entries $\lvert Z_{i}\rvert\leq\lvert Z_{k_{0}}\rvert$ to zero. Instead, we find the index $\hat{k}$ corresponding to the ${\lfloor(1-\eta)N\rfloor}$ order statistic of all the $Z_{i}$ .

Given that $\Sigma\in\mathcal{U}(\kappa,\delta)$ , we have

[TABLE]

Thus, when comparing the achieved false positive rate $\#\{\lvert Z_{i}\rvert<\lvert Z_{\hat{k}}\rvert|i\leq N_{0}\}/N_{0}$ to the target rate $\#\{\lvert Z_{i}\rvert<\lvert Z_{k_{0}}\rvert|i\leq N_{0}\}/N_{0}$ , we have

[TABLE]

∎

Proof of Lemma 3.4.

This proof follows from the result of Latała [2005] Theorem 2—also found in Theorem 2.3.8 of Tao [2012]—without the assumption of iid entries in the random matrix but with many entries equal to zero.

We first apply the expectation with respect to $\mathcal{E}$ and use the result from Latała [2005].

[TABLE]

with $K_{1},K_{2}$ universal constants. For the second term in the above equation, we have via Jensen’s inequality and the fact that $\lvert a_{i,j}\rvert\leq 1$ that

[TABLE]

For the first term in the above equation, we make use of the fact that $\lvert a_{i,j}\rvert\leq 1$ and that only a proportion $\rho$ of the entries are non-zero, resulting in

[TABLE]

Combining the above results and updating the constants $K_{1},K_{2}$ as necessary gives the desired result

[TABLE]

∎

Proof of Theorem 3.1.

Without loss of generality, we can normalize $\hat{\Sigma}^{\mathrm{emp}}$ such that the diagonal entries are 1. Thus $\hat{\Sigma}^{\mathrm{emp}}_{0}=I_{d}$ , the $d$ dimensional identity matrix, and the off-diagonal entries of all matrices considered will be bounded in absolute value by one.

For the empirical covariance estimator, $\lVert\hat{\Sigma}^{\mathrm{emp}}-\hat{\Sigma}^{\mathrm{emp}}_{0}\rVert_{\infty}=\lVert\hat{\Sigma}^{\mathrm{emp}}\rVert_{\infty}-1$ . We can decompose $\hat{\Sigma}^{\mathrm{emp}}$ into three parts: the diagonal of ones; the off-diagonal terms corresponding to $\sigma_{i,j}\neq 0$ ; and the off-diagonal terms corresponding to $\sigma_{i,j}=0$ . The number of non-zero off-diagonal terms is bounded in each row/column by $\kappa$ . Hence,

[TABLE]

where $\hat{\Sigma}^{\mathrm{emp}}_{=0}$ has entries $\hat{\sigma}_{i,j}$ such that $\mathrm{E}\hat{\sigma}_{i,j}=0$ .

Let the entrywise or Hadamard product of two matrices $A$ and $B$ of the same dimensions be $A\circ B$ with $ij$ th entry $a_{i,j}b_{i,j}$ . For ease of notation, we denote $\Pi_{0}=\hat{\Sigma}^{\mathrm{emp}}_{=0}$ . Let $\Pi_{1}$ be the result of randomly removing half of the entries from $\Pi_{0}$ , that is, $\Pi_{1}=\Pi_{0}\circ B$ where $B\in\{0,1\}^{d\times d}$ is a symmetric random matrix with iid $\mathrm{Bernoulli}\left(1/2\right)$ entries. Considering the corresponding symmetric Rademacher random matrix, $\mathcal{E}=2B-1$ , we then have that

[TABLE]

where the $\pm$ comes from the symmetry of $\mathcal{E}$ . Thus,

[TABLE]
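The masking construction can be checked numerically. The sketch below is illustrative only: the helper `symmetric_bernoulli_mask` and the random symmetric stand-in for $\Pi_{0}$ are assumptions, not objects from the paper. It verifies the entrywise identity $\Pi_{0}\circ B=\frac{1}{2}(\Pi_{0}+\Pi_{0}\circ\mathcal{E})$, which follows from $B=(1+\mathcal{E})/2$, and then applies several independent masks in succession.

```python
import numpy as np

def symmetric_bernoulli_mask(d, rng):
    """Symmetric 0/1 matrix whose upper-triangular entries are iid Bernoulli(1/2)."""
    U = np.triu(rng.integers(0, 2, size=(d, d)))   # upper triangle incl. diagonal
    return U + np.triu(U, k=1).T                   # mirror the strict upper part

rng = np.random.default_rng(1)
d = 6
A = rng.standard_normal((d, d))
Pi0 = (A + A.T) / 2                                # hypothetical stand-in for Pi_0
B = symmetric_bernoulli_mask(d, rng)
E = 2 * B - 1                                      # symmetric Rademacher matrix
Pi1 = Pi0 * B                                      # Hadamard product Pi_0 o B

# Since B = (1 + E) / 2 entrywise, Pi_1 = (Pi_0 + Pi_0 o E) / 2.
identity_holds = np.allclose(Pi1, 0.5 * (Pi0 + Pi0 * E))

# Applying m independent masks keeps each entry with probability 2^{-m}.
m = 3
mask = np.ones((d, d), dtype=int)
for _ in range(m):
    mask *= symmetric_bernoulli_mask(d, rng)
Pim = Pi0 * mask
```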

This idea can be iterated. Let $\Pi_{m}=\Pi_{0}\circ B_{1}\circ\ldots\circ B_{m}$ with the $B_{i}$ iid copies of $B$ from before. Then, similarly,

[TABLE]

Moreover,

[TABLE]

Applying Lemma 3.5 $m$ times and updating universal constants $K_{1},K_{2}$ as necessary results in

[TABLE]

Thus, for $\rho=2^{-m}$ , we have

[TABLE]

We want to replace the $\Pi_{m}$ with $\hat{\Sigma}^{\mathrm{emp}}_{\rho}-\hat{\Sigma}^{\mathrm{emp}}_{0}$ and similarly for $\Pi_{0}$ . The off-diagonal entries such that $\sigma_{i,j}\neq 0$ can contribute at most $\kappa=o(d^{\nu})$ , $\nu<1$ , to the operator norm. Hence,

[TABLE]

We lastly apply the crude—but effective in the non-asymptotic setting—bound $\lVert\hat{\Sigma}^{\mathrm{emp}}\rVert_{\infty}\geq d/n$ almost surely. Dividing by $\mathrm{E}\lVert\hat{\Sigma}^{\mathrm{emp}}-\hat{\Sigma}^{\mathrm{emp}}_{0}\rVert_{\infty}$ results in

[TABLE]

Thus, we require $\nu<1/2$ to make the final term negligible for large $d$ with respect to the others.

We can extend this result to arbitrary $\rho\in(0,0.5]$ by using the simple observation that given such a $\rho$ , there exists an $a\in\mathbb{Z}^{+}$ such that $2^{a}\rho\in[0.5,1)$ . Therefore, setting $\eta=2^{a}\rho$ and replacing $\hat{\Sigma}^{\mathrm{emp}}$ with the corresponding matrix $\hat{\Sigma}^{\mathrm{emp}}_{\eta}$ from Lemma 3.1 allows us to proceed as above. ∎
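As a small worked example of the dyadic rescaling step, the exponent $a$ and the rescaled value $\eta=2^{a}\rho$ can be read off from the binary mantissa and exponent of $\rho$. The function name `doubling_exponent` is hypothetical, and this sketch returns $a=0$ at the boundary case $\rho=0.5$.

```python
import math

def doubling_exponent(rho):
    """Return (a, eta) with eta = 2**a * rho in [0.5, 1), for 0 < rho < 1."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    eta, e = math.frexp(rho)      # rho == eta * 2**e with 0.5 <= eta < 1
    return -e, eta

a, eta = doubling_exponent(0.1)   # 0.1 * 2**3 = 0.8
```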

Proof of Theorem 2.

From the derivation in Section 2, we have that

[TABLE]

for any $\hat{\Sigma}^{\mathrm{sp}}$ such that $\lVert\hat{\Sigma}^{\mathrm{sp}}-\hat{\Sigma}^{\mathrm{emp}}\rVert\leq r_{\alpha}$ . Writing $Z=\left\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\right\rVert_{p}$ and $Y=\left\lVert\hat{\Sigma}^{\mathrm{sp}}-\Sigma\right\rVert_{p}$ and squaring and rearranging the terms gives,

[TABLE]

Given the standard convergence result for the empirical covariance matrix that $\mathrm{E}\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\rVert_{p}=O(n^{-1/2})$ and our definition of $r_{\alpha}=O(n^{-1/2}\sqrt{-\log\alpha})$ , we now have that

[TABLE]

which holds for any $\hat{\Sigma}^{\mathrm{sp}}$ such that $\lVert\hat{\Sigma}^{\mathrm{sp}}-\hat{\Sigma}^{\mathrm{emp}}\rVert\leq r_{\alpha}$ . ∎

Proof of Theorem 3.

Let $\hat{\Sigma}^{\star}=\{\hat{\sigma}_{i,j}\bm{1}\!\left[\sigma_{i,j}\neq 0\right]\}$ be the result of a perfect thresholding of the empirical covariance estimator. That is, $\hat{\Sigma}^{\star}$ has support identical to the true $\Sigma$ and non-zero entries that coincide with $\hat{\Sigma}^{\mathrm{emp}}$ . Furthermore, let $\tilde{\Sigma}$ be some other overly-sparse covariance estimator resulting from zeroing entries in $\hat{\Sigma}^{\mathrm{emp}}$ , but with more zeros than $\Sigma$ . For a radius $r_{\alpha}$ , $\hat{\Sigma}^{\mathrm{sp}}$ is the sparsest element in the corresponding confidence ball.

[TABLE]

which, assuming a large enough sample size $n$ , are the two mutually exclusive events that the estimator with correct support $\hat{\Sigma}^{\star}$ is not in the ball of radius $r_{\alpha}$ and that a sparser estimator $\tilde{\Sigma}$ is in the ball.

For the first term in Equation B.1, we show that the probability of a matrix with the correct support lying outside of the confidence set tends to zero.

[TABLE]

For $(\mathrm{II})$ , we have that $\mathrm{E}\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\rVert=O(n^{-1/2})$ and that $r_{\alpha}^{2}=O(-n^{-1}\log\alpha)$ . Let $Z=\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\rVert_{\infty}^{1/2}$ for simplicity of notation. Then, using the concentration result already established for Lipschitz functions of log concave measures,

[TABLE]

for some positive $C=o(1)$ .

For $(\mathrm{I})$ , applying the Gershgorin circle theorem [Iserles, 2009] to the operator norm gives

[TABLE]

where $\mathrm{supp_{\text{col}}}(\Sigma)=\max_{j=1,\ldots,d}\lvert\{(i,j):\sigma_{i,j}\neq 0\}\rvert$ is the maximal number of non-zero entries in any given column. From Proposition D.5, we have that $\lVert\hat{\Sigma}^{\mathrm{emp}}-\Sigma\rVert_{2}^{1/2}$ is Lipschitz with constant $n^{-1/2}$ . As the squared Frobenius norm is equal to the sum of the squares of the entries of the matrix, we in turn have that the entries $\lvert\hat{\sigma}_{i,j}-\sigma_{i,j}\rvert^{1/2}$ are also Lipschitz with constant $n^{-1/2}$ . As the maximum of $d^{2}$ Lipschitz functions is still Lipschitz, we get similarly to case $(\mathrm{II})$ that $(\mathrm{I})\leq C\alpha^{\varepsilon}$ for some $\varepsilon>0$ .
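The Gershgorin step can be illustrated numerically. For a symmetric matrix, every eigenvalue lies in a disc centred at a diagonal entry with radius equal to the sum of the absolute off-diagonal entries of that row, so the operator norm is at most the largest absolute row sum, which in turn is at most the maximal column support times the largest absolute entry. The banded test matrix below is a hypothetical example, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d, kappa = 50, 3

# Symmetric banded matrix: at most 2*kappa + 1 non-zero entries per column.
A = rng.uniform(-1.0, 1.0, size=(d, d))
A = (A + A.T) / 2
i, j = np.indices((d, d))
A[np.abs(i - j) > kappa] = 0.0

op_norm = np.linalg.norm(A, 2)                  # largest singular value
max_row_sum = np.abs(A).sum(axis=1).max()       # Gershgorin centre plus radius
supp_col = np.count_nonzero(A, axis=0).max()    # max non-zeros in any column
entry_bound = supp_col * np.abs(A).max()        # support size times max entry
```

The chain `op_norm <= max_row_sum <= entry_bound` holds for any symmetric matrix, which is what lets a bound on the maximal entrywise deviation control the operator norm of a sparse difference.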

For the second term in Equation B.1, we show that the probability of any sparser matrix than $\hat{\Sigma}^{\star}$ existing in the confidence ball goes to zero. Let $\mathrm{supp}(\tilde{\Sigma})\subset\mathrm{supp}(\Sigma)$ . Then, there exists a pair of indices $(i_{0},j_{0})\in\mathrm{supp}(\Sigma)$ such that $(i_{0},j_{0})\notin\mathrm{supp}(\tilde{\Sigma})$ .

[TABLE]

We have that if $\sigma_{i,j}\neq 0$ then $\lvert\sigma_{i,j}\rvert>\delta>0$ . Hence, $\hat{\sigma}_{i_{0},j_{0}}=(\hat{\sigma}_{i_{0},j_{0}}-{\sigma}_{i_{0},j_{0}})+{\sigma}_{i_{0},j_{0}}\geq o_{p}(n^{-1/2})+\delta.$ Meanwhile, $r_{\alpha}=O(n^{-1/2})$ . Thus, $\mathrm{P}\left(\hat{\sigma}_{i_{0},j_{0}}\leq r_{\alpha}^{2}\right)\rightarrow 0$ as $n\rightarrow\infty$ as long as $\delta=o(n^{-1})$ . ∎

Appendix C Estimation with the Empirical Diagonal

In this section, we demonstrate that the distance in operator norm is an insufficient metric to use for the comparison of estimators for large sparse covariance matrices in the non-asymptotic setting. The operator norm’s usage in past research [Bickel and Levina, 2008a, b, El Karoui, 2008, Rothman et al., 2009] stems from the result that “convergence in operator norm implies convergence of the eigenvalues and eigenvectors.” However, this does not imply strong performance for finite samples. We demonstrate this by showing that the naive empirical diagonal covariance matrix (that is, the estimator $\hat{\Sigma}^{\mathrm{diag}}$ with $\hat{\Sigma}^{\mathrm{diag}}_{i,j}=\hat{\Sigma}^{\mathrm{emp}}_{i,j}$ if $i=j$ and $\hat{\Sigma}^{\mathrm{diag}}_{i,j}=0$ otherwise) can perform better in operator norm for finite samples.

The simulation study from Rothman et al. [2009] was reproduced where four threshold estimators—hard, soft, SCAD, and adaptive LASSO—were applied to estimating the covariance matrix for a sample of $n=100$ random normal vectors in dimensions $d=30,100,200,500$ for three different models. We consider models 1 and 2, which respectively are autoregressive covariance matrices with entries $\sigma_{i,j}=\rho^{\lvert i-j\rvert}$ and moving average covariance matrices with entries $\sigma_{i,j}=\rho\bm{1}_{\lvert i-j\rvert=1}+\bm{1}_{i=j}$ . In both cases, we set $\rho=0.3$ . The simulations were replicated 100 times and averaged. The results are displayed in Table 5 for multivariate Gaussian data and in Table 6 for multivariate Laplace data.
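A minimal sketch of this kind of comparison is given below for the moving average model with $\rho=0.3$, $n=100$ and $d=200$. It is not the simulation code behind Tables 5 and 6: it runs a single replication rather than 100, it compares the empirical diagonal only against the unthresholded empirical covariance, and the function names are hypothetical.

```python
import numpy as np

def ma_covariance(d, rho):
    """Moving average model: ones on the diagonal, rho on the first off-diagonals."""
    return np.eye(d) + rho * (np.eye(d, k=1) + np.eye(d, k=-1))

def operator_norm_errors(n, d, rho, rng):
    """Operator norm errors of the empirical and empirical-diagonal estimators."""
    Sigma = ma_covariance(d, rho)
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    S_emp = X.T @ X / n                   # empirical covariance, mean known to be 0
    S_diag = np.diag(np.diag(S_emp))      # empirical diagonal estimator

    def err(S):
        return np.linalg.norm(S - Sigma, 2)

    return err(S_emp), err(S_diag)

rng = np.random.default_rng(3)
err_emp, err_diag = operator_norm_errors(n=100, d=200, rho=0.3, rng=rng)
```

With $d>n$ the unthresholded empirical covariance is far from $\Sigma$ in operator norm, whereas the diagonal estimator's error is dominated by $\lVert\Sigma-I_{d}\rVert_{\infty}<2\rho$ plus the fluctuation of the estimated variances.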

For multivariate Gaussian data, we see that SCAD thresholding gives superior performance in operator norm distance until $d=200$ where it gives comparable performance to the empirical diagonal matrix. At $d=500$ , the empirical diagonal now gives the best performance. In the case of multivariate Laplace data, the empirical diagonal outperforms all of the thresholding methods in all of the dimensions considered with respect to operator norm distance. It is worth noting that theoretical results for these threshold estimators were only demonstrated for sub-Gaussian data.

We understand that the performance of the threshold estimators improves asymptotically with increasing $n$ whereas the empirical diagonal will perform worse in the limit. The main point to make is that for fixed finite samples, as generally occur in practice, it is unwise to claim an estimator’s superiority based solely on the operator norm distance. Hence, we argue instead for support recovery of the true covariance matrix as the critical problem to solve in the context of high dimensional sparse covariance estimation.

Appendix D Derivations of Lipschitz constants

The following lemmas and propositions establish that specific functions used in the construction of confidence sets are, in fact, Lipschitz functions.

Lemma D.1.

Let $A$ and $B$ be two $d\times d$ real valued symmetric non-negative definite matrices. Then,

[TABLE]

where $\lVert\cdot\rVert_{1}$ is the trace class norm.

Proof.

By definition, $\lVert A\rVert_{1}=\mathrm{tr}\left((A^{*}A)^{1/2}\right)$ . If $A$ is symmetric and non-negative definite, then $(A^{*}A)^{1/2}=A$ and hence $\lVert A\rVert_{1}=\mathrm{tr}(A)$ . Moreover, if $A$ and $B$ are symmetric and non-negative definite, then so is $A+B$ . Therefore,

[TABLE]

∎
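The fact the proof relies on, that the trace class norm of a symmetric non-negative definite matrix equals its trace and is therefore additive over such matrices, can be checked numerically. The matrices below are arbitrary simulated examples, and the sketch does not reproduce the displayed statement of the lemma.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
G, H = rng.standard_normal((2, d, d))
A, B = G @ G.T, H @ H.T                     # symmetric non-negative definite

def nuc(M):
    """Trace class (nuclear) norm: the sum of the singular values."""
    return np.linalg.norm(M, "nuc")

trace_matches = np.isclose(nuc(A), np.trace(A))
additive = np.isclose(nuc(A + B), nuc(A) + nuc(B))

# Additivity fails without non-negative definiteness: A + (-A) = 0.
not_additive_in_general = nuc(A - A) < nuc(A) + nuc(-A)
```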

Proposition D.2 (Lipschitz for $p=1$ ).

Assume that $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ and that $\mathrm{E}X_{i}=0$ for $i=1,\ldots,n$ . The function $\phi:\mathbb{R}^{d\times n}\rightarrow\mathbb{R}$ defined as

[TABLE]

is Lipschitz with constant $n^{-1/2}$ with respect to the metric $d_{(2,2)}({\bf X},{\bf Y})=\left(\sum_{i=1}^{n}\left\lVert X_{i}-Y_{i}\right\rVert_{\ell^{2}}^{2}\right)^{1/2}.$

Proof.

Let $X_{1},\ldots,X_{n},Y_{1},\ldots,Y_{n}\in\mathbb{R}^{d}$ with $\mathrm{E}X_{i}=\mathrm{E}Y_{i}=0$ for all $i$ and denote ${\bf X}=(X_{1},\ldots,X_{n})$ and ${\bf Y}=(Y_{1},\ldots,Y_{n})$ . Making use of Lemma D.1, we have

[TABLE]

∎

The next two lemmas are used to prove the Lipschitz constant for the $p$ -Schatten norms with $p=2$ and $p=\infty$ , respectively. The first lemma is reminiscent of the Cauchy-Schwarz inequality in the setting of the $2$ -Schatten norm.

Lemma D.3.

Let $X_{1},\ldots,X_{n},Y_{1},\ldots,Y_{n}\in\mathbb{R}^{d}$ . Then, for the Frobenius norm,

[TABLE]

Proof.

For any matrix $M\in\mathbb{R}^{d\times d}$ , we have that $\lVert M\rVert_{2}^{2}=\mathrm{tr}\left(M{M}^{\mathrm{T}}\right)$ . Hence, starting from the left hand side of the desired inequality and applying the Cauchy-Schwarz inequality gives us

[TABLE]

∎

Lemma D.4.

Let $X_{1},\ldots,X_{n},Y_{1},\ldots,Y_{n}\in\mathbb{R}^{d}$ . Then, for the operator norm,

[TABLE]

Proof.

Using the definition of the operator norm and the Cauchy-Schwarz inequality, we have that

[TABLE]

∎

Proposition D.5 (Lipschitz for $p=2$ or $p=\infty$ ).

Assume that $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ and that $\mathrm{E}X_{i}=0$ for $i=1,\ldots,n$ . Let $p\in\{2,\infty\}$ . The function $\phi:\mathbb{R}^{d\times n}\rightarrow\mathbb{R}$ defined as

[TABLE]

is Lipschitz with constant $n^{-1/2}$ with respect to the metric $d_{(2,2)}({\bf X},{\bf Y})=\left(\sum_{i=1}^{n}\left\lVert X_{i}-Y_{i}\right\rVert_{\ell^{2}}^{2}\right)^{1/2}.$

Proof.

To establish that $\phi$ is Lipschitz with the desired constant, we proceed by bounding the Gâteaux derivative. Let $p\in\{2,\infty\}$ . For $h\in\mathbb{R}$ and any $X_{1},\ldots,X_{n},Y_{1},\ldots,Y_{n}\in\mathbb{R}^{d}$ such that $\lVert\sum_{i=1}^{n}X_{i}{X_{i}}^{\mathrm{T}}\rVert_{p}\neq 0$ and $\lVert\sum_{i=1}^{n}Y_{i}{Y_{i}}^{\mathrm{T}}\rVert_{p}\neq 0$ ,

[TABLE]

where we used the facts that, for $M\in\mathbb{R}^{d\times d}$ , $\lVert M\rVert_{p}=\lVert{M}^{\mathrm{T}}\rVert_{p}$ , that

[TABLE]

and that

[TABLE]

Applying Lemma D.3 in the $p=2$ case and Lemma D.4 in the $p=\infty$ case shows that $\sqrt{n}d\phi(\cdot)\leq 1$ for all $X_{i}$ with $\left\lVert\sum_{i=1}^{n}X_{i}{X_{i}}^{\mathrm{T}}\right\rVert_{p}\neq 0$ . With application of the Mean Value Theorem, we have the desired Lipschitz constant.

In the case that $\left\lVert\sum_{i=1}^{n}X_{i}{X_{i}}^{\mathrm{T}}\right\rVert_{p}=0$ , we also achieve the same Lipschitz constant. Indeed, as $X_{i}{X_{i}}^{\mathrm{T}}$ is positive semi-definite, the norm can only be zero if all $X_{i}={(0,\ldots,0)}^{\mathrm{T}}$ . Hence, for any $Y_{1},\ldots,Y_{n}\in\mathbb{R}^{d}$ ,

[TABLE]

∎
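The Lipschitz constant can be probed numerically. The displayed definition of $\phi$ is not reproduced here, so the sketch assumes the form $\phi({\bf X})=\lVert n^{-1}\sum_{i=1}^{n}X_{i}{X_{i}}^{\mathrm{T}}\rVert_{p}^{1/2}$; the function names and the simulated perturbations are likewise hypothetical. It records the largest observed value of $\sqrt{n}\,\lvert\phi({\bf X})-\phi({\bf Y})\rvert/d_{(2,2)}({\bf X},{\bf Y})$ for $p=2$ and $p=\infty$.

```python
import numpy as np

def phi(X, p):
    """Assumed form: || n^{-1} sum_i X_i X_i^T ||_p^{1/2}, with X of shape (n, d)."""
    n = X.shape[0]
    S = X.T @ X / n
    ord_ = {2: "fro", np.inf: 2}[p]         # Frobenius norm or operator norm
    return np.sqrt(np.linalg.norm(S, ord_))

def d22(X, Y):
    """d_(2,2)(X, Y) = (sum_i ||X_i - Y_i||_2^2)^{1/2}."""
    return np.linalg.norm(X - Y)

rng = np.random.default_rng(5)
n, d = 20, 8
worst = {}
for p in (2, np.inf):
    ratios = []
    for _ in range(500):
        X = rng.standard_normal((n, d))
        Y = X + rng.standard_normal((n, d)) * rng.uniform(0.01, 2.0)
        ratios.append(abs(phi(X, p) - phi(Y, p)) / d22(X, Y))
    worst[p] = max(ratios) * np.sqrt(n)     # a constant of n^{-1/2} means <= 1
```

A random search of this kind cannot prove the bound; it only fails to find a counterexample, which is consistent with the stated constant $n^{-1/2}$.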

It is conjectured that the function $\phi(\cdot)$ is 1-Lipschitz for all $p\in[1,\infty]$ , which follows immediately if Lemmas D.3 and D.4 can be expanded to similar results for all $p\in[1,\infty]$ .

Appendix E Concentration Results

The following is a brief expository section detailing results used and the associated references for the various concentration of measure tools used throughout this work. More details on these topics can be found in Ledoux [2001], Boucheron et al. [2013], Giné and Nickl [2016].

E.1 Concentration results for log concave measures

Gaussian concentration for log concave measures is established via the following theorems. In short, Theorem E.2 states that log concave measures satisfy a logarithmic Sobolev inequality, which bounds the entropy of the measure; see Definition E.1. Logarithmic Sobolev inequalities were first introduced in Gross [1975], and this result is due to Bakry and Émery [1984]. Following that, Theorem E.3 links the logarithmic Sobolev inequality with Gaussian concentration. Finally, Theorem E.4 and Corollary E.5 extend this Gaussian concentration to product measures whose individual components satisfy logarithmic Sobolev inequalities in a dimension-free way due to the subadditivity of the entropy.

Definition E.1 (Entropy).

For a probability measure $\mu$ on a measurable space $(\Omega,\mathcal{F})$ and for any non-negative measurable function $f$ on $(\Omega,\mathcal{F})$ , the entropy is

[TABLE]
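For intuition, the entropy functional can be evaluated for a discrete measure. The sketch assumes the standard definition $\mathrm{Ent}_{\mu}(f)=\int f\log f\,d\mu-\left(\int f\,d\mu\right)\log\left(\int f\,d\mu\right)$ with the convention $0\log 0=0$; the function name and the example values are hypothetical.

```python
import numpy as np

def entropy(f, mu):
    """Ent_mu(f) = E[f log f] - E[f] log E[f] for a discrete probability vector mu."""
    f = np.asarray(f, dtype=float)
    mu = np.asarray(mu, dtype=float)
    safe = np.where(f > 0, f, 1.0)                 # avoids log(0); 0 log 0 := 0
    flogf = np.where(f > 0, f * np.log(safe), 0.0)
    mean = np.sum(mu * f)
    mean_term = mean * np.log(mean) if mean > 0 else 0.0
    return np.sum(mu * flogf) - mean_term

mu = np.array([0.2, 0.3, 0.5])
ent_const = entropy([2.0, 2.0, 2.0], mu)           # constant f
ent_var = entropy([0.5, 1.0, 4.0], mu)             # non-constant f
```

The entropy vanishes for constant $f$ and is strictly positive otherwise, by Jensen's inequality applied to the convex function $x\mapsto x\log x$.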

Theorem E.2 (Ledoux [2001], Theorem 5.2).

Let $\mu$ be strongly log-concave on $\mathbb{R}^{d}$ for some $c>0$ . Then, $\mu$ satisfies the logarithmic Sobolev inequality. That is, for all smooth $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ ,

[TABLE]

Theorem E.3 (Ledoux [2001], Theorem 5.3).

If $\mu$ is a probability measure on $\mathbb{R}^{d}$ such that $\mathrm{Ent}_{\mu}\left(f^{2}\right)\leq\frac{2}{c}\int\lvert\nabla f\rvert^{2}d\mu,$ then $\mu$ has Gaussian concentration. That is, let $X\in\mathbb{R}^{d}$ be a random variable with law $\mu$ . Then, for all $1$ -Lipschitz functions $\phi:\mathbb{R}^{d}\rightarrow\mathbb{R}$ and for all $r>0$ ,

[TABLE]

Theorem E.4 (Ledoux [2001], Corollary 5.7).

Let $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ be random variables with measures $\mu_{1},\ldots,\mu_{n}$ , which are all strongly log-concave with coefficients $c_{1},\ldots,c_{n}$ . Let $\nu=\mu_{1}\otimes\ldots\otimes\mu_{n}$ be the product measure on $\mathbb{R}^{d\times n}$ . Then,

[TABLE]

Combining Theorems E.4 and E.3 immediately gives the following corollary.

Corollary E.5.

Let $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ have measures $\mu_{1},\ldots,\mu_{n}$ , which are all strongly log-concave with coefficients $c_{1},\ldots,c_{n}$ . Let $\nu=\mu_{1}\otimes\ldots\otimes\mu_{n}$ be the product measure on $\mathbb{R}^{d\times n}$ . Then, for any $1$ -Lipschitz $\phi:(\mathbb{R}^{d})^{n}\rightarrow\mathbb{R}$ and for any $r>0$ ,

[TABLE]
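A Monte Carlo check illustrates what Gaussian concentration means in the simplest case. The sketch assumes the standard one-sided form $\mathrm{P}\left(\phi(X)\geq\mathrm{E}\phi(X)+r\right)\leq e^{-cr^{2}/2}$, uses the standard Gaussian measure (for which $c=1$) with the $1$-Lipschitz function $\phi(x)=\lVert x\rVert_{\ell^{2}}$, and replaces $\mathrm{E}\phi(X)$ with a sample mean. The helper `radius` inverts the assumed bound and is not necessarily the paper's $r_{\alpha}$.

```python
import numpy as np

rng = np.random.default_rng(6)
d, reps = 10, 200_000
X = rng.standard_normal((reps, d))          # standard Gaussian, c = 1
vals = np.linalg.norm(X, axis=1)            # phi(x) = ||x||_2 is 1-Lipschitz
centre = vals.mean()                        # Monte Carlo proxy for E phi(X)

radii = np.array([0.5, 1.0, 1.5, 2.0])
empirical_tail = np.array([(vals >= centre + r).mean() for r in radii])
gaussian_bound = np.exp(-radii**2 / 2)      # exp(-c r^2 / 2) with c = 1

def radius(alpha, c=1.0):
    """Smallest r with exp(-c r^2 / 2) <= alpha (assumed form of the bound)."""
    return np.sqrt(-2.0 * np.log(alpha) / c)
```

The bound does not depend on the dimension $d$. Inverting it at level $\alpha$ gives a radius proportional to $\sqrt{-\log\alpha}$, and rescaling by a Lipschitz constant of $n^{-1/2}$ yields the order $n^{-1/2}\sqrt{-\log\alpha}$.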

E.2 Concentration results for sub-exponential measures

If the log Sobolev inequality from above is replaced with the weaker spectral gap or Poincaré inequality, then we have the sub-exponential measures.

Theorem E.6 (Ledoux [2001], Corollary 5.15).

Let $X$ , a random variable on $\mathbb{R}^{d}$ with measure $\mu$ , satisfy the Poincaré inequality

[TABLE]

for some $C>0$ and for all locally Lipschitz functions $f$ . Then, for $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ iid copies of $X$ and for some Lipschitz function $\phi:\mathbb{R}^{d\times n}\rightarrow\mathbb{R}$ ,

[TABLE]

where $K>0$ is a constant depending only on $C$ and

[TABLE]

E.3 Concentration results for bounded random variables

The following results can be found in more depth in Giné and Nickl [2016] Section 3.3.4 and specifically in Example 3.3.13 (a). Theorem E.8 below is effectively a more general version of Hoeffding’s Inequality. To establish it, we begin with the definition of functions of bounded differences.

Definition E.7 (Functions of Bounded Differences).

A function $f:\mathbb{R}^{d\times n}\rightarrow\mathbb{R}$ is of bounded differences if

[TABLE]

Then, Gaussian concentration can be established for functions of bounded differences by the following theorem.

Theorem E.8.

Let $X_{1},\ldots,X_{n}\in\mathbb{R}^{d}$ and $Z=f(X_{1},\ldots,X_{n})$ where $f$ has bounded differences with $c=\sum_{i=1}^{n}c_{i}$ . Then, for all $r>0$ ,

[TABLE]
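The bounded differences inequality can be illustrated with the sample mean of $[0,1]$-valued variables, where changing a single input moves the function by at most $c_{i}=1/n$. The sketch assumes the standard McDiarmid form $\mathrm{P}(Z-\mathrm{E}Z\geq r)\leq\exp\left(-2r^{2}/\sum_{i=1}^{n}c_{i}^{2}\right)$, whose notation may differ from the displayed statement; in this special case it reduces to Hoeffding's inequality.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 50, 100_000

# f = sample mean of [0, 1]-valued inputs: one changed input moves f by <= 1/n.
c_i = np.full(n, 1.0 / n)
X = rng.integers(0, 2, size=(reps, n))      # iid Bernoulli(1/2) inputs
Z = X.mean(axis=1)                          # E Z = 1/2

radii = np.array([0.05, 0.10, 0.15])
empirical_tail = np.array([(Z - 0.5 >= r).mean() for r in radii])
mcdiarmid_bound = np.exp(-2.0 * radii**2 / np.sum(c_i**2))
```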

Bibliography

Bakry and Émery [1984] Dominique Bakry and Michel Émery. Hypercontractivité de semi-groupes de diffusion. Comptes rendus des séances de l’Académie des sciences. Série 1, Mathématique, 299(15):775–778, 1984.
Bickel and Levina [2008a] Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577–2604, 2008a.
Bickel and Levina [2008b] Peter J Bickel and Elizaveta Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, pages 199–227, 2008b.
Bien and Tibshirani [2012] Jacob Bien and Rob Tibshirani. spcov: Sparse Estimation of a Covariance Matrix, 2012. URL https://CRAN.R-project.org/package=spcov. R package version 1.01.
Bien and Tibshirani [2011] Jacob Bien and Robert J Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820, 2011.
Bobkov and Ledoux [1997] Sergey Bobkov and Michel Ledoux. Poincaré’s inequalities and Talagrand’s concentration phenomenon for the exponential distribution. Probability Theory and Related Fields, 107(3):383–400, 1997.
Boucheron et al. [2013] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
Cai and Liu [2011] Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association, 106(494):672–684, 2011.
