Testing independence with high-dimensional correlated samples

Xi Chen; Weidong Liu

arXiv:1703.08843·math.ST·March 28, 2017

TL;DR

This paper introduces a simple, tuning-free test for independence among high-dimensional correlated samples, proves its optimality, and applies it to improve large-scale multiple testing of correlations.

Contribution

It proposes a novel independence test statistic for high-dimensional data with correlated samples and develops a de-correlation method for more accurate multiple testing.

Findings

1. The test statistic has a known limiting null distribution.
2. The method achieves minimax optimality in power.
3. The de-correlation approach effectively controls false discovery rates.

Abstract

Testing independence among a number of (ultra) high-dimensional random samples is a fundamental and challenging problem. By arranging $n$ identically distributed $p$-dimensional random vectors into a $p \times n$ data matrix, we investigate the problem of testing independence among columns under the matrix-variate normal modeling of data. We propose a computationally simple and tuning-free test statistic, characterize its limiting null distribution, analyze the statistical power and prove its minimax optimality. As an important by-product of the test statistic, a ratio-consistent estimator for the quadratic functional of a covariance matrix from correlated samples is developed. We further study the effect of correlation among samples on an important high-dimensional inference problem: large-scale multiple testing of Pearson's correlation coefficients. Indeed, blindly using classical…

Tables

Table 1: Empirical type I error rates for testing independence based on 5,000 replications with $\alpha=0.05$.

$n$	$\boldsymbol{\Sigma}$	$\boldsymbol{\Psi}$	$p=1{,}000$	$p=2{,}000$	$p=5{,}000$	$p=10{,}000$
200	$0.2^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.046	0.046	0.042	0.043
	$0.5^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.040	0.049	0.049	0.050
	$0.8^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.045	0.048	0.055	0.058
	band	$\mathbf{I}_{n\times n}$	0.031	0.032	0.035	0.043
	block	$\mathbf{I}_{n\times n}$	0.014	0.025	0.030	0.035
500	$0.2^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.034	0.041	0.042	0.046
	$0.5^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.037	0.046	0.041	0.049
	$0.8^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.028	0.050	0.048	0.055
	band	$\mathbf{I}_{n\times n}$	0.032	0.035	0.038	0.040
	block	$\mathbf{I}_{n\times n}$	0.016	0.025	0.041	0.044
1000	$0.2^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.039	0.035	0.048	0.044
	$0.5^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.035	0.042	0.056	0.054
	$0.8^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.026	0.040	0.051	0.050
	band	$\mathbf{I}_{n\times n}$	0.029	0.037	0.040	0.045
	block	$\mathbf{I}_{n\times n}$	0.016	0.024	0.035	0.041

Table 2: Comparison of empirical type I error rates for testing independence when $p=1000$ and $\alpha=0.05$.

$n$	$\boldsymbol{\Sigma}$	$\boldsymbol{\Psi}$	CV	Bai	CQ	$\hat{B}_{n}$
50	$0.2^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.029	0.001	0.038	0.002
	$0.5^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.129	0.004	0.039	0.008
	$0.8^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.157	0.010	0.022	0.037
	band	$\mathbf{I}_{n\times n}$	0.072	0.005	0.033	0.007
	block	$\mathbf{I}_{n\times n}$	0.249	0.011	0.028	0.017
100	$0.2^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.070	0.017	0.041	0.028
	$0.5^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.118	0.020	0.036	0.034
	$0.8^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.101	0.019	0.025	0.049
	band	$\mathbf{I}_{n\times n}$	0.066	0.017	0.034	0.025
	block	$\mathbf{I}_{n\times n}$	0.068	0.017	0.023	0.016

Table 3: Averaged FDP and power for testing correlations over 100 replications. Here, $\alpha=0.05$ and $p=1000$.

$n$	$\boldsymbol{\Sigma}$	$\boldsymbol{\Psi}$	$\sqrt{n}\hat{\rho}_{ij,Y}$		$\sqrt{n}\hat{\rho}_{ij}$		$\frac{\sqrt{n}\hat{\rho}_{ij}}{\sqrt{B_{n}}}$		True $\boldsymbol{\Psi}^{-1}$
			FDP	Power	FDP	Power	FDP	Power	FDP	Power
50	band	$0.2^{|i-j|}$	0.010	0.311	0.027	0.339	0.010	0.276	0.015	0.339
	band	$0.5^{|i-j|}$	0.009	0.308	0.403	0.378	0.000	0.000	0.015	0.340
	band	$0.8^{|i-j|}$	0.009	0.292	0.986	0.648	0.000	0.000	0.014	0.338
	block	$0.2^{|i-j|}$	0.011	0.262	0.036	0.416	0.014	0.295	0.021	0.407
	block	$0.5^{|i-j|}$	0.011	0.257	0.366	0.504	0.000	0.000	0.021	0.410
	block	$0.8^{|i-j|}$	0.012	0.168	0.965	0.739	0.010	0.000	0.021	0.408
100	band	$0.2^{|i-j|}$	0.025	0.576	0.057	0.593	0.029	0.568	0.033	0.587
	band	$0.5^{|i-j|}$	0.025	0.575	0.581	0.650	0.011	0.448	0.032	0.587
	band	$0.8^{|i-j|}$	0.025	0.556	0.986	0.795	0.000	0.000	0.032	0.587
	block	$0.2^{|i-j|}$	0.028	0.942	0.061	0.961	0.035	0.945	0.038	0.966
	block	$0.5^{|i-j|}$	0.028	0.935	0.454	0.942	0.017	0.646	0.038	0.966
	block	$0.8^{|i-j|}$	0.030	0.820	0.963	0.924	0.000	0.000	0.038	0.966
200	band	$0.2^{|i-j|}$	0.036	0.839	0.072	0.852	0.039	0.820	0.041	0.854
	band	$0.5^{|i-j|}$	0.036	0.835	0.620	0.867	0.028	0.640	0.041	0.853
	band	$0.8^{|i-j|}$	0.041	0.749	0.984	0.906	0.002	0.240	0.042	0.852
	block	$0.2^{|i-j|}$	0.034	1.000	0.071	1.000	0.041	1.000	0.043	1.000
	block	$0.5^{|i-j|}$	0.034	1.000	0.498	1.000	0.033	0.992	0.044	1.000
	block	$0.8^{|i-j|}$	0.040	0.969	0.962	0.994	0.004	0.233	0.044	1.000

Table 4: Averaged FDP and power for testing correlations over 100 replications when samples are i.i.d. Here, $\alpha=0.05$ and $p=1000$.

$n$	$\boldsymbol{\Sigma}$	$\boldsymbol{\Psi}$	$\sqrt{n}\hat{\rho}_{ij,Y}$		$\sqrt{n}\hat{\rho}_{ij}$
			FDP	Power	FDP	Power
50	band	$\mathbf{I}_{n\times n}$	0.009	0.318	0.014	0.341
	block	$\mathbf{I}_{n\times n}$	0.012	0.293	0.021	0.408
100	band	$\mathbf{I}_{n\times n}$	0.025	0.578	0.032	0.587
	block	$\mathbf{I}_{n\times n}$	0.028	0.952	0.037	0.965
200	band	$\mathbf{I}_{n\times n}$	0.035	0.844	0.041	0.852
	block	$\mathbf{I}_{n\times n}$	0.035	1.000	0.043	1.000

Table 5: Empirical type I error rates for testing independence when $p=1000$ and $\alpha=0.05$, using simulation-based critical values.

$n$	$\boldsymbol{\Sigma}$	$\boldsymbol{\Psi}$	$\hat{B}_{n}$
50	$0.2^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.004
	$0.5^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.013
	$0.8^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.047
	band	$\mathbf{I}_{n\times n}$	0.010
	block	$\mathbf{I}_{n\times n}$	0.025
100	$0.2^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.039
	$0.5^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.043
	$0.8^{|i-j|}$	$\mathbf{I}_{n\times n}$	0.058
	band	$\mathbf{I}_{n\times n}$	0.031
	block	$\mathbf{I}_{n\times n}$	0.023

Table 6: The number of rejections for the yeast data ($p=1207$ genes). The density is computed as $\frac{\text{No. of Rejections}}{\binom{p}{2}}$.

	$\sqrt{n}\hat{\rho}_{ij,Y}$			$\sqrt{n}\hat{\rho}_{ij}$
	$\alpha=0.001$	$\alpha=0.01$	$\alpha=0.05$	$\alpha=0.001$	$\alpha=0.01$	$\alpha=0.05$
No. of Rejections	448	1062	2114	154072	220390	294810
Density (%)	0.07%	0.15%	0.29%	21.17%	30.28%	40.51%

Table 7: The number of rejections for the stock data ($p=258$ stocks). The density is computed as $\frac{\text{No. of Rejections}}{\binom{p}{2}}$.

	$\sqrt{n}\hat{\rho}_{ij,Y}$			$\sqrt{n}\hat{\rho}_{ij}$
	$\alpha=0.001$	$\alpha=0.01$	$\alpha=0.05$	$\alpha=0.001$	$\alpha=0.01$	$\alpha=0.05$
No. of Rejections	287	442	1034	873	1263	2751
Density (%)	0.87%	1.69%	3.12%	2.63%	4.58%	8.30%

Table 8: Top 5 most correlated pairs of stocks. GICS stands for “Global Industry Classification Standard”.

Stock 1	GICS of Stock 1	Stock 2	GICS of Stock 2
American International Group	Financials	Citigroup	Financials
Corning	Industrials	Sealed Air Corp	Materials
Automatic Data Processing	Internet Software & Services	BMC Software	Internet Software & Services
Bank of America Corp	Financials	Citigroup	Financials
BMC Software	Internet Software & Services	McKesson	Health Care

Equations

\displaystyle H_{0}:\quad\psi_{ij}=0\quad\mbox{for all $1\leq i<j\leq n$.}

\displaystyle H_{0}:\quad\rho_{ij}=0\quad\mbox{for all $1\leq i<j\leq p$,}

\displaystyle H_{0ij}:\ \rho_{ij}=0\quad\mbox{versus}\quad H_{1ij}:\ \rho_{ij}\neq 0.

\displaystyle\frac{\sqrt{n}(\hat{\rho}_{ij}-\rho_{ij})}{\sqrt{B_{n}}(1-\rho_{ij}^{2})}\Rightarrow N(0,1),

\displaystyle\hat{\psi}_{ij}=\frac{1}{p}\sum_{k=1}^{p}(X_{ik}-\bar{X}_{k})(X_{jk}-\bar{X}_{k}),\quad 1\leq i,j\leq n.

\displaystyle\hat{\psi}_{ij}=\frac{1}{p}\sum_{k=1}^{p}(X_{ik}-\mu_{k})(X_{jk}-\mu_{k})-\frac{1}{np}\sum_{k=1}^{p}\sigma_{kk}+O_{\mathbb{P}}\Big{(}\frac{1}{\sqrt{np}}\Big{)}.

\displaystyle T_{ij}:=\hat{\psi}_{ij}+\frac{1}{np}\sum_{k=1}^{p}\hat{\sigma}_{kk},

\displaystyle A_{p}=\frac{p\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}}{(tr({\boldsymbol{\Sigma}}))^{2}}

\mathbb{P}\Big{(}\frac{p}{A_{p}}\max_{1\leq i<j\leq n}\frac{T_{ij}^{2}}{\hat{\psi}_{ii}\hat{\psi}_{jj}}-4\log n+\log\log n\leq t\Big{)}\rightarrow\exp\left(-\frac{1}{\sqrt{8\pi}}\exp\Big{(}-\frac{t}{2}\Big{)}\right)

\displaystyle\hat{T}_{n,p}=\frac{p}{\hat{A}_{p}}\max_{1\leq i<j\leq n}\frac{T_{ij}^{2}}{\hat{\psi}_{ii}\hat{\psi}_{jj}}.

\displaystyle B_{n}=\frac{\|\boldsymbol{\Psi}\|_{\text{F}}^{2}}{n}=\frac{1}{n}\sum_{1\leq i,j\leq n}\psi_{ij}^{2},

\displaystyle\hat{\sigma}_{ij,thr}=\hat{\sigma}_{ij}I\Big\{\frac{|\hat{\rho}_{ij}|}{1-\hat{\rho}_{ij}^{2}}\geq\delta\sqrt{\frac{\hat{B}_{n}\log p}{n}}\Big\}\mbox{ for }i\neq j,\quad\hat{\sigma}_{ii,thr}=\hat{\sigma}_{ii}\mbox{ for }1\leq i\leq p.

\displaystyle\hat{B}_{n}=\frac{1}{n}\Big(\|\hat{\boldsymbol{\Psi}}\|_{\text{F}}^{2}-\frac{1}{p}(tr(\hat{\boldsymbol{\Psi}}))^{2}\Big).

\displaystyle\hat{A}_{p}=\frac{p\|\hat{{\boldsymbol{\Sigma}}}_{thr}\|_{\text{F}}^{2}}{(tr(\hat{{\boldsymbol{\Sigma}}}_{thr}))^{2}}.

\displaystyle\hat{{\boldsymbol{\Sigma}}}_{1}=(\hat{\sigma}_{ij,1}),\quad\mbox{where }\hat{\sigma}_{ij,1}=\hat{\sigma}_{ij}I\Big\{|\hat{\sigma}_{ij}|\geq\lambda\sqrt{\frac{\log p}{n}}\Big\},\ i\neq j,

\displaystyle\mathbb{P}\Big{(}\frac{p}{A_{p}}\max_{1\leq i<j\leq n}\frac{T_{ij}^{2}}{\hat{\psi}_{ii}\hat{\psi}_{jj}}-4\log n+\log\log n\leq t\Big{)}\rightarrow\exp\left(-\frac{1}{\sqrt{8\pi}}\exp\Big{(}-\frac{t}{2}\Big{)}\right).

\displaystyle\hat{\rho}_{ij}=\frac{\sum_{k=1}^{n}(X_{ki}-\bar{X}_{i})(X_{kj}-\bar{X}_{j})}{\sqrt{\sum_{k=1}^{n}(X_{ki}-\bar{X}_{i})^{2}\sum_{k=1}^{n}(X_{kj}-\bar{X}_{j})^{2}}}.

\displaystyle\mathbb{P}\Big{(}\hat{T}_{n,p}-4\log n+\log\log n\leq t\Big{)}\rightarrow\exp\Big{(}-\frac{1}{\sqrt{8\pi}}\exp\Big{(}-\frac{t}{2}\Big{)}\Big{)}

\displaystyle q_{\alpha}=-\log(8\pi)-2\log\log(1-\alpha)^{-1}.

\displaystyle d_{ij,\boldsymbol{\Psi}}=\psi_{ij}-\frac{\sum_{k=1,k\neq i}^{n}\psi_{ik}}{n}-\frac{\sum_{k=1,k\neq j}^{n}\psi_{jk}}{n}-\frac{\sum_{1\leq i\neq j\leq n}\psi_{ij}}{n^{2}(n-1)}.

\displaystyle d_{n,\boldsymbol{\Psi}}:=\max_{1\leq i<j\leq n}|d_{ij,\boldsymbol{\Psi}}|.

\displaystyle d_{n,\boldsymbol{\Psi}}\geq\delta\sqrt{\frac{A_{p}\log n}{p}}.

\displaystyle\mathcal{F}(\delta)=\Big{\{}\boldsymbol{\Psi}\succ 0:~{}\psi_{ii}=1,1\leq i\leq n\mbox{\quad and }d_{n,\boldsymbol{\Psi}}\geq\delta\sqrt{\frac{\log n}{p}}\Big{\}}.

\displaystyle\mathop{\overline{\rm lim}}_{(n,p)\to\infty}\ \sup_{T_{\alpha}\in\mathcal{T}_{\alpha}}\ \inf_{\boldsymbol{\Psi}\in\mathcal{F}(\delta)}\mathbb{P}(T_{\alpha}=1)\leq 1-\beta.

\displaystyle d_{n,\boldsymbol{\Psi}}\geq\Big(1-\frac{3s_{n}}{n}\Big)\max_{1\leq i<j\leq n}|\psi_{ij}|.

\displaystyle\max_{1\leq i<j\leq n}|\psi_{ij}|\geq\delta\sqrt{\frac{A_{p}\log n}{p}}\quad\mbox{for some $\delta>2$.}

\displaystyle\mathcal{G}(a)=\Big{\{}\boldsymbol{\Psi}\succ 0:~{}\psi_{ii}=1,1\leq i\leq n\mbox{\quad and }\max_{1\leq i<j\leq n}|\psi_{ij}|\geq a\Big{\}}.

\displaystyle(1-a^{2})^{-p/2}=o(n^{2})\quad\mbox{as }n\to\infty,

\displaystyle\mathop{\overline{\rm lim}}_{n\to\infty}\ \sup_{T_{\alpha}\in\mathcal{T}_{\alpha}}\ \inf_{\boldsymbol{\Psi}\in\mathcal{G}(a)}\mathbb{P}(T_{\alpha}=1)\leq 1-\beta.

\displaystyle H_{0ij}:\ \rho_{ij}=0\quad\mbox{versus}\quad H_{1ij}:\ \rho_{ij}\neq 0,\quad\mbox{for }1\leq i<j\leq p.

Taxonomy

Topics: Random Matrices and Applications · Statistical Methods and Inference · Statistical Methods and Bayesian Inference

Full text

Testing independence with high-dimensional correlated samples

Xi Chen

Stern School of Business, New York University

44 West 4th Street, New York, NY, USA

Weidong Liu

Department of Mathematics, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University

Shanghai, China

Abstract

Testing independence among a number of (ultra) high-dimensional random samples is a fundamental and challenging problem. By arranging $n$ identically distributed $p$-dimensional random vectors into a $p\times n$ data matrix, we investigate the problem of testing independence among columns under the matrix-variate normal modeling of data. We propose a computationally simple and tuning-free test statistic, characterize its limiting null distribution, analyze the statistical power and prove its minimax optimality. As an important by-product of the test statistic, a ratio-consistent estimator for the quadratic functional of a covariance matrix from correlated samples is developed. We further study the effect of correlation among samples on an important high-dimensional inference problem: large-scale multiple testing of Pearson’s correlation coefficients. Indeed, blindly using classical inference results based on the assumed independence of samples will lead to many false discoveries, which suggests the need for conducting independence testing before applying existing methods. To address the challenge arising from correlation among samples, we propose a “sandwich estimator” of Pearson’s correlation coefficient by de-correlating the samples. Based on this approach, the resulting multiple testing procedure asymptotically controls the overall false discovery rate at the nominal level while maintaining good statistical power. Both simulated and real data experiments are carried out to demonstrate the advantages of the proposed methods.

MSC subject classifications: 62F05, 62H10.

Keywords: independence test, multiple testing of correlations, false discovery rate, matrix-variate normal, quadratic functional estimation, high-dimensional sample correlation matrix.

Research supported by NSFC Grants No. 11322107 and No. 11431006, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, the Shanghai Shuguang Program and the 973 Program (2015CB856004).

1 Introduction

Independence among samples is a fundamental assumption in most statistical modeling, and numerous estimation and inference methods and theories have been built on it. Indeed, from classical statistical inference (e.g., Student’s $t$-test) to popular topics in modern statistics (e.g., high-dimensional problems such as regression, matrix estimation and inference), this assumption occurs widely. Consider $n$ samples $\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n}$, where each sample is a $p$-dimensional vector from the same population distribution with mean $\boldsymbol{\mu}\in\mathbb{R}^{p}$ and covariance ${\boldsymbol{\Sigma}}=(\sigma_{ij})_{p\times p}$. It is often convenient to pool the $n$ samples together to form a $p\times n$ data matrix $\mathbf{X}=(\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n})$. For example, in microarray data, $\mathbf{X}$ is an expression level matrix for $p$ genes measured on $n$ subjects. Such data are usually high-dimensional; thus, we mainly consider the setting where $p$ is much larger than $n$. Most existing works in the high-dimensional literature assume independence among the columns of $\mathbf{X}$, and this assumption serves as the starting point of methodology development and technical analysis. However, recent studies have shown that there are correlation structures among subjects in various microarray datasets (see, e.g., Teng and Huang (2009); Efron (2009); Allen and Tibshirani (2012); Kim et al. (2012)), demonstrating the potential risk of making the seemingly natural assumption of independence. Therefore, given a data matrix $\mathbf{X}$, it is important to first test whether the samples are indeed independent before applying any method that assumes independence.

A data matrix $\mathbf{X}=(\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n})$ is known as transposable data when both rows and columns are potentially correlated (Lazzeroni and Owen, 2002; Allen and Tibshirani, 2012). For a transposable data matrix $\mathbf{X}$, it is commonly assumed that $\mathbf{X}$ follows a matrix-variate normal distribution, which has been widely applied to model microarray data (see, e.g., Teng and Huang (2009); Efron (2009); Muralidharan (2010); Allen and Tibshirani (2012); Yin and Li (2012); Kim et al. (2012); Zhou (2014)). The matrix-variate normal distribution is a natural generalization of the familiar vector-variate normal distribution (Dawid, 1981). In particular, let $\text{vec}(\mathbf{X})\in\mathbb{R}^{np\times 1}$ be the vectorization of the matrix $\mathbf{X}$ obtained by stacking the columns of $\mathbf{X}$ on top of each other. We say $\mathbf{X}\in\mathbb{R}^{p\times n}$ follows a matrix-variate normal distribution with mean matrix $\mathbf{M}\in\mathbb{R}^{p\times n}$ and covariance matrix ${\boldsymbol{\Sigma}}\otimes\boldsymbol{\Psi}\in\mathbb{R}^{np\times np}$ (denoted by $\mathbf{X}\sim N(\mathbf{M},{\boldsymbol{\Sigma}}\otimes\boldsymbol{\Psi})$) if and only if $\text{vec}(\mathbf{X}^{\prime})\sim N(\text{vec}(\mathbf{M}^{\prime}),{\boldsymbol{\Sigma}}\otimes\boldsymbol{\Psi})$. Here, $\mathbf{X}^{\prime}$ denotes the transpose of $\mathbf{X}$, $\otimes$ is the Kronecker product, and $\boldsymbol{\Psi}=(\psi_{ij})_{n\times n}\in\mathbb{R}^{n\times n}$ is the covariance matrix of the row vectors of $\mathbf{X}$. Given a matrix-variate normal $\mathbf{X}\sim N(\mathbf{M},{\boldsymbol{\Sigma}}\otimes\boldsymbol{\Psi})$, each column satisfies $\boldsymbol{X}_{i}\sim N(\boldsymbol{M}_{i},\psi_{ii}{\boldsymbol{\Sigma}})$ for $1\leq i\leq n$, where $\boldsymbol{M}_{i}$ is the $i$-th column of the mean matrix $\mathbf{M}$. Recall our problem setup: each $\boldsymbol{X}_{i}$ follows the same population distribution with mean vector $\boldsymbol{\mu}$ and covariance ${\boldsymbol{\Sigma}}$. Thus, we have $\mathbf{M}=\boldsymbol{\mu}\mathbf{1}^{\prime}$, where $\mathbf{1}$ is the $n$-dimensional all-ones column vector, and $\psi_{ii}=1$ for $1\leq i\leq n$. Under the matrix-variate normal modeling of the data, the independence testing problem is equivalent to the global test of whether $\boldsymbol{\Psi}$ is a diagonal matrix, i.e.,

$$H_{0}:\quad\psi_{ij}=0\quad\mbox{for all $1\leq i<j\leq n$.}\qquad(1)$$
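To make the model concrete, the following is a minimal sketch of drawing one matrix-variate normal data matrix with NumPy. The function names `ar1_matrix` and `sample_matrix_normal` are hypothetical helpers introduced only for illustration, and the construction through Cholesky factors is one standard way to obtain the stated covariance structure, not necessarily how the authors simulate their data.

```python
import numpy as np

def ar1_matrix(size, r):
    """Matrix with entries r^{|i-j|} (unit diagonal); hypothetical helper."""
    idx = np.arange(size)
    return r ** np.abs(idx[:, None] - idx[None, :])

def sample_matrix_normal(mu, Sigma, Psi, rng):
    """Draw one p x n matrix X with mean mu 1' and covariance Sigma (x) Psi."""
    p, n = Sigma.shape[0], Psi.shape[0]
    L_sigma = np.linalg.cholesky(Sigma)   # Sigma = L_sigma L_sigma'
    L_psi = np.linalg.cholesky(Psi)       # Psi = L_psi L_psi'
    Z = rng.standard_normal((p, n))       # i.i.d. N(0, 1) entries
    # Cov(X[a, i], X[b, j]) = Sigma[a, b] * Psi[i, j]
    return mu[:, None] + L_sigma @ Z @ L_psi.T

rng = np.random.default_rng(0)
p, n = 2000, 5
X = sample_matrix_normal(np.zeros(p), np.eye(p), ar1_matrix(n, 0.5), rng)
# Averaging the product of columns 0 and 1 over the p rows estimates psi_{01} = 0.5.
psi_01 = np.mean(X[:, 0] * X[:, 1])
print(X.shape, round(float(psi_01), 3))
```

Under the null hypothesis the matrix passed as `Psi` would be the identity, so that the columns of `X` are independent.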

The testing problem in (1) is closely related to the following correlation test problem

$$H_{0}:\quad\rho_{ij}=0\quad\mbox{for all $1\leq i<j\leq p$,}\qquad(2)$$

where $\rho_{ij}=\sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ is Pearson’s correlation coefficient. The testing problem in (2) is a classical problem in multivariate analysis (Nagao, 1973; Anderson, 2003). It has also been extensively studied in the past decade under the high-dimensional setting (e.g., Johnstone (2001), Ledoit and Wolf (2002), Jiang (2004), Schott (2005), Liu, Lin and Shao (2008), Bai et al. (2009), Cai and Jiang (2011), Jiang and Yang (2013), Han and Liu (2014)). However, the reported results are based on the assumption that samples are independent. In fact, our problem in (1) is equivalent to the testing problem (2) with correlated samples. To see this, note that when each row of $\mathbf{X}$ is treated as an individual sample, the roles of $\boldsymbol{\Psi}$ and ${\boldsymbol{\Sigma}}$ interchange since $\mathbf{X}^{\prime}\sim N(\mathbf{1}\boldsymbol{\mu}^{\prime},\boldsymbol{\Psi}\otimes{\boldsymbol{\Sigma}})$, i.e., the matrix ${\boldsymbol{\Sigma}}$ models the correlations among row samples while $\boldsymbol{\Psi}$ becomes the population covariance matrix. For many types of data (e.g., genetic data, financial data), there exists a complicated correlation structure among the $p$ variables. Thus, ${\boldsymbol{\Sigma}}$ will not be a diagonal matrix and the row vectors are not independent. Our problem in (1) essentially tests the correlation among row vectors when samples are correlated. The correlation among samples makes our problem more challenging, and the aforementioned methods for testing (2), which are based on the assumption of sample independence, cannot be applied to our problem.

The classical methods for testing independence among samples commonly assume that $p$ is fixed and are usually designed only for time series data. This problem is also known as the serial independence test; see Hong (1998) and the references therein. In such a framework, the methods require that the samples under the alternatives come from some time series. These samples satisfy an ordering structure such that the dependence between two samples decays as the distance between their indices increases. In our setting, there is no structural assumption among samples. Without any structural assumption, we will show in Theorem 2.7 that no test can have power tending to 1 uniformly over a large class of alternatives when the dimension $p$ is small (e.g., a fixed constant or $p=o(\log n)$). On the other hand, when $p\geq c\log n$ but $p$ is small compared to $n$, the independence test is relatively easy. In fact, if ${\boldsymbol{\Sigma}}$ is known, the data matrix can be transformed as ${\boldsymbol{\Sigma}}^{-1/2}\mathbf{X}\sim N({\boldsymbol{\Sigma}}^{-1/2}\boldsymbol{\mu}\mathbf{1}^{\prime},\mathbf{I}_{p\times p}\otimes\boldsymbol{\Psi})$, and thus the independence test can be directly carried out using existing approaches (e.g., Jiang (2004); Liu, Lin and Shao (2008)). When ${\boldsymbol{\Sigma}}$ is unknown, one can apply such an approach with a plug-in estimator $\widehat{{\boldsymbol{\Sigma}}^{-1}}$. However, as we will explain later in Section 5, when $p\geq cn$, even the optimal convergence rate of the estimator $\widehat{{\boldsymbol{\Sigma}}^{-1}}$ is not fast enough to solve this problem. In fact, although we have more information (i.e., row samples) as $p$ becomes larger, the number of unknown parameters in ${\boldsymbol{\Sigma}}$ increases accordingly, which makes the problem challenging. Therefore, the high-dimensional setting is the most interesting case, and will be the main focus of the paper.

Although testing independence among high-dimensional samples is an important and fundamental problem, few existing works address it. Based on matrix-variate normal modeling of the data, some inference approaches were proposed by Efron (2009) and Muralidharan (2010). However, these works do not establish the limiting null distributions, nor the validity and power of the tests. Pan, Gao and Yang (2014) proposed a statistic for this problem based on random matrix theory. However, it requires that $p$ be proportional to $n$ (i.e., $0<\lim_{n\rightarrow\infty}\frac{p}{n}<\infty$), and thus cannot be applied to cases where $p=n^{r}$ with $r>1$ or, as in the ultra high-dimensional setting, where $p=\exp(n^{\gamma})$ for some $0<\gamma<1$; both scenarios are common in genetic applications. Further, the method in Pan, Gao and Yang (2014) requires splitting the $n$ samples into two parts, and different splits could lead to different test results.

In this paper, we consider the (ultra) high-dimensional setup and propose a minimax optimal test procedure in terms of the statistical power for the testing problem in (1). We show that the distribution of the proposed max-type test statistic converges to a type I extreme value distribution under the null (Theorem 2.4). Therefore, the proposed test has the pre-specified significance level asymptotically. We also investigate the statistical power. Roughly speaking, we show that under some very mild conditions on the off-diagonal elements of $\boldsymbol{\Psi}$, the power will converge to 1. Further, we prove that the proposed test is minimax rate-optimal over a large class of $\boldsymbol{\Psi}$ (Theorems 2.5 and 2.6).
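As a small numerical illustration of how the limiting law translates into a rejection rule, the sketch below evaluates the limiting null distribution function $\exp(-\frac{1}{\sqrt{8\pi}}\exp(-t/2))$ and the quantile $q_{\alpha}=-\log(8\pi)-2\log\log(1-\alpha)^{-1}$ from the paper's equation list. The function names are hypothetical, and the rule "reject when the statistic exceeds $4\log n-\log\log n+q_{\alpha}$" is the reading implied by those two formulas.

```python
import math

def limiting_null_cdf(t):
    """Limiting null CDF: exp(-(8*pi)^(-1/2) * exp(-t/2))."""
    return math.exp(-math.exp(-t / 2.0) / math.sqrt(8.0 * math.pi))

def q_alpha(alpha):
    """(1 - alpha)-quantile of the limiting law above."""
    return -math.log(8.0 * math.pi) - 2.0 * math.log(math.log(1.0 / (1.0 - alpha)))

def critical_value(n, alpha):
    """Threshold for the max-type statistic; requires n > 1 so that log(log(n)) is defined."""
    return 4.0 * math.log(n) - math.log(math.log(n)) + q_alpha(alpha)

# Sanity check: the CDF evaluated at q_alpha recovers 1 - alpha.
print(limiting_null_cdf(q_alpha(0.05)))
print(critical_value(200, 0.05))
```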

Our construction of the test statistic combines a bias correction and a variance correction based on the sample covariance matrix $(\hat{\psi}_{ij})_{n\times n}$, where we treat each row of $\mathbf{X}$ as a sample. The bias correction technique allows us to handle the ultra high-dimensional case. Moreover, the variance correction technique deals with the correlation structure among “row samples” of $\mathbf{X}$, which is specified by ${\boldsymbol{\Sigma}}$. To characterize the strength of correlation among row samples, we identify a key quantity $A_{p}=\frac{p\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}}{(tr({\boldsymbol{\Sigma}}))^{2}}$, which comes from the asymptotic variance of our bias-corrected statistic. Here, $\|\cdot\|_{\text{F}}$ denotes the Frobenius norm and $tr({\boldsymbol{\Sigma}})$ is the trace of ${\boldsymbol{\Sigma}}$. To simultaneously control the type I error under the null and maintain the minimax rate-optimal statistical power, we need a ratio-consistent estimator of $A_{p}$ regardless of the correlation among samples. Therefore, the remaining task essentially reduces to the problem of estimating $\|{\boldsymbol{\Sigma}}\|^{2}_{\text{F}}$ from correlated samples.
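To give a feel for the quantity $A_{p}$, the sketch below evaluates it for three illustrative choices of ${\boldsymbol{\Sigma}}$. The function name `a_p` and the example matrices are assumptions made for illustration only. By the Cauchy-Schwarz inequality applied to the eigenvalues, $A_{p}\geq 1$ for any covariance matrix, with equality when ${\boldsymbol{\Sigma}}$ is a multiple of the identity.

```python
import numpy as np

def a_p(Sigma):
    """A_p = p * ||Sigma||_F^2 / tr(Sigma)^2."""
    p = Sigma.shape[0]
    return p * np.sum(Sigma ** 2) / np.trace(Sigma) ** 2

p = 500
idx = np.arange(p)
identity = np.eye(p)                                  # uncorrelated variables
ar1 = 0.5 ** np.abs(idx[:, None] - idx[None, :])      # sigma_ij = 0.5^{|i-j|}
equi = np.full((p, p), 0.3) + 0.7 * np.eye(p)         # unit diagonal, 0.3 elsewhere

for name, Sigma in [("identity", identity), ("AR(1)", ar1), ("equicorrelated", equi)]:
    print(name, float(a_p(Sigma)))
```

Weak, decaying correlation keeps $A_{p}$ bounded, whereas the equicorrelated example gives $A_{p}=1+(p-1)\cdot 0.3^{2}$, which grows linearly in $p$.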

It is noteworthy that estimating $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ is itself an important problem, which is known as quadratic functional estimation of ${\boldsymbol{\Sigma}}$ (see, e.g., Bai and Saranadasa (1996); Chen and Qin (2010); Fan, Rigollet and Wang (2015)). Most existing works are based on the assumption that samples are independent and identically distributed (i.i.d.) and, thus, cannot be directly applied to our problem. Motivated by the thresholding estimator in Fan, Rigollet and Wang (2015), we propose a plug-in estimator for $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ based on a thresholded sample covariance matrix, but we relax the independence assumption among samples. Further, we propose a definite threshold level, which is adaptive to the amount of correlation among samples and guarantees the consistency of the resulting estimator. Our simulation results demonstrate the superior performance of the proposed estimator of $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ over the existing approaches, which leads to a significant improvement in statistical power.

In summary, we propose a simple max-type test statistic to conduct the global test of independence among high-dimensional random samples in (1). Our approach has the following advantages:

1. Our construction is direct and computationally attractive: it only requires the row sample covariance matrix $(\hat{\psi}_{ij})_{n\times n}$ and a threshold estimator of $\|{\boldsymbol{\Sigma}}\|_{\text{F}}$. Further, our test statistic is completely tuning-free.

2. The limiting null distribution is characterized and thus the type I error is controlled asymptotically. Further, our test procedure is minimax rate-optimal over a sufficiently large class of $\boldsymbol{\Psi}$, which is enough for most practical purposes.

3. As an important by-product, we provide a ratio-consistent estimator of the quadratic functional of a covariance matrix from correlated samples.
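The computation behind the first point can be sketched in a few lines. The code below follows the formulas for $\hat{\psi}_{ij}$, $T_{ij}$ and the max-type statistic listed in the paper's equations, with two loudly stated assumptions: $A_{p}$ is supplied by the caller (an oracle value, whereas the paper plugs in the thresholding estimator $\hat{A}_{p}$), and $\hat{\sigma}_{kk}$ is taken to be the sample variance of variable $k$ across the $n$ columns, a definition this section does not spell out. The function name is hypothetical.

```python
import numpy as np

def independence_statistic(X, A_p):
    """Max-type statistic for a p x n matrix X whose columns are the samples.

    A_p is an oracle input here; the paper replaces it with an estimator.
    """
    p, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)      # center each variable across samples
    psi_hat = Xc.T @ Xc / p                     # n x n matrix of hat{psi}_ij
    sigma_kk = np.mean(Xc ** 2, axis=1)         # ASSUMED form of hat{sigma}_kk
    T = psi_hat + sigma_kk.sum() / (n * p)      # bias-corrected entries T_ij
    d = np.diag(psi_hat)
    ratio = T ** 2 / np.outer(d, d)             # T_ij^2 / (psi_ii * psi_jj)
    upper = np.triu_indices(n, k=1)             # pairs with i < j
    return (p / A_p) * ratio[upper].max()

rng = np.random.default_rng(1)
p, n = 2000, 20
Psi = np.eye(n)
Psi[0, 1] = Psi[1, 0] = 0.5                     # one dependent pair of samples
X_dep = rng.standard_normal((p, n)) @ np.linalg.cholesky(Psi).T   # Sigma = I, so A_p = 1
stat_dep = independence_statistic(X_dep, A_p=1.0)
centering = 4 * np.log(n) - np.log(np.log(n))   # centering term of the limiting law
print(float(stat_dep), float(centering))
```

With a single strongly dependent pair of columns, the statistic lands far above the centering term $4\log n-\log\log n$, which is the behavior the power analysis describes.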

We would like to note that we only focus on the matrix-variate normal distribution, which is a common assumption for studying a transposable data matrix and is widely used for modeling correlated microarray data. It is of interest to investigate the independence test for more general distributions, e.g., a matrix elliptical distribution (Dawid, 1977; Fang and Zhang, 1990) or $\mathbf{X}={\boldsymbol{\Sigma}}^{1/2}\mathbf{Z}\boldsymbol{\Psi}^{1/2}$, where the entries of $\mathbf{Z}=(Z_{ij})_{p\times n}$ are i.i.d. random variables with unit variance. We leave the extension to such distributions of data matrices for future work.

After we conduct the independence test, if the samples are indeed correlated, many classical inference approaches cannot be directly applied. We use the multiple testing problem of Pearson’s correlation coefficients to illustrate the effect of the correlation among samples, show why the classical approach fails when samples are correlated, and further develop a new method to de-correlate the samples. In particular, we consider the following large-scale multiple testing problem, for $1\leq i<j\leq p$,

$$H_{0ij}:\ \rho_{ij}=0\quad\mbox{versus}\quad H_{1ij}:\ \rho_{ij}\neq 0.\qquad(3)$$

Problem (3) is a natural extension of the global test of independence in (2). In fact, the hypothesis that ${\boldsymbol{\Sigma}}$ is a diagonal matrix is a strong null hypothesis, which will be rejected in most real data applications (e.g., microarray data, stock data). In contrast, the goal of the multiple testing problem (3) is to identify the pairs of correlated variables, and it thus finds many applications in real data analysis, e.g., gene coexpression network analysis (Lee, Hsu and Sajdak, 2004; Carter et al., 2004; Zhu et al., 2005; Hirai, 2007) and brain connectivity analysis (Shaw, 2006). The goal of the testing problem in (3) is consistent with the goal of support recovery of a sparse ${\boldsymbol{\Sigma}}$. The latter problem has been extensively studied in recent years (e.g., see Rothman, Levina and Zhu (2009), Lam and Fan (2009), Cai and Liu (2011), Bien and Tibshirani (2011)). These works establish consistency results of support recovery from independent samples under certain conditions, for example, that the absolute values of all nonzero $\rho_{ij}$ are bounded below by $C\sqrt{\frac{\log p}{n}}$, which may not hold in practice. Instead of trying to achieve perfect support recovery, the multiple testing problem (3) provides a more refined control of the type I error rate in support recovery under weaker assumptions. In particular, it usually aims to control the false discovery rate (FDR), which is a useful measure for evaluating the performance of support recovery. We also note that Cai and Liu (2015) recently studied problem (3) in a high-dimensional setting, but their approach still requires the independence assumption.

For correlated samples from a matrix-variate normal distribution, we first establish the following result on the limiting distribution of the sample correlation coefficient $\hat{\rho}_{ij}$ (see Proposition 3.1):

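$\displaystyle\frac{\sqrt{n}(\hat{\rho}_{ij}-\rho_{ij})}{\sqrt{B_{n}}(1-\rho_{ij}^{2})}\Rightarrow N(0,1),$

(4)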

where $B_{n}=\frac{\|\boldsymbol{\Psi}\|_{\text{F}}^{2}}{n}$ , which quantifies the strength of the correlation among samples. Eq. (4) subsumes as a special case the classical result on the limiting distribution of $\hat{\rho}_{ij}$ when samples are i.i.d. ( $B_{n}=1$ in (4)) (see Theorem 4.2.4 in Anderson (2003)). When the correlation is strong enough that $B_{n}>1+c$ for some constant $c>0$ , directly using the sample correlation coefficient $\sqrt{n}\hat{\rho}_{ij}$ or Fisher’s $z$ statistic will lead to many false positives; this is verified by our simulations in Section 4.2. In fact, even if $B_{n}$ is known and one uses the correct limiting null distribution $N\left(0,\frac{B_{n}}{n}\right)$ of $\hat{\rho}_{ij}$ , the variance of $\hat{\rho}_{ij}-\rho_{ij}$ becomes larger as $B_{n}$ increases, which leads to a lower power of the test.
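The variance inflation described above can be checked numerically. The following is a minimal sketch, not the simulation design of Section 4.2: it assumes ${\boldsymbol{\Sigma}}=\mathbf{I}_{p\times p}$ and an illustrative AR(1)-type $\boldsymbol{\Psi}$ with entries $r^{|i-j|}$, and the helper names `ar1_matrix`, `b_n` and `scaled_corr_variance` are hypothetical. It compares $B_{n}$ with the empirical variance of $\sqrt{n}\hat{\rho}_{ij}$ over all pairs of rows.

```python
import numpy as np

def ar1_matrix(n, r):
    """Illustrative Psi with entries r**|i-j| (an assumption, not from the paper)."""
    idx = np.arange(n)
    return r ** np.abs(idx[:, None] - idx[None, :])

def b_n(psi):
    """B_n = ||Psi||_F^2 / n, as defined in the text."""
    return np.sum(psi ** 2) / psi.shape[0]

def scaled_corr_variance(p, psi, seed=0):
    """Empirical variance of sqrt(n) * rho_hat_ij over pairs i < j when Sigma = I."""
    n = psi.shape[0]
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(psi)
    # Each row of x is one variable observed on n correlated samples: Cov(row) = Psi.
    x = rng.standard_normal((p, n)) @ chol.T
    corr = np.corrcoef(x)                       # p x p sample correlations between rows
    off_diag = corr[np.triu_indices(p, k=1)]
    return n * np.var(off_diag)

psi = ar1_matrix(200, 0.6)
print(b_n(np.eye(200)), b_n(psi), scaled_corr_variance(400, psi))
```

With independent samples the first printed value is one, while under the correlated design the empirical variance of $\sqrt{n}\hat{\rho}_{ij}$ should track $B_{n}$ rather than one, which is why i.i.d. critical values are too small in that case.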

To overcome the side effect of correlation among samples, we propose a “sandwich estimator” of $\rho_{ij}$ by de-correlating the samples, which has the limiting distribution $N(\rho_{ij},\frac{1}{n}(1-\rho^{2}_{ij})^{2})$ . The corresponding asymptotic variance does not depend on $B_{n}$ and is smaller than that of the naïve estimator $\hat{\rho}_{ij}$ . Therefore, the proposed “sandwich estimator” has improved statistical power, especially when the correlation among samples is strong. Based on the proposed “sandwich estimator” of $\rho_{ij}$ , the standard multiple testing procedure (Benjamini and Hochberg, 1995) is proven to asymptotically control the FDR at the nominal level (see Theorem 3.2).

Finally, we introduce some necessary notation. For a positive integer $p$ , $[p]:=\{1,\ldots,p\}$ . For a square matrix $\mathbf{A}$ , let $tr(\mathbf{A})$ denote the trace of $\mathbf{A}$ , $\lambda_{\max}(\mathbf{A})$ the maximum eigenvalue of $\mathbf{A}$ , and $\lambda_{\min}(\mathbf{A})$ the minimum eigenvalue of $\mathbf{A}$ . Let $I\{B\}$ be the indicator function that takes value one when the event $B$ is true and zero otherwise. For a given set $\mathcal{H}$ , let $Card(\mathcal{H})$ be the cardinality of $\mathcal{H}$ . For any two real numbers $a$ and $b$ , let $a\vee b=\max(a,b)$ and $a\wedge b=\min(a,b)$ . We use $\mathop{\overline{\rm lim}}$ and $\mathop{\underline{\rm lim}}$ to denote limit superior and limit inferior, respectively. Throughout the paper, we use $\mathbf{I}_{p\times p}$ to denote the $p\times p$ identity matrix, and use $C$ , $c$ , $c_{1}$ , etc. to denote constants whose values might change from place to place and do not depend on $n$ and $p$ .

The rest of the paper is organized as follows. In Section 2, we study the global test in (1). The test statistic is proposed in Section 2.1. In Section 2.2, we provide ratio-consistent estimators of $A_{p}$ and $\|{\boldsymbol{\Sigma}}\|^{2}_{\text{F}}$ from correlated samples; the estimation error is characterized in Theorem 2.1. In Section 2.3, we further provide the limiting null distribution of the test statistic and the power analysis (Theorems 2.4–2.7). Section 3 studies the multiple testing of correlations in (3) from correlated samples. Experimental results are given in Section 4, followed by a discussion in Section 5. The proofs of our results as well as some additional experimental results are provided in the Appendix.

2 Sample independence test

We study the global testing problem of sample independence in (1) given the $p\times n$ data matrix $\textbf{X}=(\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n})\sim N(\boldsymbol{\mu}\mathbf{1}^{\prime},{\boldsymbol{\Sigma}}\otimes\boldsymbol{\Psi})$ .

2.1 Construction of the test statistic

Recall that $\boldsymbol{X}_{i}=(X_{i1},\ldots,X_{ip})^{\prime}$ denotes the $i$ -th sample for $1\leq i\leq n$ and let $\bar{\boldsymbol{X}}=\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{X}_{i}=:(\bar{X}_{1},\ldots,\bar{X}_{p})^{\prime}$ . Define

[TABLE]

In fact, from the proof, the statistic $(\hat{\psi}_{ij})_{n\times n}$ is the sample covariance coefficient corresponding to $\frac{tr({\boldsymbol{\Sigma}})}{p}\psi_{ij}$ . Further, under the null $H_{0}$ , we can show that

[TABLE]

The first term $\frac{1}{p}\sum_{k=1}^{p}(X_{ik}-\mu_{k})(X_{jk}-\mu_{k})$ has mean $\frac{tr({\boldsymbol{\Sigma}})}{p}\psi_{ij}$ and variance $\frac{\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}}{p^{2}}(\psi_{ii}\psi_{jj}+\psi_{ij}^{2})$ . The bias term $\frac{1}{np}\sum_{k=1}^{p}\sigma_{kk}$ comes from the centralization statistics $\{\bar{X}_{k}\}_{k=1}^{p}$ in (5). When $p=o(n^{2})$ , we have $\frac{1}{np}\sum_{k=1}^{p}\sigma_{kk}=o(1/\sqrt{p})$ and $\sqrt{p}\hat{\psi}_{ij}$ can be shown to converge to a normal distribution. However, as we are interested in the ultra high-dimensional case where $p$ can be as large as $\exp(o(n^{\gamma}))$ for some $0<\gamma<1$ , when $p$ becomes larger such that $n^{2}=o(p)$ , $\sqrt{p}\hat{\psi}_{ij}\rightarrow-\infty$ in probability under the null. To enable the applicability of our test statistic in the ultra high-dimensional setting, we first propose the following bias corrected quantity:

[TABLE]

where $\hat{\sigma}_{kk}=\frac{1}{n-1}\sum_{j=1}^{n}(X_{jk}-\bar{X}_{k})^{2}$ is the sample variance corresponding to $\sigma_{kk}$ . Since the first term in (6) has variance $\frac{\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}}{p^{2}}(\psi_{ii}\psi_{jj}+\psi_{ij}^{2})$ , the asymptotic variance of $T_{ij}$ is $\left(\frac{tr({\boldsymbol{\Sigma}})}{p}\right)^{2}\frac{A_{p}}{p}$ , where

[TABLE]

quantifies the strength of correlations among row vectors of X.
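As a small numerical illustration, the sketch below evaluates $A_{p}$ under the reading $A_{p}=p\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}/(tr({\boldsymbol{\Sigma}}))^{2}$. This form is an inference from the variance expressions above (the variance $\frac{\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}}{p^{2}}$ under the null equated with $\left(\frac{tr({\boldsymbol{\Sigma}})}{p}\right)^{2}\frac{A_{p}}{p}$) and matches the form of the plug-in quantity $\tilde{A}_{p}$ used later in Section 2.2; the function names are hypothetical.

```python
import numpy as np

def a_p(sigma):
    """A_p = p * ||Sigma||_F^2 / tr(Sigma)^2 (assumed form, see the lead-in)."""
    p = sigma.shape[0]
    return p * np.sum(sigma ** 2) / np.trace(sigma) ** 2

def equicorrelation(p, rho):
    """Sigma = rho * 1 1' + (1 - rho) * I, an example with strongly correlated rows."""
    return rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)

print(a_p(np.eye(100)))                    # uncorrelated rows
print(a_p(equicorrelation(100, 0.5)))      # equals 1 + (p - 1) * rho**2 here
```

Under this reading $A_{p}=1$ when ${\boldsymbol{\Sigma}}$ is the identity and, by the Cauchy-Schwarz inequality applied to the eigenvalues, $A_{p}\geq 1$ in general, so stronger correlation among rows inflates the variance of $T_{ij}$.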

Given $A_{p}$ in (8), we will show that under the null as $(n,p)\rightarrow\infty$ ,

[TABLE]

for $t\in\mathbb{R}$ , where the term $A_{p}$ plays the role of variance correction for $T_{ij}$ . The remaining task is to develop a ratio consistent estimator $\hat{A}_{p}$ for $A_{p}$ . In addition, to maintain the statistical power, the estimator $\hat{A}_{p}$ should also be consistent for correlated samples. In Section 2.2, we will develop such an estimator for ${A}_{p}$ . Given the estimator $\hat{A}_{p}$ (see (14)), we propose the following test statistic for the independence test in (1),

[TABLE]
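The construction in this subsection can be sketched in a few lines. The code below is only an illustration under stated assumptions: it takes $\hat{\psi}_{ij}=\frac{1}{p}\sum_{k=1}^{p}(X_{ik}-\bar{X}_{k})(X_{jk}-\bar{X}_{k})$, applies the bias correction as $T_{ij}=\hat{\psi}_{ij}+\frac{1}{np}\sum_{k=1}^{p}\hat{\sigma}_{kk}$, and forms the statistic as $\frac{p}{\hat{A}_{p}}\max_{1\leq i<j\leq n}\frac{T_{ij}^{2}}{\hat{\psi}_{ii}\hat{\psi}_{jj}}$, the form that appears in the discussion of Proposition 2.3. These forms are inferred from the surrounding text, the estimate `a_hat` is supplied by the caller, and the function name is hypothetical.

```python
import numpy as np

def independence_statistic(x, a_hat):
    """x: p x n data matrix (columns are samples); a_hat: an estimate of A_p."""
    p, n = x.shape
    xc = x - x.mean(axis=1, keepdims=True)         # subtract X-bar_k from every sample
    psi_hat = xc.T @ xc / p                         # n x n matrix of psi-hat_ij
    sigma_hat_kk = np.sum(xc ** 2, axis=1) / (n - 1)
    # Assumed bias correction: add back (1 / (n p)) * sum_k sigma-hat_kk.
    t = psi_hat + sigma_hat_kk.mean() / n
    d = np.diag(psi_hat)
    ratio = t ** 2 / np.outer(d, d)
    iu = np.triu_indices(n, k=1)                    # pairs i < j only
    return psi_hat, t, (p / a_hat) * ratio[iu].max()

rng = np.random.default_rng(1)
x = rng.standard_normal((500, 12))
psi_hat, t, stat = independence_statistic(x, a_hat=1.0)
print(stat)
```

One way to see what the correction does: because each row of the centered data sums to zero, the off-diagonal entries of $\hat{\psi}$ average exactly $-\frac{1}{np}\sum_{k}\hat{\sigma}_{kk}$, so adding this term back centers the off-diagonal $T_{ij}$ at zero on average.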

2.2 Estimation of $A_{p}$ and $\|{{\boldsymbol{\Sigma}}}\|^{2}_{\text{F}}$ from correlated samples

The estimation of $\|{\boldsymbol{\Sigma}}\|^{2}_{\text{F}}$ finds many applications and has been studied in several works (Bai and Saranadasa, 1996; Chen and Qin, 2010; Fan, Rigollet and Wang, 2015). However, all these works rely on the sample independence assumption. In particular, Fan, Rigollet and Wang (2015) proved that the simple plug-in procedure based on threshold estimators is minimax optimal over a large class of covariance matrices. Moreover, the threshold level in Fan, Rigollet and Wang (2015) takes the form of $C\sqrt{\frac{\log p}{n}}$ , where the constant $C$ needs to be carefully tuned to achieve good performance in practice. A cross-validation (CV) procedure was suggested; however, there is no theoretical justification for such a CV procedure. In this section, we introduce a threshold estimator for $\|{\boldsymbol{\Sigma}}\|^{2}_{\text{F}}$ with an explicit threshold level, which is completely data-driven without any tuning and automatically adaptive to the correlation among samples. We will show in Theorem 2.1 that the obtained estimator is ratio-consistent for correlated samples.

Let us define the (column) sample covariance matrix $\hat{{\boldsymbol{\Sigma}}}=(\hat{\sigma}_{ij})_{1\leq i,j\leq p}$ with $\hat{\sigma}_{ij}=\frac{1}{n-1}\sum_{k=1}^{n}(X_{ki}-\bar{X}_{i})(X_{kj}-\bar{X}_{j})$ and sample correlation coefficient $\hat{\rho}_{ij}=\hat{\sigma}_{ij}/\sqrt{\hat{\sigma}_{ii}\hat{\sigma}_{jj}}$ for $1\leq i,j\leq p$ . Further, define

[TABLE]

which quantifies the average correlation among samples. It can be shown that $\frac{\sqrt{n}(\hat{\rho}_{ij}-\rho_{ij})}{\sqrt{B_{n}}(1-\hat{\rho}_{ij}^{2})}\Rightarrow N(0,1)$ (see Proposition 3.1 in Section 3 and note that $\hat{\rho}_{ij}\rightarrow\rho_{ij}$ in probability). We propose the following threshold estimator $\hat{{\boldsymbol{\Sigma}}}_{thr}=(\hat{\sigma}_{ij,thr})_{1\leq i,j\leq p}$ , where

[TABLE]

Here, $\hat{B}_{n}$ is an estimator of $B_{n}$ and $\delta$ can be any constant larger than $\sqrt{2}$ . Let $\hat{\boldsymbol{\Psi}}=(\frac{p}{tr(\hat{{\boldsymbol{\Sigma}}})}\hat{\psi}_{ij})_{1\leq i,j\leq n}$ . Using the approach from Bai and Saranadasa (1996), we construct

[TABLE]

Given the threshold estimator $\hat{{\boldsymbol{\Sigma}}}_{thr}=(\hat{\sigma}_{ij,thr})_{1\leq i,j\leq p}$ in (12), the $\|{\boldsymbol{\Sigma}}\|^{2}_{\text{F}}$ is estimated by $\|\hat{{\boldsymbol{\Sigma}}}_{thr}\|^{2}_{\text{F}}$ and $A_{p}$ is estimated by

[TABLE]

Now we will show that $\|\hat{{\boldsymbol{\Sigma}}}_{thr}\|^{2}_{\text{F}}$ and $\hat{A}_{p}$ are ratio-consistent estimators of $\|{\boldsymbol{\Sigma}}\|^{2}_{\text{F}}$ and $A_{p}$ , respectively. We first make the following three assumptions throughout this section. Let $\lambda_{\min}({\boldsymbol{\Sigma}})=\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{p}=\lambda_{\max}({\boldsymbol{\Sigma}})$ be the eigenvalues of ${\boldsymbol{\Sigma}}$ and $\lambda_{\min}(\boldsymbol{\Psi})=\nu_{1}\leq\nu_{2}\leq\cdots\leq\nu_{n}=\lambda_{\max}(\boldsymbol{\Psi})$ be the eigenvalues of $\boldsymbol{\Psi}$ . We make the following standard assumption on eigenvalues:

(C1) We assume that $c^{-1}\leq\lambda_{\min}({\boldsymbol{\Sigma}})\leq\lambda_{\max}({\boldsymbol{\Sigma}})\leq c$ and $c^{-1}\leq\lambda_{\min}(\boldsymbol{\Psi})\leq\lambda_{\max}(\boldsymbol{\Psi})\leq c$ for some constant $c>0$ .

The condition (C1) is a typical eigenvalue assumption in the high-dimensional covariance estimation literature (see the survey Cai, Ren and Zhou (2016) and references therein). This assumption is natural for many important classes of covariance matrices, e.g., bandable, Toeplitz, and sparse covariance matrices. There are cases in which assumption (C1) is violated, e.g., when the covariance matrix has an equal correlation structure (i.e., ${\boldsymbol{\Sigma}}=\rho\cdot\mathbf{1}\mathbf{1}^{\prime}+(1-\rho)\cdot\mathbf{I}_{p\times p}$ for some $\rho\in(0,1)$ ). Our results do not hold in such a setting; see Figure 3 for an experimental illustration.

We also note that this condition can be weakened by replacing the constant $c$ by some $c_{p}\rightarrow\infty$ at a certain rate. However, for the sake of simplicity, we do not seek the optimal rate of $c_{p}$ . We only mention that this type of constraint on eigenvalues is needed in our problem. Without such constraints, $T_{ij}$ in (7) will no longer be asymptotically normal because Lindeberg’s condition for the central limit theorem (CLT) of independent random variables (see the expression of $\hat{\psi}_{ij}$ in Eq. (67) in the Appendix) is violated. Thus, our result on type I error rate control in Proposition 2.3 will no longer hold.

The second condition is also a standard assumption on the norm of each row of $\boldsymbol{\Psi}$ and ${\boldsymbol{\Sigma}}$ .

(C2) For some $0<\tau<2$ , assume that $\sum_{k=1}^{n}|\psi_{ik}|^{\tau}\leq C$ uniformly over each row $1\leq i\leq n$ and $\sum_{k=1}^{p}|\sigma_{jk}|^{\tau}\leq C$ uniformly over each row $1\leq j\leq p$ .

Notably, the upper bounds on eigenvalues of ${\boldsymbol{\Sigma}}$ and $\boldsymbol{\Psi}$ in (C1) only imply the $\ell_{2}$ -boundedness of each row of $\boldsymbol{\Psi}$ and ${\boldsymbol{\Sigma}}$ , i.e., $\sum_{k=1}^{n}|\psi_{ik}|^{2}\leq c^{2}$ and $\sum_{k=1}^{p}|\sigma_{jk}|^{2}\leq c^{2}$ . Since $0<\tau<2$ , the condition (C2) is stronger than this implication. Moreover, when $0<\tau<1$ , this assumption becomes the typical weak sparsity assumption in high-dimensional covariance estimation.
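For completeness, the two facts used in this comparison can be written out as follows. This is a short derivation added for illustration, where $\mathbf{e}_{i}$ denotes the $i$-th standard basis vector (notation introduced here, not in the paper) and the bound $|\psi_{ik}|\leq\lambda_{\max}(\boldsymbol{\Psi})\leq c$ for a symmetric positive semidefinite $\boldsymbol{\Psi}$ is used in the second line. The same argument applies to the rows of ${\boldsymbol{\Sigma}}$.

```latex
\sum_{k=1}^{n}\psi_{ik}^{2}
  = \mathbf{e}_{i}^{\prime}\boldsymbol{\Psi}^{2}\mathbf{e}_{i}
  \leq \lambda_{\max}(\boldsymbol{\Psi}^{2})
  = \lambda_{\max}(\boldsymbol{\Psi})^{2}
  \leq c^{2},
\\
\sum_{k=1}^{n}\psi_{ik}^{2}
  \leq \Big(\max_{1\leq k\leq n}|\psi_{ik}|\Big)^{2-\tau}\sum_{k=1}^{n}|\psi_{ik}|^{\tau}
  \leq c^{2-\tau}C .
```

The first line shows that (C1) alone gives the $\ell_{2}$ bound, and the second shows that (C1) together with (C2) also implies an $\ell_{2}$ bound, while the converse fails in general.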

The third assumption is on the relationship between $n$ and $p$ .

(C3) We assume that $p>cn$ for some universal constant $c>0$ that does not depend on $p$ and $n$ . We further assume that $p=\exp(o(n^{\gamma}))$ with $\gamma=(1-\epsilon)\wedge(\frac{2}{\tau}-1)$ for some $\epsilon>0$ .

The first condition $p=p_{n}>cn$ is quite natural in a high-dimensional setting and the second condition $p=\exp(o(n^{\gamma}))$ allows us to deal with an ultra high-dimensional setting.

Under these three assumptions, we provide the following theorem, which establishes the ratio consistency of the estimators $\hat{A}_{p}$ and $\|\hat{{\boldsymbol{\Sigma}}}_{thr}\|^{2}_{\text{F}}$ .

Theorem 2.1.

Assume that (C1)-(C3) hold. For any $\delta>\sqrt{2}$ , we have $\frac{\hat{A}_{p}}{A_{p}}=1+O_{\mathbb{P}}\Big{(}\Big{(}\sqrt{\frac{\log p}{n}}\Big{)}^{\min(1,2-\tau)}\Big{)}$ and $\frac{\|\hat{{\boldsymbol{\Sigma}}}_{thr}\|^{2}_{\text{F}}}{\|{\boldsymbol{\Sigma}}\|^{2}_{\text{F}}}=1+O_{\mathbb{P}}\Big{(}\Big{(}\sqrt{\frac{\log p}{n}}\Big{)}^{\min(1,2-\tau)}\Big{)}$ .

According to Theorem 2.1, we simply set $\delta=1.42$ in the estimator $\hat{A}_{p}$ in our experiments. In fact, the experimental results are quite robust with respect to the choice of $\delta$ . As long as $\delta$ is above $\sqrt{2}$ and not too large, the experimental results will not be affected.

Due to the term $\hat{B}_{n}$ in the thresholding level, our estimator is adaptive to the correlations between the samples. We next show that, even when ${\boldsymbol{\Sigma}}=\mathbf{I}_{p\times p}$ , if we use the thresholding level designed for i.i.d. samples without $\hat{B}_{n}$ as in Fan, Rigollet and Wang (2015), the resultant estimator $\tilde{A}_{p}$ will over-estimate $A_{p}$ and, hence, reduce the power. In particular, define the thresholding estimator

[TABLE]

and $\hat{\sigma}_{ii,1}=\hat{\sigma}_{ii}$ . Fan, Rigollet and Wang (2015) showed that, under the i.i.d. assumption, for a large constant-valued $\lambda$ (not depending on $\boldsymbol{\Psi}$ ), $\sum_{i\neq j}\hat{\sigma}^{2}_{ij,1}$ attains the minimax-optimal rate for estimating $\sum_{i\neq j}\sigma^{2}_{ij}$ . Let $\tilde{A}_{p}=\frac{p\|\hat{{\boldsymbol{\Sigma}}}_{1}\|^{2}_{\text{F}}}{(tr(\hat{{\boldsymbol{\Sigma}}}_{1}))^{2}}$ . When the samples are correlated, $\|\hat{{\boldsymbol{\Sigma}}}_{1}\|_{\text{F}}$ is no longer a ratio consistent estimator for $\|{\boldsymbol{\Sigma}}\|_{\text{F}}$ and hence results in a poor estimator for $A_{p}$ .

Proposition 2.2.

Assume that ${\boldsymbol{\Sigma}}=\mathbf{I}_{p\times p}$ and (C1)-(C3) hold. For any $\lambda>0$ and $\nu>0$ , there is a class of covariance matrices $\boldsymbol{\Psi}$ with $B_{n}\geq 5\lambda^{2}/\nu$ such that $\mathbb{P}(\tilde{A}_{p}/A_{p}\geq 1+cp^{1-\nu}/n)\rightarrow 1$ as $(n,p)\rightarrow\infty$ .

Proposition 2.2 shows that $\tilde{A}_{p}$ will over-estimate $A_{p}$ when $p\gg n$ . If $\tilde{A}_{p}$ is used to estimate $A_{p}$ , then the resultant testing approach will be less powerful than the test with our estimator $\hat{A}_{p}$ . We will further show the impact of $\tilde{A}_{p}$ on the power in the simulation.

2.3 Type I error rate control and optimality of statistical power

The following proposition gives the limiting distribution of $T_{ij}$ .

Proposition 2.3.

Assume that $p\geq cn$ for some constant $c>0$ (which does not depend on $n$ and $p$ ) and (C1) holds. Under the null $H_{0}$ , for $t\in\mathbb{R}$ , we have as $(n,p)\rightarrow\infty$

[TABLE]

In Proposition 2.3, the test statistic $\frac{T_{ij}}{\sqrt{\hat{\psi}_{ii}\hat{\psi}_{jj}}}$ can be viewed as a sample correlation coefficient related to $\psi_{ij}$ . We first note that Proposition 2.3 is not implied by Theorem 4 in Cai and Jiang (2011). Let us denote the sample correlation coefficient by

[TABLE]

Cai and Jiang (2011) established the limiting distribution of $\max_{|i-j|\geq\tau}|\hat{\rho}_{ij}|$ for $\tau\geq 1$ . Their result requires that the $n$ random vectors $(X_{ki},X_{kj})$ for $1\leq k\leq n$ in the sum $\sum_{k=1}^{n}(X_{ki}-\bar{X}_{i})(X_{kj}-\bar{X}_{j})$ in (15) are i.i.d. In contrast, our statistic $T_{ij}$ is based on $\sum_{k=1}^{p}X_{ik}X_{jk}$ , which is a sum of $p$ potentially correlated random variables under both the null and the alternatives.

In addition, it is worthwhile to note that Cai and Jiang (2012) revealed an interesting phase transition phenomenon in the limiting distribution of the largest off-diagonal entry of the sample correlation matrix: there are different regimes for large $p$ , and the limiting distribution differs across them. In contrast, in our problem, there is no such phase transition and the limiting distribution is unified in the high-dimensional setting when $p\geq cn$ . To see this more clearly, let us assume that $\textbf{X}^{(k)}$ for $k=1,\ldots,p,$ are independent so that the results in Cai and Jiang (2012) are valid. Now, the quantity $p$ is the sample size and $n$ is the dimension. According to Corollary 2.2 of Cai and Jiang (2012), there is a phase transition phenomenon for the distribution of the statistic in Proposition 2.3 (i.e., $\frac{p}{A_{p}}\max_{1\leq i<j\leq n}\frac{T_{ij}^{2}}{\hat{\psi}_{ii}\hat{\psi}_{jj}}-4\log n+\log\log n$ ) between two regimes $\frac{1}{\sqrt{p}}\log n\rightarrow 0$ and $\frac{1}{\sqrt{p}}\log n\rightarrow\alpha\in(0,\infty)$ . In our high-dimensional setting, we have $p\geq cn$ , which belongs to the first regime $\frac{1}{\sqrt{p}}\log n\rightarrow 0$ . Thus, there is no phase transition phenomenon in the high-dimensional setting.

Using Theorem 2.1, we provide the limiting null distribution of our test statistic $\hat{T}_{n,p}$ in the next theorem.

Theorem 2.4.

Assume that (C1)-(C3) hold. Under the null $H_{0}$ , we have

[TABLE]

for $t\in\mathbb{R}$ , as $(n,p)\rightarrow\infty$ .

Remark: In Theorem 2.4, we need the additional assumption $p=\exp(o(n^{\gamma}))$ in (C3), which is used to obtain a ratio-consistent estimator of $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ and $A_{p}$ . If we consider only the limiting distribution of the test statistic under the null (i.i.d. samples), one may use the method from Chen and Qin (2010) to estimate $\|{\boldsymbol{\Sigma}}\|_{\text{F}}$ . The estimator from Chen and Qin (2010) does not require the condition $p=\exp(o(n^{\gamma}))$ . However, in terms of statistical power, as we have shown in our simulations, their estimator will over-estimate $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ (see Figure 6 in Appendix) and reduce the power (see Figure 1) especially when the correlation among samples is strong. Our estimator is ratio-consistent for both null and alternative (see Theorem 2.1) under the extra condition $p=\exp(o(n^{\gamma}))$ . For the thresholding estimator, such a condition on $p$ is necessary. To see this, if $\log p$ is much larger than $n$ , then the thresholding level in (12) is much larger than one. Thus, $\hat{{\boldsymbol{\Sigma}}}_{thr}$ becomes diag $(\hat{{\boldsymbol{\Sigma}}})$ and $\|\hat{{\boldsymbol{\Sigma}}}_{thr}\|_{\text{F}}^{2}$ will no longer be consistent. As a future direction, it would be interesting to construct a consistent estimator for $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ and $A_{p}$ under the null and alternative simultaneously without the restriction on $p$ .

According to Theorem 2.4, for a given significance level $0<\alpha<1$ , we reject the null hypothesis whenever $\hat{T}_{n,p}\geq q_{\alpha}+4\log n-\log\log n,$ where $q_{\alpha}$ is the $1-\alpha$ quantile of the type I extreme value distribution with the cumulative distribution function (CDF) $\exp\left(-\frac{1}{\sqrt{8\pi}}\exp\Big{(}-\frac{x}{2}\Big{)}\right)$ , i.e.,

[TABLE]
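The critical value can be computed directly by inverting the CDF stated above. The sketch below solves $\exp\left(-\frac{1}{\sqrt{8\pi}}\exp(-\frac{q}{2})\right)=1-\alpha$ in closed form, which gives $q_{\alpha}=-\log(8\pi)-2\log\log(1-\alpha)^{-1}$, and applies the rejection rule $\hat{T}_{n,p}\geq q_{\alpha}+4\log n-\log\log n$. The function names are hypothetical and the statistic `t_hat` is assumed to be computed elsewhere.

```python
import math

def extreme_value_cdf(x):
    """CDF exp(-(8*pi)**-0.5 * exp(-x / 2)) of the limiting distribution in the text."""
    return math.exp(-math.exp(-x / 2.0) / math.sqrt(8.0 * math.pi))

def q_alpha(alpha):
    """The (1 - alpha) quantile, obtained by solving extreme_value_cdf(q) = 1 - alpha."""
    return -math.log(8.0 * math.pi) - 2.0 * math.log(-math.log(1.0 - alpha))

def reject_independence(t_hat, n, alpha=0.05):
    """Reject H0 when T-hat >= q_alpha + 4 log n - log log n."""
    return t_hat >= q_alpha(alpha) + 4.0 * math.log(n) - math.log(math.log(n))

print(q_alpha(0.05), extreme_value_cdf(q_alpha(0.05)))
```

Since the cutoff depends only on $n$ and $\alpha$, no resampling or tuning is needed once $\hat{T}_{n,p}$ is available.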

Theorem 2.4 shows that the proposed test statistic controls the type I error rate at the nominal level asymptotically.

We now turn to the power analysis. For a given pair of $1\leq i<j\leq n$ , let us define

[TABLE]

and

[TABLE]

The next theorem shows that for a large class of $\boldsymbol{\Psi}$ , the null hypothesis will be rejected by our test with probability tending to one.

Theorem 2.5.

Assume that (C1)-(C3) hold and suppose that for some $\delta>2$ and all large enough $n$ and $p$ ,

[TABLE]

We have $\mathbb{P}\Big{(}\hat{T}_{n,p}-4\log n+\log\log n\geq q_{\alpha}\Big{)}\rightarrow 1$ as $(n,p)\rightarrow\infty$ .

We next show that our test statistic is minimax rate optimal for statistical power even when $\boldsymbol{\mu}$ and ${\boldsymbol{\Sigma}}$ are known. To this end, we introduce a class of covariance matrices for $\boldsymbol{\Psi}$ , denoted $\mathcal{F}(\delta)$ for some $\delta>0$ , as follows:

[TABLE]

Let $\mathcal{T}_{\alpha}$ be the set of $\alpha$ -level tests with $\boldsymbol{\mu}$ and ${\boldsymbol{\Sigma}}$ being known, i.e. $\mathcal{T}_{\alpha}=\{T_{\alpha}:\mathbb{P}(T_{\alpha}=1|H_{0})\leq\alpha\}.$ Here $T_{\alpha}=1$ means the rejection of $H_{0}$ .

Theorem 2.6.

Let $\alpha,\beta>0$ and $\alpha+\beta<1$ . Assume that (C3) holds. For any $\delta<2$ , we have

[TABLE]

Theorem 2.6 shows that for any $\alpha$ -level test $T_{\alpha}$ and any $\delta<2$ , there must exist a covariance matrix $\boldsymbol{\Psi}\in\mathcal{F}(\delta)$ such that the probability of rejecting the null is less than $\alpha+\varepsilon$ asymptotically for any $\varepsilon>0$ . Theorems 2.5 and 2.6 together show that the proposed test based on $\hat{T}_{n,p}$ is minimax rate optimal by noting that $1\leq A_{p}\leq C$ for some constant $C>0$ according to the condition (C1). In other words, the order of the lower bound $\sqrt{\frac{\log n}{p}}$ on $d_{n,\boldsymbol{\Psi}}$ cannot be improved, which establishes the minimax-optimal rate for the test. Moreover, when ${\boldsymbol{\Sigma}}=\textbf{I}_{p\times p}$ , we have $A_{p}=1$ and, hence, our test statistic is also minimax constant optimal.

We further show that (20) covers a rather wide class of $\boldsymbol{\Psi}$ in the sense that if (20) does not hold, it will be safe to assume independence for some applications. In particular, assume that $\boldsymbol{\Psi}$ is an $s_{n}$ -sparse matrix, i.e., the number of nonzero elements in each row of $\boldsymbol{\Psi}$ is bounded from above by $s_{n}$ . Then by (18),

[TABLE]

Thus, a sufficient condition for (20) to hold is $s_{n}=o(n)$ and

[TABLE]

Theorem 2.5 shows that under (22), the null hypothesis will be rejected with probability tending to 1. In fact, when (22) does not hold, the samples can be safely treated as independent for some applications. Let us take the multiple testing problem of correlations in (3) as an example. As we discussed in the introduction, the effect of the correlation among samples is quantified by $B_{n}=\frac{\|\boldsymbol{\Psi}\|_{\text{F}}^{2}}{n}$ and when $B_{n}\rightarrow 1$ , the limiting distribution of $\hat{\rho}_{ij}$ in (4) will be the same as the limiting distribution of $\hat{\rho}_{ij}$ estimated from independent samples. Indeed, when (22) does not hold and $p\geq cn^{\gamma}$ for some $\gamma>1$ , then we have $B_{n}\rightarrow 1$ (note that $\psi_{ii}=1$ for $1\leq i\leq n$ ). Thus, the correlation among samples is asymptotically negligible.

We next give a more general result on the relation between the lower bound of $\max_{1\leq i<j\leq n}|\psi_{ij}|$ , $n$ and $p$ . Here we only assume that $n\rightarrow\infty$ and $p$ is a function of $n$ (note that $p$ can be a constant). Let

[TABLE]

Theorem 2.7.

Let $\alpha,\beta>0$ and $\alpha+\beta<1$ . For any $a$ and $p$ satisfying

[TABLE]

we have

[TABLE]

Theorem 2.7 shows that when the dimension $p$ is fixed, it is impossible to reject $H_{0}$ correctly for all $\boldsymbol{\Psi}\in\mathcal{G}(a)$ with probability greater than $\alpha+\varepsilon$ , even when the lower bound on $\max_{1\leq i<j\leq n}|\psi_{ij}|$ is close to one. This is easy to understand since the roles of $n$ and $p$ are interchanged in our setting (we are testing an $n\times n$ covariance matrix $\boldsymbol{\Psi}$ with $p$ row samples). It also indicates that the independence test problem (1) is essentially different from the serial independence test in time series analysis. When $a=c/\sqrt{n}$ for some constant $c>0$ , we must require $p\geq c_{1}n\log n$ for some $c_{1}>0$ so that the independence testing problem (1) is solvable over $\mathcal{G}(a)$ . Note that Pan, Gao and Yang (2014) require $0<\lim_{n\rightarrow\infty}\frac{p}{n}<\infty$ , which means that their method cannot handle the setting $a\leq c/\sqrt{n}$ . On the other hand, by (22), such a setting of minimum signal $a\leq c/\sqrt{n}$ can be handled by the proposed test.

3 Multiple testing of correlations with correlated observations

As we mentioned in the introduction, when the independence hypothesis in (1) is rejected, there is a potential risk in using inference methods developed under the independence assumption. To illustrate the effect of sample correlations, we study an important high-dimensional problem, the large-scale multiple testing of correlations when the samples are correlated, i.e.,

[TABLE]

When the samples are i.i.d. and normally distributed, the following classical result from Anderson (2003) (Theorem 4.2.4) establishes the limiting distribution of the sample correlation coefficient $\hat{\rho}_{ij}$ ,

$\displaystyle\frac{\sqrt{n}(\hat{\rho}_{ij}-\rho_{ij})}{1-\rho_{ij}^{2}}\Rightarrow N(0,1).$

(25)

However, when the samples are correlated, the limiting distribution of $\hat{\rho}_{ij}$ in (25) does not hold. In fact, we can prove the following proposition.

Proposition 3.1.

Assume that the condition (C1) holds. We have

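$\displaystyle\frac{\sqrt{n}(\hat{\rho}_{ij}-\rho_{ij})}{\sqrt{B_{n}}(1-\rho_{ij}^{2})}\Rightarrow N(0,1),$

(26)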

where $B_{n}=\frac{\|\boldsymbol{\Psi}\|^{2}_{\text{F}}}{n}$ .

The term $B_{n}$ is the same quantity as in (11), which represents the average correlation among the $n$ samples. When the sample correlation is strong enough that $B_{n}\geq 1+c>1$ , the multiple testing procedure based on (25) (e.g., the Benjamini-Hochberg (BH) procedure, Benjamini and Hochberg (1995)) will lead to many false positives. In fact, even when the correct limiting distribution in (26) is used, the resulting test will lose statistical power. For simplicity, let us consider a single testing problem $H_{0ij}:\rho_{ij}=0$ . To control the type I error rate when the samples are correlated, we need a larger critical value for $\hat{\rho}_{ij}$ , which is linear in $\sqrt{B_{n}}$ . That is, the rejection region should be $\{\hat{\rho}_{ij}:\sqrt{n}|\hat{\rho}_{ij}|\geq\sqrt{B_{n}}\Phi^{-1}(1-\alpha/2)\}$ , where $\Phi(\cdot)$ is the standard normal CDF. Plugging in a ratio-consistent estimator of $B_{n}$ (e.g., using the method developed in Section 2.2 to estimate $\|\boldsymbol{\Psi}\|_{\text{F}}^{2}$ ), we will obtain a test that controls the type I error rate asymptotically. However, such a test will lose statistical power since the length of the acceptance region grows with the strength of the correlation among samples.
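To make the size distortion concrete, the sketch below evaluates the rejection region stated above and the asymptotic type I error of the naive test that keeps the i.i.d. cutoff $\Phi^{-1}(1-\alpha/2)$ when in fact $\sqrt{n}\hat{\rho}_{ij}$ is approximately $N(0,B_{n})$ under the null. The function names are hypothetical and $B_{n}$ is treated as known.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def critical_value(n, b_n=1.0, alpha=0.05):
    """Two-sided cutoff for |rho_hat_ij|: sqrt(B_n / n) * Phi^{-1}(1 - alpha / 2)."""
    return (b_n / n) ** 0.5 * _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)

def naive_type1_error(b_n, alpha=0.05):
    """Asymptotic size of the test that uses the B_n = 1 cutoff when the truth is B_n."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / b_n ** 0.5))

for b in (1.0, 2.0, 4.0):
    print(b, critical_value(100, b), naive_type1_error(b))
```

The corrected cutoff grows like $\sqrt{B_{n}}$, which restores the nominal size but widens the acceptance region; this is the loss of power that motivates the de-correlation approach.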

In this section, we propose a multiple testing procedure that asymptotically controls the FDR at the nominal level while maintaining good statistical power. Our method is based on the construction of a “sandwich estimator” of $\rho_{ij}$ by de-correlating the samples. In particular, first assume that $\boldsymbol{\mu}$ and $\boldsymbol{\Psi}$ are known. We transform the data X into $\textbf{Y}=(\boldsymbol{Y}_{1},\ldots,\boldsymbol{Y}_{n}):=(\textbf{X}-\boldsymbol{\mu}\textbf{1}^{\prime})\boldsymbol{\Psi}^{-1/2}\sim N(0,{\boldsymbol{\Sigma}}\otimes\textbf{I}_{n\times n})$ and columns $\boldsymbol{Y}_{k}\in\mathbb{R}^{p}$ for $1\leq k\leq n$ are i.i.d. from $N(0,{\boldsymbol{\Sigma}})$ . The corresponding “sample” covariance matrix of Y is (“sample” is quoted here since $\boldsymbol{\mu}$ and $\boldsymbol{\Psi}$ are unknown and thus $(\tilde{\sigma}_{ij,Y})_{p\times p}$ is not a real sample covariance matrix):

[TABLE]

Let $\tilde{\rho}_{ij,Y}=\frac{\tilde{\sigma}_{ij,Y}}{\sqrt{\tilde{\sigma}_{ii,Y}\tilde{\sigma}_{jj,Y}}}$ be the “sample” correlation coefficient matrix. By (25), we have

$\displaystyle\frac{\sqrt{n}(\tilde{\rho}_{ij,Y}-\rho_{ij})}{1-\rho_{ij}^{2}}\Rightarrow N(0,1),$

(28)

which implies that the performance of the test statistic $\tilde{\rho}_{ij,Y}$ is the same as that of $\hat{\rho}_{ij}$ for independent samples. By comparing (28) and (26), the asymptotic variance of the sandwich estimator $\tilde{\rho}_{ij,Y}$ is no larger than that of the sample correlation coefficient since $B_{n}\geq 1$ , and it is strictly smaller whenever $B_{n}>1$ . Therefore, even when $B_{n}$ is bounded by a constant, the sandwich estimator is more powerful.
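The oracle de-correlation step with known $\boldsymbol{\mu}$ and $\boldsymbol{\Psi}$ can be sketched as follows. This is an illustration only: $\boldsymbol{\Psi}^{-1/2}$ is computed through an eigendecomposition (one of several valid choices), the data are stored as a $p\times n$ array, and the function names and the AR(1)-type $\boldsymbol{\Psi}$ are hypothetical.

```python
import numpy as np

def inverse_sqrt(psi):
    """Symmetric Psi^{-1/2} via the eigendecomposition of a positive definite Psi."""
    vals, vecs = np.linalg.eigh(psi)
    return (vecs / np.sqrt(vals)) @ vecs.T

def decorrelate(x, mu, psi):
    """Y = (X - mu 1') Psi^{-1/2} for a p x n data matrix X."""
    return (x - mu[:, None]) @ inverse_sqrt(psi)

n, p = 6, 4
idx = np.arange(n)
psi = 0.5 ** np.abs(idx[:, None] - idx[None, :])
rng = np.random.default_rng(0)
x = rng.standard_normal((p, n)) @ np.linalg.cholesky(psi).T
y = decorrelate(x, np.zeros(p), psi)
print(np.round(inverse_sqrt(psi) @ psi @ inverse_sqrt(psi), 6))
```

Because $\boldsymbol{\Psi}^{-1/2}\boldsymbol{\Psi}\boldsymbol{\Psi}^{-1/2}=\mathbf{I}_{n\times n}$, the columns of `y` are uncorrelated, and $\textbf{Y}\textbf{Y}^{\prime}$ equals $(\textbf{X}-\boldsymbol{\mu}\textbf{1}^{\prime})\boldsymbol{\Psi}^{-1}(\textbf{X}-\boldsymbol{\mu}\textbf{1}^{\prime})^{\prime}$, which is where the "sandwich" form comes from.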

To obtain an estimate of $\tilde{\rho}_{ij,Y}$ , we need to estimate $\boldsymbol{\mu}$ and $\boldsymbol{\Psi}^{-1}:=(\gamma_{ij})_{n\times n}$ . Let $\hat{\boldsymbol{\mu}}=(\bar{X}_{1},\ldots,\bar{X}_{p})^{{}^{\prime}}$ be the estimator of $\boldsymbol{\mu}$ , where $\bar{X}_{i}=\frac{1}{n}\sum_{k=1}^{n}X_{ki}$ for $1\leq i\leq p$ . For estimating $\boldsymbol{\Psi}^{-1}$ , we adopt the CLIME estimator proposed in Cai, Liu and Luo (2011). In particular, following Cai, Liu and Luo (2011), we assume that $\boldsymbol{\Psi}^{-1}$ is a weakly sparse matrix, which belongs to the class,

[TABLE]

where $0\leq q<1/2$ , $\|\boldsymbol{\Psi}^{-1}\|_{l_{1}}=\max_{1\leq j\leq n}\sum_{i=1}^{n}|\gamma_{ij}|$ and the relationship among $M_{n}$ , $N_{n}$ and $s_{n}$ will be specified in the condition of Theorem 3.2. Let $\hat{\mathbb{R}}_{\boldsymbol{\Psi}}=(\hat{\psi}_{ij})_{n\times n}$ , where $\hat{\psi}_{ij}$ is defined in (5), and $\hat{\boldsymbol{\Gamma}}^{1}=(\hat{\gamma}^{1}_{ij})_{n\times n}$ be any optimal solution of the following optimization problem,

[TABLE]

Here, $\lambda_{n,p}=cM_{n}(\frac{N_{n}}{n}+\sqrt{\frac{\log n}{p}})$ , $c$ is a sufficiently large constant, $\|\boldsymbol{\Gamma}\|_{1}=\sum_{1\leq i,j\leq n}|\gamma_{ij}|$ and $\|\textbf{A}\|_{\infty}=\max_{1\leq i,j\leq n}|a_{ij}|$ for matrix $\textbf{A}=(a_{ij})_{n\times n}$ . We note that in the estimation of $\boldsymbol{\Psi}^{-1}$ , each row of X is treated as a sample, and thus the sample size is $p$ and the dimensionality is $n$ . The estimator of $\boldsymbol{\Psi}^{-1}$ , $\hat{\boldsymbol{\Gamma}}=(\hat{\gamma}_{ij})_{n\times n}$ , is obtained by a symmetrization of $\hat{\boldsymbol{\Gamma}}^{1}$ : $\hat{\gamma}_{ij}=\hat{\gamma}_{ij}^{1}I\{|\hat{\gamma}_{ij}^{1}|\leq|\hat{\gamma}_{ji}^{1}|\}+\hat{\gamma}_{ji}^{1}I\{|\hat{\gamma}_{ij}^{1}|>|\hat{\gamma}_{ji}^{1}|\}.$ Based on the estimated $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Gamma}}$ , we define the “sandwich estimator” of $(\tilde{\sigma}_{ij,Y})_{p\times p}$ , $(\hat{\sigma}_{ij,Y})_{p\times p}=\frac{1}{n}(\textbf{X}-\hat{\boldsymbol{\mu}}\textbf{1}^{\prime})\hat{\boldsymbol{\Gamma}}(\textbf{X}-\hat{\boldsymbol{\mu}}\textbf{1}^{\prime})^{\prime}$ with each

[TABLE]

where $\textbf{X}_{\cdot,i}=(X_{1i},\ldots,X_{ni})^{\prime}$ is the $i$ -th column of X. The corresponding correlation coefficient

[TABLE]
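Given some estimate of $\boldsymbol{\Psi}^{-1}$, the symmetrization rule and the sandwich correlation above can be sketched as below. The sketch does not solve the CLIME optimization problem; `gamma1` stands for whatever solution that step returns, the data are stored as a $p\times n$ array with samples in columns, and the function names are hypothetical.

```python
import numpy as np

def symmetrize(gamma1):
    """For each (i, j), keep whichever of gamma1[i, j], gamma1[j, i] is smaller in magnitude."""
    return np.where(np.abs(gamma1) <= np.abs(gamma1.T), gamma1, gamma1.T)

def sandwich_correlation(x, gamma_hat):
    """x: p x n data matrix; gamma_hat: n x n estimate of Psi^{-1}."""
    n = x.shape[1]
    xc = x - x.mean(axis=1, keepdims=True)        # X - mu_hat 1'
    cov = xc @ gamma_hat @ xc.T / n               # sandwich estimate (sigma_hat_{ij,Y})
    scale = np.sqrt(np.diag(cov))
    return cov / np.outer(scale, scale)           # rho_hat_{ij,Y}

g1 = np.array([[2.0, 0.5], [-1.0, 3.0]])
print(symmetrize(g1))

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 30))
print(np.allclose(sandwich_correlation(x, np.eye(30)), np.corrcoef(x)))
```

As a sanity check, plugging in the identity matrix for $\hat{\boldsymbol{\Gamma}}$ recovers the ordinary sample correlation matrix, so the sandwich estimator differs from $\hat{\rho}_{ij}$ only through the estimated dependence among samples.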

We note that the “sandwich estimator” $\hat{\rho}_{ij,Y}$ is related to the Knorm correlation proposed by Teng and Huang (2009), which estimates $\boldsymbol{\Psi}^{-1}$ in $\tilde{\rho}_{ij,Y}$ by the inverse of the maximum likelihood estimator (MLE) of $\boldsymbol{\Psi}$ . However, there is no closed-form solution for the MLE of the matrix-variate normal distribution, so it is difficult to develop limiting distribution results for the Knorm correlation in high-dimensional settings.

In the proof of Theorem 3.2, we will show that $\sqrt{n}\max_{1\leq i\leq j\leq p}|\hat{\rho}_{ij,Y}-\tilde{\rho}_{ij,Y}|=o_{\mathbb{P}}(1/\sqrt{\log p})$ . Combining it with (28), we have $\frac{\sqrt{n}(\hat{\rho}_{ij,Y}-\rho_{ij})}{1-\rho^{2}_{ij}}\Rightarrow N(0,1)$ . Therefore, for each single test problem $H_{0ij}:\rho_{ij}=0$ , we propose the test statistic,

[TABLE]

and the null $H_{0ij}$ is rejected when $|\hat{T}_{ij}|\geq t$ for some threshold level $t>0$ .

To implement the large-scale multiple testing of correlations, we adopt the popular BH method (Benjamini and Hochberg, 1995). In particular, we need to search for a threshold $\hat{t}$ for $|\hat{T}_{ij}|$ that controls the false discovery proportion (FDP) and false discovery rate (FDR) defined as follows while rejecting as many hypotheses as possible,

[TABLE]

where $\mathcal{H}_{0}=\{(i,j):~{}\rho_{ij}=0,~{}1\leq i<j\leq p\}$ is the set of null. Therefore, an ideal choice of the threshold level for a pre-specified significance level $0<\alpha<1$ should be

[TABLE]

The oracle threshold level $\hat{t}_{orc}$ cannot be computed since $\mathcal{H}_{0}$ is unknown. Nevertheless, since $\hat{T}_{ij}\Rightarrow N(0,1)$ under the null $\rho_{ij}=0$ , the numerator in (33), $\sum_{(i,j)\in\mathcal{H}_{0}}I\{|\hat{T}_{ij}|\geq t\}$ , can be approximated by $2(1-\Phi(t))Card(\mathcal{H}_{0})$ . The quantity $Card(\mathcal{H}_{0})$ can be further bounded from above by $(p^{2}-p)/2$ and such an upper bound is good when ${\boldsymbol{\Sigma}}$ is sparse, which is a common setup. Therefore, we propose the following threshold level $\hat{t}$ and the corresponding multiple testing procedure.

For a given $0<\alpha<1$ , let

$\displaystyle\hat{t}=\inf\Big{\{}t\geq 0:\frac{(1-\Phi(t))(p^{2}-p)}{\max\{\sum_{1\leq i<j\leq p}I\{|\hat{T}_{ij}|\geq t\},1\}}\leq\alpha\Big{\}}.$

(34)

For $1\leq i<j\leq p$ , we reject $H_{0ij}$ if $|\hat{T}_{ij}|\geq\hat{t}$ .
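A minimal sketch of how the threshold in (34) could be computed is given below, using only the Python standard library. The function name `bh_threshold` and its arguments are hypothetical. The sketch relies on the observation that, on any interval where the number of rejections equals $k$, the condition in (34) holds exactly when $t$ is at least the upper-tail normal quantile $q_k$ with $1-\Phi(q_k)=\alpha k/(p^{2}-p)$, so it suffices to find the largest $k$ for which $q_k$ does not exceed the $k$-th largest $|\hat{T}_{ij}|$.

```python
from statistics import NormalDist

def bh_threshold(abs_stats, p, alpha=0.05):
    """Return inf{t >= 0 : (1 - Phi(t)) (p^2 - p) / max(R(t), 1) <= alpha},
    where R(t) counts the entries of abs_stats that are >= t.

    abs_stats: the values |T_ij| for 1 <= i < j <= p (at most (p^2 - p)/2 of them).
    """
    a = sorted(abs_stats, reverse=True)   # a[0] >= a[1] >= ...
    m = p * p - p                         # equals 2h in the notation of the text
    nd = NormalDist()
    # Scan k from large to small; q_k solves 1 - Phi(q_k) = alpha * k / m.
    for k in range(len(a), 0, -1):
        q_k = -nd.inv_cdf(alpha * k / m)
        if q_k <= a[k - 1]:
            return max(q_k, 0.0)
    # No k qualifies: nothing is rejected and the condition reduces to the k = 1 quantile.
    return -nd.inv_cdf(alpha / m)
```

Hypotheses with `abs_stats` value at least the returned threshold are then rejected. Using the upper-tail form `-inv_cdf(x)` rather than `inv_cdf(1 - x)` avoids loss of precision when $p$ is large and `x` is tiny.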

The next theorem shows that the proposed procedure controls the FDP and FDR at level $\alpha$ asymptotically. Recall the definition of $\mathcal{H}_{0}$ . Let $h_{0}=Card(\mathcal{H}_{0})$ , $\mathcal{H}_{1}=\{(i,j):~{}\rho_{ij}\neq 0,~{}1\leq i<j\leq p\}$ , $h_{1}=Card(\mathcal{H}_{1})$ and $h=(p^{2}-p)/2$ . For a given $\gamma>0$ , we further define the following sets

[TABLE]

Theorem 3.2.

Assume that the condition $(C1)$ holds, $p\leq n^{r}$ for some $r>0$ , and $\boldsymbol{\Psi}^{-1}\in\mathcal{G}$ defined in (29) with

[TABLE]

Suppose that $h_{1}\leq\kappa h$ for some $\kappa<1$ ,

[TABLE]

and $\max_{1\leq i\leq p}Card(\mathcal{A}_{i}(\gamma))=O(p^{\rho})$ for some $\rho<1/2$ and $\gamma>0$ . We have

[TABLE]

We briefly comment on the conditions in Theorem 3.2. We note that in the estimation of $\boldsymbol{\Psi}^{-1}$ , $n$ plays the role of dimensionality and $p$ plays the role of the sample size. The condition in (36) ensures that $p$ is sufficiently large so that the estimation of $\boldsymbol{\Psi}^{-1}$ is accurate. On the other hand, the assumption that $p$ is sufficiently large is also natural for high-dimensional applications (e.g., genetic studies). The assumption that $h_{1}\leq\kappa h$ for some $\kappa<1$ is necessary: if $h_{0}=o(h)$ , then almost all of the $\rho_{ij}$ are nonzero and simply rejecting all the hypotheses will lead to $\text{FDR}\rightarrow 0$ . The condition in (37), which is only slightly stronger than the condition that the number of true alternatives goes to infinity, is a nearly necessary condition. In fact, Proposition 2.1 in Liu and Shao (2014) shows that if the number of true alternatives is fixed, then it is impossible for the BH method to control the FDP with probability tending to one at any desired level. The condition on $\max_{1\leq i\leq p}Card(\mathcal{A}_{i}(\gamma))$ is essentially a sparsity condition for ${\boldsymbol{\Sigma}}$ . In particular, when $p\geq n^{r_{1}}$ with $r_{1}>1$ and the number of nonzero entries in each row of ${\boldsymbol{\Sigma}}$ is on the order of $\sqrt{n}$ (which is a common assumption for sparse ${\boldsymbol{\Sigma}}$ ), the condition on $\max_{1\leq i\leq p}Card(\mathcal{A}_{i}(\gamma))$ automatically holds.

4 Numerical results

In this section, we provide numerical results to demonstrate the performance of the proposed test methods. Due to space constraints, some simulations and real experiments are provided in Appendix. Recall that the $p\times n$ data matrix X follows a matrix-variate normal distribution $N(\boldsymbol{\mu}\mathbf{1}^{\prime},{\boldsymbol{\Sigma}}\otimes\boldsymbol{\Psi})$ . The matrix ${\boldsymbol{\Sigma}}$ (and $\boldsymbol{\Psi}$ ) is generated from one of the following classes of matrices:

1. Auto-correlation matrix where $\sigma_{ij}=\rho^{|i-j|}$ and $\rho$ is set to 0.2, 0.5 or 0.8. The larger the parameter $\rho$ is, the stronger the correlation.

2. Banded matrix (“band” for short) where $\sigma_{ii}=1$ , $\sigma_{i,i+1}=\sigma_{i+1,i}=0.6$ , $\sigma_{i,i+2}=\sigma_{i+2,i}=0.3$ , and $\sigma_{ij}=0$ for $|i-j|\geq 3$ .

3. Block diagonal matrix (“block” for short) where the main diagonal blocks are $10\times 10$ square matrices and the off-diagonal blocks are zero matrices. A $10\times 10$ main diagonal block $\textbf{B}=(b_{ij})_{10\times 10}$ has $b_{ii}=1$ and $b_{ij}=0.5$ when $i\neq j$ .

In simulations, we fix $\boldsymbol{\mu}=0$ and the level of significance $\alpha=0.05$ .
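The three classes of matrices above can be generated with a few lines of NumPy. The following is a sketch with hypothetical function names (`autocorr`, `banded`, `block_diag`); the handling of a dimension that is not a multiple of 10 in the block case (a smaller final block) is an assumption of this sketch.

```python
import numpy as np

def autocorr(d, rho):
    """Auto-correlation matrix with entries rho^{|i-j|}."""
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def banded(d):
    """Banded matrix: 1 on the diagonal, 0.6 at lag 1, 0.3 at lag 2, 0 for lag >= 3."""
    idx = np.arange(d)
    lag = np.abs(idx[:, None] - idx[None, :])
    values = np.array([1.0, 0.6, 0.3, 0.0])
    # All lags of 3 or more are mapped to the last entry (zero).
    return values[np.minimum(lag, 3)]

def block_diag(d, block=10, off=0.5):
    """Block diagonal matrix with block x block blocks: 1 on the diagonal, `off` elsewhere in a block."""
    idx = np.arange(d)
    same_block = (idx[:, None] // block) == (idx[None, :] // block)
    out = np.where(same_block, off, 0.0)
    np.fill_diagonal(out, 1.0)
    return out
```

For example, `autocorr(n, 0.5)` would give a choice of $\boldsymbol{\Psi}$ and `banded(p)` a choice of ${\boldsymbol{\Sigma}}$ in the settings considered here.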

4.1 Independence Test

We consider the independence test problem in (1). All the reported empirical sizes and powers are averaged over 5,000 independent replications. In Table 1, we consider relatively large $n$ and $p$ and show the empirical type I error rate (a.k.a. the empirical size) of the proposed test statistics $\hat{T}_{n,p}$ in (10) under the null when $\boldsymbol{\Psi}=\mathbf{I}_{n\times n}$ . From Table 1, as the sample size $n$ and dimension $p$ increase, the empirical type I error rates get closer to the nominal level of 0.05, which verifies the validity of the proposed test statistics shown in Theorem 2.4.

Recall that in the construction of $\hat{T}_{n,p}$ (in particular, in the term $\hat{A}_{p}$ ), we threshold the sample covariance matrix $\hat{{\boldsymbol{\Sigma}}}$ as in (12), where the threshold level involves the estimator $\hat{B}_{n}$ of $B_{n}=\|\boldsymbol{\Psi}\|_{\text{F}}^{2}/n$ . We compare the empirical sizes of the test statistics in the same form as $\hat{T}_{n,p}$ in (10) but using different estimators (listed as follows) of $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ in estimating $A_{p}=\frac{p\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}}{(tr({\boldsymbol{\Sigma}}))^{2}}$ :

1. CV (Fan, Rigollet and Wang, 2015): plugin estimator based on thresholded ${\boldsymbol{\Sigma}}$ with the threshold level tuned by cross validation (CV).

2. Bai: method proposed by Bai and Saranadasa (1996).

3. CQ: method proposed by Chen and Qin (2010).

4. $\hat{B}_{n}$ : plugin estimator based on thresholded ${\boldsymbol{\Sigma}}$ as in (12) with the proposed estimator $\hat{B}_{n}$ for setting the threshold level.

We would like to make it clear that for the ease of presentation, the acronyms CV, Bai and CQ refer to the proposed test statistics in the form of $\hat{T}_{n,p}$ while using the corresponding method to construct the estimator of $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ in $A_{p}$ .

In Table 2, we show the comparison results when $n=50$ or $100$ and $p=1000$ . We only present these smaller $n$ and $p$ cases since CV and CQ are computationally expensive for large $n$ and $p$ , and the cases $n=50$ or $100$ with $p=1000$ are sufficient to demonstrate the points below. CV cannot control the type I error below the nominal level 0.05. CQ yields the type I error rates that are closest to the nominal level. However, as we will show later, it has lower statistical power. The empirical sizes of the proposed test statistic (with the threshold level based on the estimator $\hat{B}_{n}$ ) are below the nominal level, which shows that the proposed test statistic is conservative when $n$ and $p$ are small. This results from the slow rate of convergence in distribution for max-type test statistics (Liu, Lin and Shao, 2008). For small $n$ and $p$ , one useful way to make the test less conservative is to adopt the critical value from a Monte-Carlo simulation instead of the one derived from the limiting distribution. In particular, we can generate $M$ (e.g., $M=10000$ in our simulation) replications of the $p\times n$ data matrix, where each one is randomly drawn from $N(\textbf{0},\mathbf{I}_{p\times p}\otimes\mathbf{I}_{n\times n})$ under the null. We compute the corresponding test statistics $\hat{T}^{(i)}_{n,p}$ , $1\leq i\leq M$ , for each randomly generated data matrix and let $c_{\alpha}$ be the $(1-\alpha)$ -quantile of the empirical distribution $\frac{1}{M}\sum_{i=1}^{M}I\{\hat{T}^{(i)}_{n,p}\leq t\}$ . We reject the null whenever our test statistic satisfies $\hat{T}_{n,p}\geq c_{\alpha}$ (note that the statistic is the same and only the critical value is changed). As shown in the additional experimental results in Section D.1 in the Appendix, using a Monte-Carlo based critical value pushes the empirical size closer to the nominal $\alpha$ when $n$ and $p$ are small.
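The Monte-Carlo calibration described in this paragraph can be sketched as follows. The function name `mc_critical_value` is hypothetical, and the test statistic is passed in as a user-supplied callable `stat_fn` because the definition of $\hat{T}_{n,p}$ in (10) is not reproduced in this sketch.

```python
import math
import numpy as np

def mc_critical_value(stat_fn, p, n, alpha=0.05, M=10000, seed=0):
    """(1 - alpha)-quantile of the empirical distribution of stat_fn over M null draws.

    stat_fn: callable mapping a p x n data matrix to a scalar test statistic.
    Each draw has i.i.d. N(0, 1) entries, i.e. Sigma = I and Psi = I.
    """
    rng = np.random.default_rng(seed)
    stats = np.array([stat_fn(rng.standard_normal((p, n))) for _ in range(M)])
    # Smallest order statistic t with empirical CDF(t) >= 1 - alpha;
    # the tiny offset guards against floating-point rounding in (1 - alpha) * M.
    k = max(1, math.ceil((1 - alpha) * M - 1e-12))
    return float(np.sort(stats)[k - 1])
```

The null is then rejected when the statistic computed on the observed data is at least the returned value. Any stand-in passed as `stat_fn` for experimentation would only exercise the calibration logic and would not reproduce the paper's results.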

Then, we compare the statistical power of the proposed test procedure when using different estimators of $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ in our test statistic. In particular, we first consider $\boldsymbol{\Psi}=(\rho^{|i-j|})_{n\times n}$ and vary the parameter $\rho$ from 0.55 to 0.85. The larger $\rho$ is, the stronger the correlation among samples. For different types of ${\boldsymbol{\Sigma}}$ , the empirical powers are all 100% for our method (Figure 1). The powers using Bai and CQ drop to zero when $\rho$ becomes larger than $0.7$ . Since both methods for estimating $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ are developed under the i.i.d. assumption, when the sample correlation becomes stronger, the estimation of $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ is inaccurate, which leads to inferior statistical power. The CV-based thresholding method maintains 100% statistical power over a wider range of $\rho$ . However, we note that CV fails to control the type I error rate as shown in Table 2. In Section D.2 in the Appendix, we further demonstrate the superiority of using the proposed estimator for $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ in terms of empirical powers when $\boldsymbol{\Psi}$ is a block diagonal matrix.

It is also of interest to investigate the performance of the proposed test statistics when $n$ and $p$ are comparable. We vary $p$ from 50 to 2,000 and consider four settings for the sample size, $n=0.5p$ , $n=p$ , $n=2p$ and $n=3p$ . We set ${\boldsymbol{\Sigma}}=(0.5^{|i-j|})_{i,j}$ and show the empirical type I error rates and powers for different $\boldsymbol{\Psi}$ in Figure 2. As one can see from Figure 2(a), the empirical type I error rates are approaching the nominal level $\alpha=0.05$ as $p$ increases. Notably, when the ratio between $n$ and $p$ increases, the test statistic becomes more conservative. From Figure 2(b)-2(d), although the powers are low when $p$ is very small (i.e., $p=50$ ), they are 100% for moderate and large $p$ . This simulation study suggests that the proposed independence test performs reasonably well when $n$ and $p$ are comparable.

Moreover, we consider the setting in which ${\boldsymbol{\Sigma}}$ does not satisfy the conditions (C1)-(C2). In particular, we choose an equicorrelation covariance matrix ${\boldsymbol{\Sigma}}=0.85\cdot\mathbf{1}\mathbf{1}^{\prime}+0.15\cdot\mathbf{I}_{p\times p}$ , which enforces very strong correlation among every pair of variables. It is easy to see that $\lambda_{\max}({\boldsymbol{\Sigma}})=0.85\cdot p+0.15$ and $\sum_{k=1}^{p}|\sigma_{jk}|^{\tau}=1+0.85^{\tau}\cdot(p-1)$ . Both quantities are linear in $p$ and thus cannot be bounded by constants as $p$ grows, which violates the assumptions (C1)-(C2). Hence, it is expected that the results of type I error rate control in Proposition 2.3 and Theorem 2.4 will not hold for this model. This is verified by our simulation study. In particular, we vary $n$ and $p$ and show the type I error rates in Figure 3. When $p$ grows and the conditions (C1)-(C2) no longer hold, the type I error rate exceeds the nominal level $\alpha=0.05$ (represented by the green line).
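The two closed-form quantities for the equicorrelation matrix can be checked numerically. The sketch below uses a hypothetical helper `equicorr` and takes $\tau=1$ purely as an illustrative value of the exponent.

```python
import numpy as np

def equicorr(p, r=0.85):
    """Equicorrelation matrix r * 11' + (1 - r) * I, with unit diagonal."""
    return r * np.ones((p, p)) + (1.0 - r) * np.eye(p)

tau = 1.0  # illustrative choice of the exponent, an assumption of this sketch
for p in (10, 100, 500):
    Sigma = equicorr(p)
    lam_max = np.linalg.eigvalsh(Sigma)[-1]       # largest eigenvalue (ascending order)
    row_sum = np.sum(np.abs(Sigma[0]) ** tau)     # sum_k |sigma_{1k}|^tau
    # Compare with the closed forms 0.85 p + 0.15 and 1 + 0.85^tau (p - 1).
    print(p, lam_max, 0.85 * p + 0.15, row_sum, 1 + 0.85 ** tau * (p - 1))
```

Both computed quantities grow linearly in `p`, matching the closed forms and illustrating why no constant bound is available.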

Due to space limitations, we relegate the other settings of $\boldsymbol{\Psi}$ and some additional simulation studies for independence testing to Section D in Appendix, which includes:

1. We compare empirical powers when the $\boldsymbol{\Psi}$ is a block diagonal matrix and demonstrate the superiority of the proposed method.

2. To empirically verify the result in Theorem 2.5, we consider the case of extremely sparse $\boldsymbol{\Psi}$ where $\psi_{12}=\psi_{21}=\kappa\sqrt{\frac{\log n}{p}}$ and all the other off-diagonal elements are zeros. The experimental results show that for different types of ${\boldsymbol{\Sigma}}$ , the empirical powers all become 100% as $\kappa$ increases, which demonstrates that our test statistic can successfully reject the null even when the $\boldsymbol{\Psi}$ is extremely sparse.

3. To provide more intuitive comparisons between different methods for estimating $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ , we directly show the relative estimation error under different settings. This experiment demonstrates that the proposed thresholding estimator greatly outperforms its competitors when the correlation among samples is large.

4.2 Large-scale Multiple Testing of Correlations

In this section, we conduct both simulated and real data analysis to demonstrate performance of the proposed “sandwich” estimator in (32) for large-scale multiple testing of correlations in (24).

4.2.1 FDP and Power of Simulated Results

In the simulation study, we compare the BH procedure based on four different estimators of $\rho_{ij}$ :

1. The proposed sandwich estimator $\hat{T}_{ij}(\lambda_{n,p})=\sqrt{n}\hat{\rho}_{ij,Y}$ in (32), where $\boldsymbol{\Psi}^{-1}$ is estimated by CLIME (Cai, Liu and Luo, 2011). Further, we adopt the data-driven approach in Liu (2013) to tune the $\lambda_{n,p}$ in CLIME (see (30)). In particular, the parameter $\lambda_{n,p}$ is selected by,

[TABLE]

2. The classical sample correlation estimator $\sqrt{n}\hat{\rho}_{ij}$ based on sample independence assumption.

3. The variance corrected sample correlation estimator $\frac{\sqrt{n}\hat{\rho}_{ij}}{\sqrt{B_{n}}}$ , where true $B_{n}=\|\boldsymbol{\Psi}\|_{\text{F}}^{2}/n$ is assumed to be known.

4. The proposed sandwich estimator in (32) with the true $\boldsymbol{\Psi}^{-1}$ , which serves as an oracle benchmark.

In Table 3, we report the averaged FDP and power over 100 replications. The matrix ${\boldsymbol{\Sigma}}$ is chosen to be either banded or block diagonal matrix, both of which are sparse. As we can see from Table 3, the FDPs of the BH procedure based on sandwich estimator are below $\alpha=0.05$ . The empirical powers get close to one as the sample size $n$ increases and are only slightly worse than the powers of the oracle benchmark with true $\boldsymbol{\Psi}^{-1}$ . For the classical sample correlation estimator $\sqrt{n}\hat{\rho}_{ij}$ , the FDP can be very large (e.g., around 50% when $\psi_{ij}=0.5^{|i-j|}$ and more than 95% when $\psi_{ij}=0.8^{|i-j|}$ ). This verifies our result showing that naïvely using the sample correlation estimator developed under the sample independence assumption will lead to many false positives. Using the variance corrected sample correlation estimator $\frac{\sqrt{n}\hat{\rho}_{ij}}{\sqrt{B_{n}}}$ will help reduce the number of false positives and control FDP as shown in Table 3, which is consistent with our result in Proposition 3.1. However, as we observe from Table 3, even when the true $B_{n}$ is used, the powers of $\frac{\sqrt{n}\hat{\rho}_{ij}}{\sqrt{B_{n}}}$ are quite low, especially when the correlation among samples becomes stronger. The reason for this low power is explained in the paragraph below Proposition 3.1.
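For a single replication, the empirical FDP and power could be computed from the set of rejected pairs as in the sketch below. The function name `fdp_and_power` is hypothetical, and the definitions used (false rejections divided by the number of rejections, with the denominator floored at one, and rejected true alternatives divided by $h_{1}$ ) are the usual conventions, assumed here rather than taken from the simulation code of the paper.

```python
def fdp_and_power(rejected, true_alternatives):
    """Empirical FDP and power for one replication.

    rejected: iterable of index pairs (i, j) whose null was rejected.
    true_alternatives: iterable of index pairs (i, j) with rho_ij != 0.
    """
    rejected = set(rejected)
    alternatives = set(true_alternatives)
    false_rejections = len(rejected - alternatives)
    fdp = false_rejections / max(len(rejected), 1)
    # Power is undefined when there are no true alternatives.
    power = len(rejected & alternatives) / len(alternatives) if alternatives else float("nan")
    return fdp, power
```

Averaging the two returned values over replications would give summaries of the kind reported in the tables.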

In Table 4, we consider the setting when the samples are i.i.d., in which case the classical sample correlation estimator should be used as it is based on sample independence assumption. We also note that when samples are i.i.d., both the variance corrected sample correlation estimator $\frac{\sqrt{n}\hat{\rho}_{ij}}{\sqrt{B_{n}}}$ ( $B_{n}=1$ ) and the sandwich estimator with true $\boldsymbol{\Psi}^{-1}=\mathbf{I}_{n\times n}$ reduce to the classical sample correlation estimator. The power when using the sandwich estimator with the estimated $\boldsymbol{\Psi}^{-1}$ by CLIME is quite close to the power when using the benchmark sample correlation estimator (Table 4), which demonstrates the robustness of the proposed method.

We also conducted real experiments on correlation tests for yeast genomics data and stock data, which are detailed in Section D.5 in Appendix.

5 Discussion

This paper studies the sample/column independence test and multiple testing of Pearson’s correlation coefficients in a high-dimensional setting. The main difficulty in column independence test arises from the correlation among different variables, which is characterized by the covariance matrix ${\boldsymbol{\Sigma}}$ . If ${\boldsymbol{\Sigma}}$ is known, the data matrix can be transformed as ${\boldsymbol{\Sigma}}^{-1/2}\textbf{X}\sim N({\boldsymbol{\Sigma}}^{-1/2}\boldsymbol{\mu}\textbf{1}^{\prime},\textbf{I}_{p\times p}\otimes\boldsymbol{\Psi})$ , based on which the independence test can be directly carried out using existing approaches (e.g., Jiang (2004); Liu, Lin and Shao (2008)). However, the covariance matrix ${\boldsymbol{\Sigma}}$ is unknown. Although the problem of estimating ${\boldsymbol{\Sigma}}^{-1}$ has been well studied, the optimal convergence rate in matrix $\ell_{1}$ -norm is known to be $O(s_{p}\|{\boldsymbol{\Sigma}}^{-1}\|_{l_{1}}\sqrt{(\log p)/n})$ , where $s_{p}$ is the row sparsity level of ${\boldsymbol{\Sigma}}^{-1}$ (see, e.g., Cai, Liu and Zhou (2015)). However, such a rate is not fast enough for establishing a limiting null distribution of the test statistic based on the estimated ${\boldsymbol{\Sigma}}^{-1}$ . In particular, from the proof of Theorem 2.4, when using max-type test statistics, to eliminate the effect of the estimation error from ${\boldsymbol{\Sigma}}^{-1}$ and establish a limiting null distribution, the convergence rate needs to be $o_{\mathbb{P}}(1/\sqrt{p\log n})$ . As $p$ can be $\exp(o(n^{\gamma}))$ for some $\gamma>0$ in an ultra high-dimensional setting, one cannot solve the independence test problem in (1) by simply plugging in the estimator of ${\boldsymbol{\Sigma}}^{-1}$ . 
On the other hand, when using the row sample correlation matrix $(\hat{\psi}_{ij})$ by treating each row of $\boldsymbol{X}$ as a sample, we only need to estimate $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ instead of ${\boldsymbol{\Sigma}}^{-1}$ . The problem of estimating $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ from correlated samples has been successfully addressed in Section 2.2. We would also like to note that in the multiple testing problem of Pearson’s correlation coefficients, such a difficulty no longer exists. In fact, when estimating $\boldsymbol{\Psi}^{-1}$ from row samples of $\boldsymbol{X}$ , the roles of $n$ and $p$ are interchanged (i.e., the sample size becomes $p$ and the dimensionality becomes $n$ ) and, thus, the estimation problem is conducted in a relatively lower dimensional setting.

Appendix

A Proof of the results in Section 2 for sample independence test

Before the proof, we first provide representations of $\hat{\psi}_{ij}$ and $\hat{\sigma}_{ij}$ that will be used throughout our proof. Let us transform each sample by defining $\boldsymbol{Z}_{i}={\boldsymbol{\Sigma}}^{-1/2}(\boldsymbol{X}_{i}-\boldsymbol{\mu})=:(Z_{i1},\ldots,Z_{ip})^{\prime}$ and $\bar{\boldsymbol{Z}}=\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{Z}_{i}$ . Then, $\hat{\psi}_{ij}$ in (5) can be written as

[TABLE]

By the property of matrix-variate normal distributions (Gupta and Nagar, 1999), we have $(\boldsymbol{Z}_{1},\ldots,\boldsymbol{Z}_{n})={\boldsymbol{\Sigma}}^{-1/2}(\textbf{X}-\boldsymbol{\mu}\textbf{1}^{\prime})\sim N(0,\textbf{I}_{p\times p}\otimes\boldsymbol{\Psi})$ . Let ${\boldsymbol{\Sigma}}=\mathbf{U}^{\prime}\mathbf{D}\mathbf{U}$ , where $\mathbf{U}$ is an orthogonal matrix and the diagonal matrix $\mathbf{D}=diag(\lambda_{1},\ldots,\lambda_{p})$ , where $\lambda_{1},\ldots,\lambda_{p}$ are the eigenvalues of ${\boldsymbol{\Sigma}}$ . So we have $(\boldsymbol{Z}_{i}-\bar{\boldsymbol{Z}})^{\prime}{\boldsymbol{\Sigma}}(\boldsymbol{Z}_{j}-\bar{\boldsymbol{Z}})=(\mathbf{U}(\boldsymbol{Z}_{i}-\bar{\boldsymbol{Z}}))^{\prime}\mathbf{D}(\mathbf{U}(\boldsymbol{Z}_{j}-\bar{\boldsymbol{Z}}))$ . Since $\mathbf{U}$ is an orthogonal matrix,

[TABLE]

Let us denote the $i$ -th column of $\mathbf{U}(\boldsymbol{Z}_{1},\ldots,\boldsymbol{Z}_{n})$ by $(\eta_{i1},\ldots,\eta_{ip})^{\prime}=\mathbf{U}\boldsymbol{Z}_{i}$ and let $\bar{\eta}_{k}=\frac{1}{n}\sum_{i=1}^{n}\eta_{ik}$ . We have

[TABLE]

Note that from (38), it is easy to see that rows of $\mathbf{U}(\boldsymbol{Z}_{1},\ldots,\boldsymbol{Z}_{n})$ , i.e., $(\eta_{1k},\ldots,\eta_{nk})$ , for $1\leq k\leq p$ , are i.i.d. $N(0,\boldsymbol{\Psi})$ random vectors. Therefore, we have

[TABLE]

Following the representation of $\hat{\psi}_{ij}$ in (39), we rewrite $\hat{\sigma}_{ij}$ in a more explicit form as follows.

[TABLE]

where the first term in (41) can be written as,

[TABLE]

Here, $(\xi_{k1},\ldots,\xi_{kp})$ , $1\leq k\leq n$ are independent $N(0,{\boldsymbol{\Sigma}})$ random vectors.

We also provide a few simple implications of conditions (C1) and (C2), which will be used throughout the proof. By (C1), there exists some constant $c>0$ such that $c^{-1}\leq\frac{tr({\boldsymbol{\Sigma}})}{p}=\frac{1}{p}\sum_{i=1}^{p}\lambda_{i}\leq c$ , $c^{-2}\leq\frac{\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}}{p}=\frac{1}{p}\sum_{i=1}^{p}\lambda_{i}^{2}\leq c^{2}$ , $c^{-2}\leq\frac{\|\boldsymbol{\Psi}\|_{\text{F}}^{2}}{n}=\frac{1}{n}\sum_{k=1}^{n}\nu_{k}^{2}\leq c^{2}.$ We also note that $\frac{tr(\boldsymbol{\Psi})}{n}=1$ since $\psi_{ii}=1$ for $1\leq i\leq n$ . The condition (C2) provides us an upper bound on the absolute value of the sum of each row of $\boldsymbol{\Psi}$ . In fact, by Hölder’s inequality, we have

[TABLE]

A.1 Proof of Theorem 2.1

To prove Theorem 2.1, we first introduce two technical lemmas. Their proofs are quite complicated with a lot of careful calculations and technical details. Therefore, we relegate their proofs to Section B.

Lemma A.1.

We have for any $\varepsilon>0$ ,

[TABLE]

uniformly in $x\in[0,o(n^{\frac{1}{2}\wedge(\frac{1}{\tau}-\frac{1}{2})}))$ , where $C$ does not depend on $i,j$ .

Lemma A.2.

Let $\hat{\boldsymbol{\Upsilon}}=(\hat{\psi}_{ij})_{1\leq i,j\leq n}$ , $\hat{\gamma}_{n}=\|\hat{\boldsymbol{\Upsilon}}\|^{2}_{\mathrm{F}}-\frac{1}{p}(tr(\hat{\boldsymbol{\Upsilon}}))^{2}$ and $\gamma_{n}=\left(\frac{tr({\boldsymbol{\Sigma}})}{p}\right)^{2}\|{\boldsymbol{\Psi}}\|^{2}_{\mathrm{F}}$ . We have

[TABLE]

where $\{a_{n}\}$ are real numbers satisfying $1\leq a_{n}\leq 1+c_{1}n/p$ for some constant $c_{1}>0$ , $\{d_{n},f_{n}\}$ are random variables satisfying

[TABLE]

where $M>0$ can be arbitrarily large and $C$ depends on $M$ .

Using Markov’s inequality and Lemma A.1, Lemma A.2 further implies that for any $\epsilon>0$ ,

[TABLE]

Recall that $p\geq cn$ . Let $\hat{\varrho}_{ij}=\frac{\hat{\rho}_{ij}}{1-\hat{\rho}^{2}_{ij}}$ and $\lambda=\delta\sqrt{\frac{\hat{B}_{n}\log p}{n}}$ . Recall that

[TABLE]

We first analyze the term $\sum_{1\leq i\neq j\leq p}\sigma^{2}_{ij}I\{|\hat{\varrho}_{ij}|<\lambda\}$ . By Lemma A.1 with $x=C_{1}\sqrt{\log p}$ , we have for any large $M>0$ , there exists some $C_{1}>0$ such that

[TABLE]

uniformly in $1\leq i,j\leq p$ . This inequality, together with (45), implies that there exists some constant $C>0$ such that

[TABLE]

We also note that when $|\hat{\varrho}_{ij}|\leq\lambda$ , we have $|\hat{\rho}_{ij}|\leq\lambda$ , i.e., $|\hat{\sigma}_{ij}|\leq\lambda\sqrt{\hat{\sigma}_{ii}\hat{\sigma}_{jj}}$ , which by (47) implies that, for some $C>0$ ,

[TABLE]

By (47) and (48), with probability greater than $1-O(\frac{1}{np}+n^{-M})$ ,

[TABLE]

where the last inequality is due to the condition (C2).

Let $\epsilon>0$ be a sufficiently small number and

[TABLE]

We first analyze the term $I_{2}$ . For $|\sigma_{ij}|<\epsilon\lambda$ and using (47), we have $|\hat{\sigma}_{ij}|\leq(C+\epsilon)\lambda$ and thus $|\hat{\rho}_{ij}|\leq\sqrt{\epsilon}$ with probability greater than $1-O\Big{(}\frac{1}{np}+n^{-M}\Big{)}$ . Therefore, we have $|\hat{\varrho}_{ij}|\geq\lambda$ implies that $|\hat{\rho}_{ij}|\geq(1-\hat{\rho}_{ij}^{2})\lambda\geq(1-\epsilon)\lambda$ . Thus, we have with probability greater than $1-O\Big{(}\frac{1}{np}+n^{-M}\Big{)}$ ,

[TABLE]

for some $c>0$ , uniformly in $1\leq i,j\leq p$ . Now, by (45) and (46), we have with probability greater than $1-O\Big{(}\frac{1}{np}+n^{-M}\Big{)}$ ,

[TABLE]

Since $\delta>\sqrt{2}$ , by Lemma A.1 with $x=\delta(1-2c\epsilon)\sqrt{\log p}$ , we can let $\epsilon$ be sufficiently small such that for some $\epsilon_{1}>0$ ,

[TABLE]

This, together with Markov inequality, yields that

[TABLE]

Combining the above inequalities, we obtain that $I_{2}=O_{\mathbb{P}}(p^{1-\epsilon_{1}}\sqrt{\frac{\log p}{n}})$ . For the term $I_{1}$ , by (46) we have with probability greater than $1-O(p^{-M})$ ,

[TABLE]

It implies that $|\sum_{i=1}^{p}\sum_{j=1}^{p}\hat{\sigma}^{2}_{ij,thr}-\sum_{i=1}^{p}\sum_{j=1}^{p}\sigma^{2}_{ij}|=O_{\mathbb{P}}(\lambda^{(2-\tau)\wedge 1}p+p^{1-\epsilon_{1}}\lambda)$ and hence

[TABLE]

By (47), we have $\max_{1\leq i\leq j\leq p}|\hat{\sigma}_{ij}-\sigma_{ij}|=O_{\mathbb{P}}(\lambda)$ , which implies that $\mathrm{tr}(\hat{{\boldsymbol{\Sigma}}}_{thr})/\mathrm{tr}({\boldsymbol{\Sigma}})=1+O_{\mathbb{P}}(\lambda)$ . Combining this and (49), the proof is completed. ∎

A.2 Proof of Proposition 2.2

Take $\boldsymbol{\Psi}=(\psi_{ij})$ with $\psi_{ij}=\rho^{|j-i|}$ . We first prove that, for $0<\nu<1/2$ ,

[TABLE]

By ${\boldsymbol{\Sigma}}=\mathbf{I}_{p\times p}$ , we can see that $\hat{\sigma}_{ij}$ and $\hat{\sigma}_{kl}$ are independent for distinct $\{i,j,k,l\}$ . Note that $\hat{\sigma}_{ij}=\tilde{\sigma}_{ij}-\frac{n}{n-1}\bar{X}_{i}\bar{X}_{j}$ . It is easy to show that $\max_{1\leq i\neq j\leq p}\mathbb{E}\tilde{\sigma}^{4}_{ij}=O(n^{-2})$ and $\max_{1\leq i\neq j\leq p}\mathbb{E}(\bar{X}_{i}\bar{X}_{j})^{4}=O(n^{-4})$ . This yields that $\mathbb{E}\hat{\sigma}^{4}_{ij,1}=O(n^{-2})$ and $\mathbb{E}\hat{\sigma}^{2}_{ij,1}\hat{\sigma}^{2}_{ik,1}=O(n^{-2})$ . So we have

[TABLE]

This proves (50). We have from (66)

[TABLE]

By Cramér type large deviation results for independent random variables (Statulevičius (1966)), we have

[TABLE]

uniformly in $1\leq i\neq j\leq p$ . This shows that $\mathbb{E}\hat{\sigma}^{2}_{ij,1}\geq cp^{-\nu/2}/n$ and

[TABLE]

So we have $\frac{1}{p}\|\hat{{\boldsymbol{\Sigma}}}\|^{2}_{\text{F}}\geq cp^{1-\nu}/n$ and $\tilde{A}_{p}/A_{p}\geq cp^{1-\nu}/n$ with probability tending to one. ∎

A.3 Proof of Proposition 2.3 and Theorem 2.4

By (39), we can write

[TABLE]

It can be verified that, under $H_{0}$ , $\mathbb{E}(\eta_{ik}\eta_{jk})=0$ and $\mathrm{Var}(\eta_{ik}\eta_{jk})=1$ . Let

[TABLE]

We first show that, under $H_{0}$ ,

[TABLE]

For four different indices $i,j,k,l$ , $\tilde{T}_{ij}$ and $\tilde{T}_{kl}$ are independent. By Theorem 1 in Arratia, Goldstein and Gordon (1989), we have

[TABLE]

where $t_{n}=4\log n-\log\log n+t$ , $\tau_{n}=\frac{n^{2}-n}{2}\mathbb{P}(\tilde{T}_{12}^{2}>t_{n})$ and

[TABLE]

To see this, let $U_{ij}=I\left\{\tilde{T}_{ij}^{2}>t_{n}\right\}$ and $\tau_{n}=\sum_{1\leq i<j\leq n}\mathbb{E}(U_{ij})$ . Theorem 1 in Arratia, Goldstein and Gordon (1989) shows that

[TABLE]

which gives (53).

By Cramér type moderate deviation results (see Theorem 2 in Statulevičius (1966)), we have

[TABLE]

for $\log n=o(p^{1/3})$ . This shows that $\tau_{n}\sim\frac{1}{\sqrt{8\pi}}e^{-t/2}$ and $b_{1n}\leq Cn^{-1}$ . For $b_{2n}$ , we have

[TABLE]

Note that $\mathrm{Var}(\eta_{1k}\eta_{2k}-\eta_{1k}\eta_{3k})=2$ . Again, by Cramér type moderate deviation results, for $\log n=o(p^{1/3})$ ,

[TABLE]

Similarly, $\mathbb{P}\Big{(}|\tilde{T}_{12}+\tilde{T}_{13}|\geq 2\sqrt{t_{n}}\Big{)}=(1+o(1))\frac{\sqrt{\log n}}{2\sqrt{\pi}}n^{-4}e^{-t}.$ Combining these inequalities, we have $b_{2n}\leq Cn^{-1}\sqrt{\log n}$ and (52) is obtained.

Let $\varepsilon_{ij,k}=-(\eta_{ik}+\eta_{jk})\bar{\eta}_{k}+\bar{\eta}^{2}_{k}$ . By (51), we have

[TABLE]

Further, under $H_{0}$ (i.e., $\boldsymbol{\Psi}=\mathbf{I}_{n}$ ), $\mathbb{E}\varepsilon_{ij,k}=-\frac{1}{n}.$ By the Bernstein-type inequality (Proposition 5.16 in Vershynin (2012)), it is easy to see that, for any $M>0$ , there exists some constant $C>0$ such that

[TABLE]

By (52), (54) and (55), we have

[TABLE]

Second, we show that

[TABLE]

We write

[TABLE]

where $\{\varepsilon_{jk}\}$ are i.i.d. $N(0,1)$ variables. The second equation is because under $H_{0}$ , $\mathbb{E}(\eta_{jk}-\bar{\eta}_{k})^{2}=\frac{n-1}{n}$ and $\sum_{k=1}^{p}\sigma_{kk}=\sum_{k=1}^{p}\lambda_{k}$ . The last equation follows from a well known fact that we can write $\sum_{j=1}^{n}(\eta_{jk}-\bar{\eta}_{k})^{2}=\sum_{j=1}^{n-1}\varepsilon^{2}_{jk}$ for some i.i.d. $N(0,1)$ random variables $\{\varepsilon_{jk}\}$ . By Markov’s inequality, (58) is proved. By (56) and (58), we have

[TABLE]

where $T_{ij}$ is the bias corrected statistic in (7). By the standard Bernstein-type tail bound, we have

[TABLE]

where $C$ depends on $M$ . Combining (59) with (55), we have

[TABLE]

This, together with Theorem 2.1, proves Theorem 2.4. ∎

A.4 Proof of Theorem 2.5

Recall that $\hat{\psi}_{ij}=\tilde{\psi}_{ij}+E_{ij}$ from (51), where

[TABLE]

For the second term in (61), it is easy to see that $\mathbb{E}\bar{\eta}_{k}^{2}=\frac{1}{n^{2}}\sum_{1\leq i,j\leq n}\psi_{ij}.$ Note that $\bar{\eta}_{k}/\sqrt{\mathbb{E}\bar{\eta}^{2}_{k}}$ , $1\leq k\leq p$ are i.i.d. $N(0,1)$ variables and thus $\bar{\eta}^{2}_{k}/\mathbb{E}\bar{\eta}^{2}_{k}$ are i.i.d. sub-exponential random variables. By the standard Bernstein-type tail bound (see Proposition 5.16 in Vershynin (2012)), we have for any $M>0$ , there exists $C>0$ such that

[TABLE]

Similarly, we have

[TABLE]

where the expectation $\mathbb{E}[(\eta_{ik}+\eta_{jk})\bar{\eta}_{k}]=\frac{1}{n}\sum_{l=1}^{n}(\psi_{il}+\psi_{jl})$ . The above two inequalities imply that (note that $\psi_{ii}=1$ ), with probability greater than $1-O(n^{-M})$ ,

[TABLE]

uniformly in $1\leq i,j\leq n$ . By (62),

[TABLE]

And

[TABLE]

Therefore, we have

[TABLE]

Recall that $\tilde{\psi}_{ij}=\frac{1}{p}\sum_{k=1}^{p}\lambda_{k}\eta_{ik}\eta_{jk}.$ By central limit theorem and note that $\sqrt{\mathrm{Var}(\tilde{\psi}_{ij})}=\frac{\mathrm{tr}({\boldsymbol{\Sigma}})}{p}\sqrt{\frac{A_{p}}{p}\left(\psi_{ii}\psi_{jj}+\psi_{ij}^{2}\right)}$ , we have

[TABLE]

uniformly in $x\in\mathbb{R}$ . By (20), without loss of generality, we can assume that $d_{12,\boldsymbol{\Psi}}\geq\delta\sqrt{A_{p}(\log n)/p}$ for some $\delta>2$ . By Theorem 2.1 and (60), we have

[TABLE]

By the inequality $\hat{T}_{n,p}\geq\frac{p}{\hat{A}_{p}}\frac{T^{2}_{12}}{\hat{\psi}_{11}\hat{\psi}_{22}}$ , we complete the proof of the theorem.∎

A.5 Proof of Theorems 2.6 and 2.7

Without loss of generality, we assume $\boldsymbol{\mu}=0$ . Let $U$ be an element chosen uniformly at random from the set $\{(i,j):1\leq i<j\leq n\}$ , which is independent of $\textbf{X}=(\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n})$ . Define $\boldsymbol{\Psi}_{U}=(\psi_{ij})_{n\times n}$ , where $\psi_{ij}=\psi_{ji}=a_{n,p}$ for $(i,j)=U$ and $0<a_{n,p}=:a<1$ , $\psi_{ii}=1$ for $1\leq i\leq n$ and $\psi_{ij}=0$ for all other $(i,j)$ . The matrix-variate normal density function of X when $\boldsymbol{\Psi}=\boldsymbol{\Psi}_{U}$ given $U$ is

[TABLE]

where $\textbf{x}\in\mathbb{R}^{p\times n}$ . Similarly, when $\boldsymbol{\Psi}=\mathbf{I}_{n\times n}$ , the density function of X is

[TABLE]

Let $d_{U}(\textbf{X})=f_{U}(\textbf{X})/f_{\mathbf{I}_{n\times n}}(\textbf{X})$ be the likelihood ratio. Write $\mathbb{E}_{U}(\cdot)$ as the expectation on $U$ and $\mathbb{E}_{0}(\cdot)$ as the expectation on X under $\boldsymbol{\Psi}=\mathbf{I}_{n\times n}$ . By the proof of Proposition 1 in Baraud (2002) (see page 594–596), it suffices to show that

[TABLE]

By the equation $tr(AB)=tr(BA)$ for any matrices $A$ and $B$ with proper sizes, we have

[TABLE]

where ${\boldsymbol{\Sigma}}^{-1/2}\textbf{X}=(\boldsymbol{Z}^{\prime}_{1},\ldots,\boldsymbol{Z}^{\prime}_{p})^{\prime}=:\textbf{Z}$ . The row vectors $\boldsymbol{Z}_{i}$ , $1\leq i\leq p$ , of $\textbf{Z}$ are independent $N(0,\mathbf{I}_{n\times n})$ random vectors when $\boldsymbol{\Psi}=\mathbf{I}_{n\times n}$ . Define

[TABLE]

Note that $[\mathbb{E}_{U}[d_{U}(\textbf{X})]]^{2}=\frac{1}{(n^{2}-n)^{2}/4}\sum_{m=1}^{3}\sum_{(i,j,k,l)\in\mathcal{S}_{m}}d_{ij}(\textbf{X})d_{kl}(\textbf{X}),$ where $d_{ij}(\textbf{X})=d_{U}(\textbf{X})$ when $U$ takes the value $(i,j)$ . Moreover, $\boldsymbol{\Psi}^{-1}_{U}=(\gamma_{ij})_{n\times n}$ with $\gamma_{ij}=\gamma_{ji}=-a/(1-a^{2})$ and $\gamma_{ii}=\gamma_{jj}=1/(1-a^{2})$ for $(i,j)=U$ , $\gamma_{ii}=1$ for all other diagonal entries and $\gamma_{ij}=0$ for all other off-diagonal entries. Hence, for $(i,j,k,l)\in\mathcal{S}_{1}$ and $\boldsymbol{\Psi}=\mathbf{I}_{n\times n}$ , $d_{ij}(\textbf{X})$ and $d_{kl}(\textbf{X})$ are independent. For any given $U$ , we have $\mathbb{E}_{0}[d_{U}(\textbf{X})]=1$ . So

[TABLE]

For $(i,j,k,l)\in\mathcal{S}_{2}$ , the product $d_{ij}(\textbf{X})d_{kl}(\textbf{X})$ has the same distribution as either $d_{12}(\textbf{X})d_{13}(\textbf{X})$ or $d_{13}(\textbf{X})d_{23}(\textbf{X})$ . Let $\varphi_{1},\ldots,\varphi_{n}$ be the eigenvalues of the matrix $\boldsymbol{\Psi}^{-1}_{(1,2)}+\boldsymbol{\Psi}^{-1}_{(1,3)}-2\mathbf{I}_{n\times n}$ . Exactly three of the $\varphi_{i}$ are nonzero, namely $\varphi_{1}=a^{2}/(1-a^{2})$ , $\varphi_{2}=(3a/2+\sqrt{2+a^{2}/4})a/(1-a^{2})$ and $\varphi_{3}=(3a/2-\sqrt{2+a^{2}/4})a/(1-a^{2})$ . It is easy to see that $\mathbb{E}\exp(-\varphi_{i}(N(0,1))^{2}/2)=\frac{1}{\sqrt{1+\varphi_{i}}}$ . So we have

[TABLE]

Similarly, we can show that $\mathbb{E}_{0}[d_{13}(\textbf{X})d_{23}(\textbf{X})]=1$ . This shows that

[TABLE]

There are two nonzero eigenvalues, $a/(1-a)$ and $-a/(1+a)$ , of the matrix $\boldsymbol{\Psi}^{-1}_{(1,2)}-\mathbf{I}_{n\times n}$ . For $(i,j,k,l)\in\mathcal{S}_{3}$ , we have $\mathbb{E}_{0}[d^{2}_{ij}(\textbf{X})]=(1-a^{2})^{-p/2}$ . This implies that

[TABLE]

where the last equation is due to the condition (23) in Theorem 2.7. Combining the above inequalities, we obtain (64), which completes the proof of Theorem 2.7. Note that for any $\delta<2$ , $\boldsymbol{\Psi}_{U}\in\mathcal{F}(\delta)$ for $a=c\sqrt{\log n/p}$ with some $c<2$ . Thus, Theorem 2.6 has also been proved. ∎
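The eigenvalue claims used in the proof above can be checked numerically. The following is a minimal sketch, not part of the original argument: the helper `psi_inv`, the dimension `n = 6` and the value `a = 0.3` are illustrative assumptions. It compares the nonzero eigenvalues of $\boldsymbol{\Psi}^{-1}_{(1,2)}+\boldsymbol{\Psi}^{-1}_{(1,3)}-2\mathbf{I}_{n\times n}$ and of $\boldsymbol{\Psi}^{-1}_{(1,2)}-\mathbf{I}_{n\times n}$ with the stated closed forms, and evaluates the per-row factor $(1-a^{2})^{-1}\prod_{i}(1+\varphi_{i})^{-1/2}$ that appears in $\mathbb{E}_{0}[d_{12}(\textbf{X})d_{13}(\textbf{X})]$.

```python
import numpy as np

def psi_inv(n, pair, a):
    """Inverse of Psi_U: identity except for correlation `a` at the 0-based index pair."""
    psi = np.eye(n)
    i, j = pair
    psi[i, j] = psi[j, i] = a
    return np.linalg.inv(psi)

n, a = 6, 0.3          # illustrative values, not taken from the paper
eye = np.eye(n)

# Case S_2: the pairs (1,2) and (1,3) share exactly one index.
m2 = psi_inv(n, (0, 1), a) + psi_inv(n, (0, 2), a) - 2 * eye
phi = np.linalg.eigvalsh(m2)
phi_nonzero = np.sort(phi[np.abs(phi) > 1e-10])

root = np.sqrt(2 + a**2 / 4)
phi_claimed = np.sort(np.array([
    a**2 / (1 - a**2),
    (3 * a / 2 + root) * a / (1 - a**2),
    (3 * a / 2 - root) * a / (1 - a**2),
]))

# Contribution of one row of Z to E_0[d_12 d_13]; it should equal 1.
per_row = np.prod(1 + phi) ** -0.5 / (1 - a**2)

# Case S_3: the single pair (1,2).
m3 = psi_inv(n, (0, 1), a) - eye
gam = np.linalg.eigvalsh(m3)
gam_nonzero = np.sort(gam[np.abs(gam) > 1e-10])
gam_claimed = np.sort(np.array([a / (1 - a), -a / (1 + a)]))

print(phi_nonzero, phi_claimed)
print(per_row)
print(gam_nonzero, gam_claimed)
```

Because every row of $\textbf{Z}$ contributes the same factor, a per-row value of 1 is consistent with $\mathbb{E}_{0}[d_{12}(\textbf{X})d_{13}(\textbf{X})]=1$.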

B Proof of technical lemmas

In this section, we provide the proofs of the technical lemmas (Lemmas A.1 and A.2) in Section A.

B.1 Proof of Lemma A.1

Without loss of generality, we assume that $\boldsymbol{\mu}=0$ . Let $\tilde{\sigma}_{ij}=\frac{1}{n-1}\sum_{k=1}^{n}X_{ki}X_{kj}$ . By (41), we have $\hat{\sigma}_{ij}=\tilde{\sigma}_{ij}-\frac{n}{n-1}\bar{X}_{i}\bar{X}_{j}$ . Since $\mathrm{Cov}(\xi_{ki},\xi_{kj})=\sigma_{ij}$ , we obtain that $\mathrm{Var}(\xi_{ki}\xi_{kj})=\sigma^{2}_{ij}+\sigma_{ii}\sigma_{jj}$ . By classical Cramér type large deviation results for independent random variables (see Theorem 2 in Statulevičius (1966)), we have for any $\varepsilon>0$ ,

[TABLE]

uniformly in $x\in[0,o(\sqrt{n}))$ . For $\bar{X}_{i}$ , we have $\mathrm{Var}(\bar{X}_{i})=\frac{\sum_{1\leq k,l\leq n}\psi_{kl}\sigma_{ii}}{n^{2}}$ . By (43), $\mathrm{Var}(\bar{X}_{i})\leq Cn^{-1+0\vee(1-1/\tau)}$ , uniformly in $1\leq i\leq p$ . By the tail probability for normal distributions, we have

[TABLE]

for any $x>0$ . So

[TABLE]

for any $x>0$ . We have, uniformly for $x\in[0,o({n^{\frac{1}{2}\wedge(\frac{1}{\tau}-\frac{1}{2})}}))$ , $x^{2}\sqrt{\mathrm{Var}(\bar{X}_{i})\mathrm{Var}(\bar{X}_{j})}=o(x/\sqrt{n})$ . So for any $\delta>0$ and large $n$ ,

[TABLE]

uniformly for $x\in[0,o({n^{\frac{1}{2}\wedge(\frac{1}{\tau}-\frac{1}{2})}}))$ . By noticing that $\sum_{k=1}^{n}\nu_{k}=tr(\boldsymbol{\Psi})=n$ , the lemma follows from (65) and (66).

B.2 Proof of Lemma A.2

Recall the decomposition of $\hat{\psi}_{ij}$ in (51):

[TABLE]

This implies that,

[TABLE]

Therefore, by letting

[TABLE]

it is easy to verify that $f_{n}$ , $d_{n}$ and $a_{n}$ will make the equation (44) true. In the following, we prove that $a_{n},d_{n},f_{n}$ satisfy the properties in the lemma.

We first deal with the term $f_{n}$ . Recall the equation (62), where we show that with probability greater than $1-O(n^{-M})$ ,

[TABLE]

uniformly in $1\leq i,j\leq n$ . By the fact that $\frac{tr({\boldsymbol{\Sigma}})}{p}\leq\max_{k=1}^{p}\lambda_{k}\leq c$ ,

[TABLE]

By (43), we have

[TABLE]

with probability greater than $1-O(n^{-M})$ . For the second term in $f_{n}$ , note that $\tilde{\psi}_{ij}=\frac{1}{p}\sum_{k=1}^{p}\lambda_{k}\eta_{ik}\eta_{jk}$ is a normalized sum of independent sub-exponential random variables with $\mathbb{E}\left(\tilde{\psi}_{ij}\right)=\frac{\mathrm{tr}({\boldsymbol{\Sigma}})}{p}\psi_{ij}$ . By the concentration of $\tilde{\psi}_{ij}$ in (59), which follows from the standard Bernstein-type tail bound, we have

[TABLE]

with probability larger than $1-O(n^{-M})$ . To bound the second term in $f_{n}$ (i.e., $\sum_{i=1}^{n}\sum_{j=1}^{n}E_{ij}\tilde{\psi}_{ij}$ ), we bound $\sum_{1\leq i,j\leq n}|E_{ij}|$ and $\sum_{1\leq i,j\leq n}|\psi_{ij}||E_{ij}|$ separately as follows. By the Cauchy-Schwarz inequality, we have with probability greater than $1-O(n^{-M})$

[TABLE]

where the last inequality is due to (68), which implies that

[TABLE]

By (43), we have with probability greater than $1-O(n^{-M})$

[TABLE]

which implies that

[TABLE]

By (68), (69), (71) and (74),

[TABLE]

with probability larger than $1-O(n^{-M})$ .

By (59) and noticing that $\mathbb{E}\tilde{\psi}_{ii}=\frac{tr({\boldsymbol{\Sigma}})}{p}\leq c$ , we have for some $C>0$

[TABLE]

Also, by (72), $\frac{1}{\sqrt{p}}\sum_{i=1}^{n}E_{ii}=O(n^{-\min(\frac{1}{2},\frac{1}{\tau}-\frac{1}{2})}\sqrt{\log n})$ with probability larger than $1-O(n^{-M})$ . This implies that

[TABLE]

holds with probability greater than $1-O(n^{-M})$ . By (75) and (76), we prove $f_{n}$ satisfies the inequality in the lemma.

We next deal with $a_{n}$ . By the definition of $\tilde{\psi}_{ij}$ ,

[TABLE]

Therefore $\sum_{1\leq i,j\leq n}\mathbb{E}\tilde{\psi}^{2}_{ij}=\Big{[}\Big{(}\frac{tr({\boldsymbol{\Sigma}})}{p}\Big{)}^{2}+\frac{tr({\boldsymbol{\Sigma}}^{2})}{p^{2}}\Big{]}tr(\boldsymbol{\Psi}^{2})+\frac{tr({\boldsymbol{\Sigma}}^{2})}{p^{2}}(tr(\boldsymbol{\Psi}))^{2}$ . Moreover,

[TABLE]

where the last equation is because

[TABLE]

So we have

[TABLE]

Due to the fact that $\frac{\gamma_{n}}{n}=\left(\frac{tr({\boldsymbol{\Sigma}})}{p}\right)^{2}\frac{\|{\boldsymbol{\Psi}}\|^{2}_{\mathrm{F}}}{n}\geq\left(\min_{1\leq j\leq p}\lambda^{2}_{j}\right)\left(\min_{1\leq i\leq n}\nu^{2}_{i}\right)>0$ and $tr({\boldsymbol{\Sigma}}^{2})/p\leq\max_{1\leq i\leq p}\lambda^{2}_{i}$ , we can obtain that $a_{n}\leq 1+c_{1}n/p$ for some constant $c_{1}$ . This proves that $a_{n}$ satisfies the inequality in the lemma.

It remains to calculate $d_{n}$ . Recall that $(\eta_{1k},\ldots,\eta_{nk})$ , $1\leq k\leq p$ , are independent $N(0,\boldsymbol{\Psi})$ random vectors. As in the proof of (39), we can write

[TABLE]

where $\{\varepsilon_{ik},1\leq i\leq n,1\leq k\leq p\}$ are some i.i.d. $N(0,1)$ random variables. By the definition of $\tilde{\psi}_{ij}$ , we have the following equations:

[TABLE]

Then

[TABLE]

Due to the symmetry between the indices $(k_{1},l_{1})$ and $(k_{2},l_{2})$ , we only need to consider seven cases for the indices in the above sums: (1) all $k_{1},k_{2},l_{1},l_{2}$ are different; (2) $k_{1}=k_{2}$ , $l_{1}\neq l_{2}$ and $k_{1}=k_{2}\neq l_{1},l_{2}$ ; (3) $k_{1}=k_{2}$ , $l_{1}=l_{2}$ and $k_{1}\neq l_{1}$ ; (4) $k_{1}=k_{2}=l_{1}\neq l_{2}$ ; (5) $k_{1}=k_{2}=l_{1}=l_{2}$ ; (6) $k_{1}\neq k_{2}$ , $l_{1}\neq l_{2}$ , $k_{1}=l_{1}$ and $k_{2}=l_{2}$ ; (7) $k_{1}\neq k_{2}$ , $l_{1}\neq l_{2}$ , $k_{1}=l_{1}$ , $k_{2}\neq l_{2}$ and $k_{1}\neq l_{2}$ . For (1), we have $\mathbb{E}[S_{k_{1}l_{1}}S_{k_{2}l_{2}}]=0$ . For case (2) we have

[TABLE]

and

[TABLE]

This shows that

[TABLE]

For case (3), we have

[TABLE]

where the first equation follows from the observation that, given $\{\varepsilon_{ik_{1}}\}$ , $\sum_{i=1}^{n}\nu_{i}\varepsilon_{ik_{1}}\varepsilon_{il_{1}}$ is normally distributed with mean zero and variance $\sum_{i=1}^{n}\nu^{2}_{i}\varepsilon^{2}_{ik_{1}}$ , and $\mathbb{E}(N(0,\sigma^{2}))^{4}=3\sigma^{4}$ . Therefore

[TABLE]

For case (4), we have

[TABLE]

and

[TABLE]

Therefore, $|\mathbb{E}S_{k_{1}l_{1}}S_{k_{1}l_{2}}|\leq 2tr(\boldsymbol{\Psi}^{3})tr(\boldsymbol{\Psi})$ and

[TABLE]

Note that

[TABLE]

So for case (5), we have

[TABLE]

for some universal constant $C$ . This implies that

[TABLE]

For case (6), we have

[TABLE]

So $\mathbb{E}[S_{k_{1}l_{1}}S_{k_{2}l_{2}}]=0$ . Similarly, for case (7), we have $\mathbb{E}[S_{k_{1}l_{1}}S_{k_{2}l_{2}}]=0$ . Combining (77)-(82), we have $\mathrm{Var}\Big{(}\sum_{i=1}^{n}\sum_{j=1}^{n}\tilde{\psi}^{2}_{ij}\Big{)}=O(n/p)$ .

Now we calculate $\mathrm{Var}((\sum_{i=1}^{n}\tilde{\psi}_{ii})^{2})$ . We have

[TABLE]

Hence

[TABLE]

For case (1), $\mathbb{E}[U_{k_{1}l_{1}}U_{k_{2}l_{2}}]-\mathbb{E}[U_{k_{1}l_{1}}]\mathbb{E}[U_{k_{2}l_{2}}]=0$ . Since $\mathbb{E}\Big{(}\sum_{i=1}^{n}\nu_{i}\varepsilon^{2}_{ik}\Big{)}^{2}=2tr(\boldsymbol{\Psi}^{2})+(tr(\boldsymbol{\Psi}))^{2}$ , for case (2) we have

[TABLE]

For case (3), we have

[TABLE]

Note that

[TABLE]

For case (4), we have

[TABLE]

For case (5), we have

[TABLE]

For cases (6) and (7), $\mathbb{E}[U_{k_{1}l_{1}}U_{k_{2}l_{2}}]-\mathbb{E}[U_{k_{1}l_{1}}]\mathbb{E}[U_{k_{2}l_{2}}]=0$ . So $\mathrm{Var}((\sum_{i=1}^{n}\tilde{\psi}_{ii})^{2})=O(n^{2}/p)$ .

As $n^{2}\mathbb{E}(d_{n}^{2})\leq 2\mathrm{Var}\Big{(}\sum_{i=1}^{n}\sum_{j=1}^{n}\tilde{\psi}^{2}_{ij}\Big{)}+\frac{2}{p^{2}}\mathrm{Var}((\sum_{i=1}^{n}\tilde{\psi}_{ii})^{2})$ and $p\geq cn$ for some $c>0$ , we see that $\mathbb{E}(d_{n}^{2})=O(1/(np))$ .
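The second-moment identity $\mathbb{E}\big(\sum_{i=1}^{n}\nu_{i}\varepsilon^{2}_{ik}\big)^{2}=2tr(\boldsymbol{\Psi}^{2})+(tr(\boldsymbol{\Psi}))^{2}$ used for the $U$ terms above follows from a direct expansion. The short derivation below is a worked check rather than part of the original proof. It assumes, as the identity itself suggests, that $\nu_{1},\ldots,\nu_{n}$ are the eigenvalues of $\boldsymbol{\Psi}$ , so that $tr(\boldsymbol{\Psi})=\sum_{i}\nu_{i}$ and $tr(\boldsymbol{\Psi}^{2})=\sum_{i}\nu_{i}^{2}$ , and it uses $\mathbb{E}\varepsilon^{2}=1$ and $\mathbb{E}\varepsilon^{4}=3$ for a standard normal $\varepsilon$ .

```latex
\begin{align*}
\mathbb{E}\Big(\sum_{i=1}^{n}\nu_{i}\varepsilon_{ik}^{2}\Big)^{2}
 &= \sum_{i=1}^{n}\nu_{i}^{2}\,\mathbb{E}\varepsilon_{ik}^{4}
    + \sum_{i\neq j}\nu_{i}\nu_{j}\,\mathbb{E}\varepsilon_{ik}^{2}\,\mathbb{E}\varepsilon_{jk}^{2} \\
 &= 3\sum_{i=1}^{n}\nu_{i}^{2}
    + \Big(\sum_{i=1}^{n}\nu_{i}\Big)^{2} - \sum_{i=1}^{n}\nu_{i}^{2} \\
 &= 2\,tr(\boldsymbol{\Psi}^{2}) + \big(tr(\boldsymbol{\Psi})\big)^{2}.
\end{align*}
```

The same fourth-moment fact, $\mathbb{E}(N(0,\sigma^{2}))^{4}=3\sigma^{4}$ , is the one invoked in case (3) of the $S$ terms.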

C Proof of results from Section 3

C.1 Proof of Proposition 3.1

Without loss of generality, we assume that $\boldsymbol{\mu}=0$ and $\sigma_{ii}=1$ for $1\leq i\leq p$ . Thus, $\rho_{ij}=\sigma_{ij}$ for all $i,j$ . Define

[TABLE]

where $\tilde{\sigma}_{ij}=\frac{1}{n-1}\sum_{k=1}^{n}X_{ki}X_{kj}$ . We first show that

[TABLE]

Write

[TABLE]

We have $|\Pi_{ij,2}+\Pi_{ij,3}|=O_{\mathbb{P}}(1/n)$ . For $\Pi_{ij,1}$ , by (42),

[TABLE]

Since $(\xi_{k1},\ldots,\xi_{kp})$ , $1\leq k\leq n$ are independent $N(0,{\boldsymbol{\Sigma}})$ random vectors, it is easy to check that $\mathrm{Var}\left(\xi_{ki}\xi_{kj}-\sigma_{ij}-\frac{1}{2}(\xi^{2}_{ki}+\xi^{2}_{kj}-2)\sigma_{ij}\right)=(1-\rho^{2}_{ij})^{2}$ . So we have

[TABLE]

This proves (83). We have by (41),

[TABLE]

where the last equation follows from $\mathrm{Var}(\bar{X}_{i})=o(1/\sqrt{n})$ and $\hat{\sigma}_{ii}=\tilde{\sigma}_{ii}-\frac{n}{n-1}\bar{X}^{2}_{i}$ . The proposition is proved.∎
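The variance identity $\mathrm{Var}\left(\xi_{ki}\xi_{kj}-\sigma_{ij}-\frac{1}{2}(\xi^{2}_{ki}+\xi^{2}_{kj}-2)\sigma_{ij}\right)=(1-\rho^{2}_{ij})^{2}$ used in the proof above can be verified in two ways. The sketch below is illustrative only and all function names are hypothetical. It writes the centered variable as a Gaussian quadratic form $x^{\prime}Ax$ plus a constant and applies the standard formula $\mathrm{Var}(x^{\prime}Ax)=2\,tr((A{\boldsymbol{\Sigma}})^{2})$ for symmetric $A$ , then cross-checks with a seeded Monte Carlo estimate under unit variances.

```python
import numpy as np

def quad_form_variance(A, Sigma):
    """Var(x' A x) for x ~ N(0, Sigma) and symmetric A, i.e. 2 * tr((A Sigma)^2)."""
    AS = A @ Sigma
    return 2.0 * np.trace(AS @ AS)

def w_variance_exact(rho):
    """Exact variance of xi_i*xi_j - rho - 0.5*(xi_i^2 + xi_j^2 - 2)*rho."""
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    # Quadratic-form matrix of the non-constant part of the variable.
    A = np.array([[-rho / 2, 0.5], [0.5, -rho / 2]])
    return quad_form_variance(A, Sigma)

def w_variance_mc(rho, size=400_000, seed=0):
    """Monte Carlo estimate of the same variance (seeded for reproducibility)."""
    rng = np.random.default_rng(seed)
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    xi = rng.multivariate_normal(np.zeros(2), Sigma, size=size)
    w = xi[:, 0] * xi[:, 1] - rho - 0.5 * (xi[:, 0] ** 2 + xi[:, 1] ** 2 - 2) * rho
    return w.var()

for rho in (0.0, 0.3, 0.5, 0.8):
    print(rho, w_variance_exact(rho), (1 - rho**2) ** 2, w_variance_mc(rho))
```

In the exact computation, $A{\boldsymbol{\Sigma}}$ has zero diagonal and off-diagonal entries $(1-\rho^{2})/2$ , which is why the variance collapses to $(1-\rho^{2})^{2}$ .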

C.2 Proof of Theorem 3.2

Without loss of generality, we assume that $\boldsymbol{\mu}=0$ . Recall that $\hat{\rho}_{ij,Y}=\hat{\sigma}_{ij,Y}/\sqrt{\hat{\sigma}_{ii,Y}\hat{\sigma}_{jj,Y}}$ , where $\hat{\sigma}_{ij,Y}=\frac{1}{n}(\textbf{X}_{\cdot,i}-\bar{X}_{i}\textbf{1})^{\prime}\hat{\boldsymbol{\Gamma}}(\textbf{X}_{\cdot,j}-\bar{X}_{j}\textbf{1})$ and $\tilde{\rho}_{ij,Y}=\tilde{\sigma}_{ij,Y}/\sqrt{\tilde{\sigma}_{ii,Y}\tilde{\sigma}_{jj,Y}}$ , where $\tilde{\sigma}_{ij,Y}=\frac{1}{n}(\textbf{X}_{\cdot,i})^{\prime}\boldsymbol{\Psi}^{-1}(\textbf{X}_{\cdot,j})$ .

We first show that

[TABLE]

By (67), we have

[TABLE]

where $a_{n}=\frac{1}{n}\max_{1\leq i\leq n}|\sum_{j=1,\neq i}^{n}\psi_{ij}|\leq\frac{N_{n}}{n}$ and $b_{n,p}=\sqrt{\frac{\log n}{p}}$ . By (59), we have

[TABLE]

Combining these bounds implies that

[TABLE]

By the proof of Theorem 6 in Cai, Liu and Luo (2011), we can show that $\|\hat{\boldsymbol{\Gamma}}-\frac{p}{tr({\boldsymbol{\Sigma}})}\boldsymbol{\Psi}^{-1}\|_{l_{1}}=O_{\mathbb{P}}(M^{2-2q}_{n}s_{n}(a_{n}+b_{n,p})^{1-q})$ , where $\|\cdot\|_{l_{1}}$ denotes the matrix $\ell_{1}$ -norm. Due to the tail probability of normal distribution, we have $\max_{1\leq k\leq n}\max_{1\leq i\leq p}|X_{ki}-\bar{X}_{i}|=O_{\mathbb{P}}(\sqrt{\log p})$ . So we have

[TABLE]

Note that $\mathrm{Var}(\bar{X}_{i})=\frac{\sum_{1\leq k,l\leq n}\psi_{kl}\sigma_{ii}}{n^{2}}$ and $\mathrm{Var}(\textbf{1}^{\prime}\boldsymbol{\Psi}^{-1}\textbf{X}_{\cdot,i})=\textbf{1}^{\prime}\boldsymbol{\Psi}^{-1}\textbf{1}\sigma_{ii}=O(n)$ . By the tail probability of normal distribution, we have $\max_{1\leq i\leq p}|\bar{X}_{i}|=O_{\mathbb{P}}\Big{(}\sqrt{\frac{N_{n}}{n}\log p}\Big{)}$ and $\max_{1\leq i\leq p}|\textbf{1}^{\prime}\boldsymbol{\Psi}^{-1}\textbf{X}_{\cdot,i}|=O_{\mathbb{P}}(\sqrt{n\log p})$ . Therefore

[TABLE]

Combining the above arguments, we prove (84). By (65), when $(i,j)\in\mathcal{H}_{0}$ (i.e., $\sigma_{ij}=0$ ), we have $\sqrt{n}\tilde{\sigma}_{ij,Y}=O_{\mathbb{P}}(\sqrt{\log p})$ . Therefore, we have,

[TABLE]

where the last equation is due to $\max_{1\leq i\leq p}|\tilde{\sigma}_{ii}-\sigma_{ii}|=O_{\mathbb{P}}\left(\sqrt{\frac{\log p}{n}}\right)$ .

The remaining proof closely follows the proof of Theorem 3.1 in Liu (2013). Following the notation in Liu (2013), let $G(t)=2-2\Phi(t)$ and

[TABLE]

By the continuity of $G(t)$ and monotonicity of both $G(t)$ and the sum of indicator functions in the denominator in (34), we have

[TABLE]

and Liu (2013) further proved that $\mathbb{P}(0\leq\hat{t}\leq b_{p})\rightarrow 1$ . By (88), we have

[TABLE]

where $h=h_{0}+h_{1}=(p^{2}-p)/2$ , $h_{0}=Card(\mathcal{H}_{0})$ and $h_{1}=Card(\mathcal{H}_{1})$ . To prove that $\frac{\text{FDP}}{\alpha h_{0}/h}\rightarrow 1$ in probability as $(n,p)\rightarrow\infty$ , it suffices to show that

[TABLE]

in probability. By (84) and (86), we only need to show that

[TABLE]

in probability. By Lemma 6.3 in Liu (2013) (taking $U_{ij}$ in Lemma 6.3 as $\frac{\sqrt{n}\tilde{\sigma}_{ij,Y}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$ ) and following the proof of equations (31) and (32) in Liu (2013), the convergence in (89) holds and the proof of Theorem 3.2 is completed. ∎

D Additional Experiments

In this section, we report the additional simulation results and real data experiments.

D.1 Type I error rates when using a Monte-Carlo simulation based critical value

As noted in the main text, Liu, Lin and Shao (2008) observed that the rate of convergence in distribution for max-type test statistics is typically slow. Therefore, when using the critical value $q_{\alpha}+4\log n-\log\log n$ based on the limiting distribution, the testing procedure is conservative in the sense that the empirical size could be smaller than the pre-specified significance level $\alpha$ . To mitigate this problem, one can use a simulated quantile as the critical value for the proposed test statistic $\hat{T}_{n,p}$ . In particular, we generate 10,000 replications of a $p\times n$ data matrix, each randomly drawn from $N(\textbf{0},\mathbf{I}_{p\times p}\otimes\mathbf{I}_{n\times n})$ under the null. We compute the corresponding test statistics $\hat{T}^{(i)}_{n,p}$ , $1\leq i\leq 10000$ , for each randomly generated data matrix. By Theorem 2.4, we have $\mathbb{P}(\hat{T}^{(i)}_{n,p}-4\log n+\log\log n\leq t)\rightarrow\exp\Big{(}-\frac{1}{\sqrt{8\pi}}\exp\Big{(}-\frac{t}{2}\Big{)}\Big{)}$ and hence $\mathbb{P}(\hat{T}^{(i)}_{n,p}\leq t)-\mathbb{P}(\hat{T}_{n,p}\leq t)\rightarrow 0$ uniformly in $i$ and $t\in\mathbb{R}$ . Let $c_{\alpha}$ be the $(1-\alpha)$ -quantile of the empirical distribution $\frac{1}{10000}\sum_{i=1}^{10000}I\{\hat{T}^{(i)}_{n,p}\leq t\}$ . We reject the null whenever the observed test statistic satisfies $\hat{T}_{n,p}\geq c_{\alpha}$ . Comparing Table 5 (simulation-based critical value) with Table 2 in the main text (critical value based on the limiting distribution), one can see that using a simulated quantile as the critical value for $\hat{T}_{n,p}$ brings the empirical size closer to the pre-specified $\alpha$ .
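The two critical values described above can be sketched in code. This is a hedged illustration rather than the authors' implementation. The quantile $q_{\alpha}$ is obtained by inverting the limiting distribution $\exp(-\frac{1}{\sqrt{8\pi}}\exp(-\frac{t}{2}))$ , which gives $q_{\alpha}=-\log(8\pi)-2\log\log(1/(1-\alpha))$ . Since the definition of $\hat{T}_{n,p}$ is given in the main text and not here, the function `standin_statistic` is a hypothetical placeholder (a scaled maximum squared sample correlation between columns), and any callable that maps a $p\times n$ matrix to a scalar can be passed in its place.

```python
import numpy as np

def limiting_cdf(t):
    """Limiting null CDF of T - 4 log n + log log n stated in the text."""
    return np.exp(-np.exp(-t / 2.0) / np.sqrt(8.0 * np.pi))

def q_alpha(alpha):
    """(1 - alpha)-quantile of the limiting CDF, from inverting limiting_cdf."""
    return -np.log(8.0 * np.pi) - 2.0 * np.log(-np.log(1.0 - alpha))

def asymptotic_critical_value(alpha, n):
    """Critical value q_alpha + 4 log n - log log n."""
    return q_alpha(alpha) + 4.0 * np.log(n) - np.log(np.log(n))

def standin_statistic(x):
    """Hypothetical stand-in for T-hat: p * max_{i<j} r_ij^2 over column pairs."""
    p, n = x.shape
    r = np.corrcoef(x, rowvar=False)       # n x n correlations between columns
    upper = np.triu_indices(n, k=1)
    return p * np.max(r[upper] ** 2)

def simulated_critical_value(stat, p, n, alpha, reps=10_000, seed=0):
    """Empirical (1 - alpha)-quantile of `stat` over null draws with i.i.d. N(0,1) entries."""
    rng = np.random.default_rng(seed)
    draws = np.array([stat(rng.standard_normal((p, n))) for _ in range(reps)])
    return np.quantile(draws, 1.0 - alpha)

if __name__ == "__main__":
    alpha, p, n = 0.05, 200, 20            # small illustrative sizes
    print("asymptotic:", asymptotic_critical_value(alpha, n))
    print("simulated :", simulated_critical_value(standin_statistic, p, n, alpha, reps=500))
```

The simulated quantile only needs to be computed once per $(n,p)$ pair, because the null draws do not depend on the observed data.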

D.2 Power comparisons for diagonal block $\boldsymbol{\Psi}$

We compare empirical powers when $\boldsymbol{\Psi}$ is a block-diagonal matrix with block size either 5 (Figure 4(a)–4(c)) or 10 (Figure 4(d)–4(f)). For each block, the off-diagonal elements all take the value $\rho$ , which quantifies the correlation strength among samples and varies from 0.2 to 0.5. As we can see from Figure 4, when the block size is 5, the empirical powers of both CV and our method are always 1 for different settings of ${\boldsymbol{\Sigma}}$ and are much higher than those of Bai and CQ. When the block size is 10, Bai and CQ have no statistical power, while our method still achieves 100% power and outperforms the CV method.
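A block-diagonal $\boldsymbol{\Psi}$ of the kind described above, and a matrix-variate normal draw with that column covariance, could be generated as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, $n$ is taken to be a multiple of the block size, and the sampler uses Cholesky factors so that rows of the data matrix have covariance ${\boldsymbol{\Sigma}}$ and columns have covariance $\boldsymbol{\Psi}$ .

```python
import numpy as np

def block_diag_psi(n, block_size, rho):
    """n x n block-diagonal matrix; each block has unit diagonal and off-diagonal rho."""
    if n % block_size != 0:
        raise ValueError("n must be a multiple of block_size in this sketch")
    block = np.full((block_size, block_size), rho, dtype=float)
    np.fill_diagonal(block, 1.0)
    return np.kron(np.eye(n // block_size), block)

def sample_matrix_normal(sigma, psi, rng):
    """Draw a zero-mean p x n matrix X with Cov(X[i,k], X[j,l]) = sigma[i,j] * psi[k,l]."""
    a = np.linalg.cholesky(sigma)          # sigma = a a'
    b = np.linalg.cholesky(psi)            # psi = b b'
    z = rng.standard_normal((sigma.shape[0], psi.shape[0]))
    return a @ z @ b.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    psi = block_diag_psi(n=50, block_size=5, rho=0.3)
    x = sample_matrix_normal(np.eye(1000), psi, rng)
    print(x.shape, np.linalg.eigvalsh(psi).min())
```

For block size $b$ , each block is positive definite whenever $-1/(b-1)<\rho<1$ , so the range 0.2 to 0.5 used in the experiment is always admissible.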

D.3 Empirical power for “sparsely” correlated samples

We consider the case of an extremely sparse $\boldsymbol{\Psi}$ where $\psi_{12}=\psi_{21}=\kappa\sqrt{\frac{\log n}{p}}$ and all the other off-diagonal elements are zero. We plot the empirical power of the proposed test statistic $\hat{T}_{n,p}$ as a function of the signal strength $\kappa$ in Figure 5 with $n=50$ , $p=1,000$ . As we can see from Figure 5, for different types of ${\boldsymbol{\Sigma}}$ , the empirical powers all reach 100% as $\kappa$ increases, which empirically verifies the result in Theorem 2.5 and shows that our test statistic can successfully reject the null even when $\boldsymbol{\Psi}$ is extremely sparse. In addition, as observed from Figure 5, using a simulated quantile as the critical value leads to slightly improved power compared to using the quantile from the limiting null distribution.

D.4 Comparisons between different estimators for estimating $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$

In this section, we compare by simulation the different approaches for estimating $\|{\boldsymbol{\Sigma}}\|_{\text{F}}^{2}$ .

We fix $p=1000$ , vary $n$ from 50 to 200, and plot the relative estimation error $\frac{|\widehat{\|{\boldsymbol{\Sigma}}\|_{F}^{2}}-\|{\boldsymbol{\Sigma}}\|_{F}^{2}|}{\|{\boldsymbol{\Sigma}}\|_{F}^{2}}$ . In Figure 6(a) to 6(c), we consider the case when samples are i.i.d. As we can see, for a very small sample size $n=50$ , the method by Chen and Qin (2010) performs the best. As the sample size becomes larger, the performance of our method with the estimated $\hat{B}_{n}$ matches the method by Chen and Qin (2010) and is superior to the CV approach. On the other hand, when samples are correlated (see Figure 6(d) to 6(i)), the plugin estimator based on thresholded ${\boldsymbol{\Sigma}}$ outperforms the methods by Bai and Saranadasa (1996) and Chen and Qin (2010). When the sample correlation becomes large (e.g., $\psi_{ij}=0.8^{|i-j|}$ ), our approach greatly outperforms the CV approach.

D.5 Real Experiments on Large-scale Multiple Testing of Correlations

In this section, we conduct real data analysis for the large-scale multiple testing of correlations. The first data set is a yeast genomics data set generated by Brem and Kruglyak (2005), which contains $n=112$ yeast segregants grown from a cross involving BY4716 and wild isolate RM11-1a. We use a set of $p=1207$ genes of the protein-protein interaction network from Steffen et al. (2002). Please refer to Cai et al. (2013) for a more detailed description of the data. The data is standardized with sample mean zero and row sample standard deviation one. In Table 6, we compare the number of rejections (discoveries) of the BH procedure based on the proposed sandwich estimator of $\sqrt{n}\hat{\rho}_{ij,Y}$ and on the sample correlation at different levels of significance. As we can see from Table 6, the number of rejections for the sandwich estimator is much smaller than that for the sample correlation estimator, which suggests that there might be many false positives when using the sample correlation estimator. We also show the obtained correlation graph in Figure 7, where each node corresponds to a gene and node $i$ and node $j$ are connected if $H_{0ij}$ is rejected. From Figure 7, one can see several clusters or small dense subgraphs, which indicates that the genes in each cluster are highly correlated.
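The BH step applied to a collection of test statistics can be sketched as follows. This is the textbook Benjamini-Hochberg step-up procedure on two-sided normal p-values $G(|t|)=2-2\Phi(|t|)$ , offered as an assumption-laden illustration: the procedure in (34) is stated in terms of a threshold $\hat{t}$ and may differ in detail, and the function names and the input array of statistics are hypothetical.

```python
import math
import numpy as np

def two_sided_normal_pvalues(stats):
    """p-value G(|t|) = 2 - 2*Phi(|t|) = erfc(|t| / sqrt(2)) for each statistic."""
    return np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in np.ravel(stats)])

def benjamini_hochberg(pvalues, alpha):
    """Boolean rejection mask of the BH step-up procedure at level alpha."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest k with p_(k) <= alpha * k / m
        reject[order[: k + 1]] = True
    return reject

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical statistics: 95 nulls and 5 strong signals.
    stats = np.concatenate([rng.standard_normal(95), np.full(5, 5.0)])
    mask = benjamini_hochberg(two_sided_normal_pvalues(stats), alpha=0.05)
    print(int(mask.sum()), "rejections")
```

In the setting of the text, the statistics would be the $\sqrt{n}\hat{\rho}_{ij,Y}$ values over all pairs $i<j$ , and the rejected pairs define the edges of the correlation graph.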

The second real data set consists of monthly returns of 258 large stocks from the Standard & Poor's 500 (S&P 500), which are available between January 1990 and December 2012. We first fit the Fama-French three-factor model (Fama and French, 1993):

[TABLE]

where $i$ is the index of the stock and $t$ is the index of the month. At the $t$ -th month, $r_{it}$ is the return for stock $i$ , $r_{ft}$ is the risk-free return rate, MKT, SMB and HML are the market, size and value factors at time $t$ , and $u_{it}$ is the noise. Please refer to Section 5.3 in Fan, Rigollet and Wang (2015) for more details. We investigate the correlation structure among $p=258$ stocks based on the fitted residuals. In Table 7, we present the number of rejections of the BH procedure based on the proposed sandwich estimator of $\sqrt{n}\hat{\rho}_{ij,Y}$ and on the sample correlation. Similar to the case in Table 6 for the yeast data, our estimator leads to fewer discoveries and sparser correlation graphs at all levels of significance. In Figure 7, we plot the correlation graph for non-isolated nodes/stocks (the isolated nodes are omitted for better visualization). We further list the top 5 pairs of most correlated stocks with the largest $|\hat{\rho}_{ij,Y}|$ in Table 8. From Table 8, it is easy to see that the businesses in each of these 5 pairs of stocks are closely related. For example, an important business of BMC Software is to provide solutions to the health care industry, which explains why BMC and McKesson are highly correlated. In fact, the top 5 stocks with the largest degree in the correlation graphs in both Figure 8(a) and 8(b) are BMC Software, Automatic Data Processing, McKesson, Sealed Air Corp and American International Group. All of them have a wide range of businesses and thus are expected to be correlated with many other companies.
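The residual-extraction step described above can be sketched with ordinary least squares. This is an illustration under assumptions: it takes the regression to be the usual three-factor form with a stock-specific intercept, the function name `factor_residuals` is hypothetical, and the demonstration uses synthetic numbers rather than the S&P 500 data.

```python
import numpy as np

def factor_residuals(excess_returns, factors):
    """OLS residuals of each stock's excess return on an intercept plus the factors.

    excess_returns: (T, p) array of r_it - r_ft.
    factors: (T, 3) array with columns MKT, SMB, HML.
    """
    T = factors.shape[0]
    design = np.column_stack([np.ones(T), factors])
    # One least-squares fit per stock, solved jointly for all p columns.
    coef, *_ = np.linalg.lstsq(design, excess_returns, rcond=None)
    return excess_returns - design @ coef

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, p = 276, 5                          # 276 months; p kept small for the demo
    factors = rng.standard_normal((T, 3))
    loadings = rng.standard_normal((3, p))
    y = 0.01 + factors @ loadings + 0.1 * rng.standard_normal((T, p))
    u_hat = factor_residuals(y, factors)
    print(u_hat.shape, np.abs(u_hat.mean(axis=0)).max())
```

The $T\times p$ residual matrix, transposed to $p\times n$ with $n=T$ , plays the role of the data matrix whose columns (months) may be correlated.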

Bibliography

1. Allen, G. I. and Tibshirani, R. (2012). Inference with transposable data: modelling the effects of row and column correlations. J. R. Statist. Soc. B 74 721–734.
2. Anderson, T. W. (2003). An introduction to multivariate statistical analysis, 3rd ed. Wiley.
3. Arratia, R., Goldstein, L. and Gordon, L. (1989). Two moments suffice for Poisson approximations: the Chen-Stein method. Annals of Probability 17 9–25.
4. Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: by an example of a two sample problem. Statistica Sinica 6 311–329.
5. Bai, Z., Jiang, D., Yao, J. and Zheng, S. (2009). Corrections to LRT on large-dimensional covariance matrix by RMT. Annals of Statistics 37 3822–3840.
6. Baraud, Y. (2002). Non-asymptotic minimax rates of testing in signal detection. Bernoulli 8 577–606.
7. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57 289–300.
8. Bien, J. and Tibshirani, R. (2011). Sparse estimation of a covariance matrix. Biometrika 98 807–820.
