Large-scale inference with block structure

Jiyao Kou; Guenther Walther

arXiv:1907.00085·math.ST·May 10, 2022


TL;DR

This paper demonstrates that exploiting block structure in large-scale data significantly enhances the power of statistical inference, enabling detection of weaker signals than traditional methods that ignore such structure.

Contribution

The authors derive the detection boundary for structured signals and develop an adaptive methodology that exploits block structure without penalty when absent.

Findings

01 Detection boundary improves with block structure exploitation.

02 Method achieves optimal adaptive detection in various settings.

03 Detection of weaker signals is possible when structure is utilized.

Abstract

The detection of weak and rare effects in large amounts of data arises in a number of modern data analysis problems. Known results show that in this situation the potential of statistical inference is severely limited by the large-scale multiple testing that is inherent in these problems. Here we show that fundamentally more powerful statistical inference is possible when there is some structure in the signal that can be exploited, e.g. if the signal is clustered in many small blocks, as is the case in some relevant applications. We derive the detection boundary in such a situation where we allow both the number of blocks and the block length to grow polynomially with sample size. We derive these results both for the univariate and the multivariate settings as well as for the problem of detecting clusters in a network. These results recover as special cases the sparse mixture detection…

Equations

$$X_{i}=\mu\,\mathbf{1}_{\{\ell_{1},\ldots,\ell_{m}\}}(i)+Z_{i},\quad i=1,\ldots,n,$$

$$\rho^{*}(\beta)=\begin{cases}\beta-\frac{1}{2}, & \frac{1}{2}<\beta\leq\frac{3}{4},\\ \left(1-\sqrt{1-\beta}\right)^{2}, & \frac{3}{4}<\beta\leq 1\end{cases}$$

$$HC_{n}=\max_{1\leq i\leq\frac{n}{2}}\sqrt{n}\,\frac{\frac{i}{n}-p_{(i)}}{\sqrt{p_{(i)}(1-p_{(i)})}},$$

$$BJ_{n}=\max_{1\leq i\leq\frac{n}{2}}\left(i\log\frac{i}{np_{(i)}}+(n-i)\log\frac{1-\frac{i}{n}}{1-p_{(i)}}\right)\mathbf{1}\left(p_{(i)}<\frac{i}{n}\right),$$

$$\rho^{*}(\beta)=\beta-\frac{1}{2}$$

$$X_{i}=\mu\,\mathbf{1}_{\bigcup_{g=1}^{m}I_{g}}(i)+Z_{i},\quad i=1,\ldots,n$$

$$\mu=\mu(n)=\sqrt{2r\log n}/\sqrt{n^{\alpha}}$$

$$\rho^{*}(\alpha,\beta)=\begin{cases}\beta-(1-\alpha)/2 & \text{if }\beta/(1-\alpha)<\frac{3}{4},\\ \left(\sqrt{1-\alpha}-\sqrt{1-\alpha-\beta}\right)^{2} & \text{if }\beta/(1-\alpha)\geq\frac{3}{4}\end{cases}$$

$$\mu=\mu(n)=n^{r}/\sqrt{n^{\alpha}}$$

$$\rho^{*}(\alpha,\beta)=\beta-\frac{1-\alpha}{2}.$$

$$\mathcal{I}_{\mathrm{app}}(\ell):=\Bigl\{(j,k]\subset(0,n]:\ j,k\in\{id_{\ell},\,i=0,1,\ldots\}\text{ and }2^{\ell-1}<k-j\leq 2^{\ell}\Bigr\}$$

$$sHC_{n}=\max_{\ell=0,\ldots,\ell_{\mathrm{max}}}\sqrt{\frac{n}{2^{\ell}n_{\ell}}}\,HC_{n_{\ell}}(\ell),$$

$$sBJ_{n}=\max_{\ell=0,\ldots,\ell_{\mathrm{max}}}\frac{n}{2^{\ell}n_{\ell}}\,BJ_{n_{\ell}}(\ell),$$

$$n_{\ell}:=\#\mathcal{I}_{\mathrm{app}}(\ell)$$

$$sS_{n}(s)=\max_{\ell=0,\ldots,\ell_{\mathrm{max}}}\frac{n}{2^{\ell}n_{\ell}}\,S_{n_{\ell}}^{+}(s,\ell)$$

$$\frac{sBJ_{n}}{\log\log n}\ \ldots$$

$$sHC_{n}\ \ldots$$

$$P\left(\sup_{t\in[a,b]}\frac{U(t)}{\sqrt{t(1-t)}}>\eta\right)\leq\frac{\frac{2}{\eta}+\eta\log\frac{b(1-a)}{a(1-b)}}{\sqrt{2\pi}}\,e^{-\eta^{2}/2}$$

$$P\left(\sup_{t\in[a,b]}n\left(F_{n}(t)\log\frac{F_{n}(t)}{t}+(1-F_{n}(t))\log\frac{1-F_{n}(t)}{1-t}\right)>\eta\right)\leq 2e\left(\eta\log\frac{b(1-a)}{a(1-b)}+1\right)\exp(-\eta)$$

$$P_{H_{0}}(BJ_{n}>\eta)\leq 22K(\log n)(\eta+1)\exp(-\eta)+2n^{1-K}$$

$$P_{H_{0}}(HC_{n}>\eta)\leq P\left(\sup_{t\in(0,1)}\sqrt{n}\,\frac{F_{n}(t)-t}{\sqrt{t(1-t)}}>\eta\right)\leq\frac{C}{\eta}$$

$$X_{i}=\mu\,\mathbf{1}_{I_{n}}(i)+Z_{i},\quad i=1,\ldots,n$$

$$\rho^{*}(1-\beta,\beta)=\beta=1-\alpha.$$

$$\mu>(1+\epsilon)\sqrt{2(1-\alpha)\log n}/\sqrt{n^{\alpha}}=(1+\epsilon)\sqrt{2\log\frac{n}{|I_{n}|}}/\sqrt{|I_{n}|},$$

$$\rho_{HC}^{*}(\alpha,\beta)=\begin{cases}\left(\beta-\frac{1}{2}\right)n^{\alpha}, & \frac{1}{2}<\beta<\frac{3}{4},\\ \left(1-\sqrt{1-\beta}\right)^{2}n^{\alpha}, & \beta\geq\frac{3}{4}.\end{cases}$$

$$\rho_{HC}^{*}(\alpha,\beta)=\beta-\frac{1-\alpha}{2}.$$

$$P_{n}=\max_{0\leq j<k\leq n}\left(\frac{\sum_{i=j+1}^{k}X_{i}}{\sqrt{k-j}}-\sqrt{2\log\frac{en}{k-j}}\right)$$

$$P_{n}^{\mathrm{app}}=\max_{I\in\bigcup_{\ell=0}^{\ell_{\mathrm{max}}}\mathcal{I}_{\mathrm{app}}(\ell)}\left(\frac{\sum_{i\in I}X_{i}}{\sqrt{|I|}}-\sqrt{2\log\frac{en}{|I|}}\right)$$

$$\rho_{\mathrm{pen}}^{*}(\alpha,\beta)=\begin{cases}\beta-(1-\alpha)/2 & \text{if }\beta/(1-\alpha)<\frac{1}{2},\\ \left(\sqrt{1-\alpha}-\sqrt{1-\alpha-\beta}\right)^{2} & \text{if }\beta/(1-\alpha)>\frac{1}{2},\end{cases}$$

$$X_{ij}=\mu\,\mathbf{1}_{\cup_{g=1}^{m}I_{g}}(i,j)+Z_{ij},\quad i,j=1,\ldots,n,$$

Taxonomy

Topics: Advanced Statistical Process Monitoring · Statistical Methods and Inference · Statistical Methods and Bayesian Inference

Full text

Large-scale inference with block structure

Jiyao Kou and Guenther Walther

Stanford University. Email addresses: [email protected], [email protected]. Supported by NSF grants DMS-1220311 and DMS-1501767.

(December 2021)

Abstract

The detection of weak and rare effects in large amounts of data arises in a number of modern data analysis problems. Known results show that in this situation the potential of statistical inference is severely limited by the large-scale multiple testing that is inherent in these problems. Here we show that fundamentally more powerful statistical inference is possible when there is some structure in the signal that can be exploited, e.g. if the signal is clustered in many small blocks, as is the case in some relevant applications. We derive the detection boundary in such a situation where we allow both the number of blocks and the block length to grow polynomially with sample size. We derive these results both for the univariate and the multivariate settings as well as for the problem of detecting clusters in a network. These results recover as special cases the sparse signal detection problem Donoho and Jin, (2004) where there is no structure in the signal, as well as the scan problem Chan and Walther, (2013) where the signal comprises a single interval. We develop methodology that allows optimal adaptive detection in the general setting, thus exploiting the structure if it is present without incurring a relevant penalty in the case where there is no structure. The advantage of this methodology can be considerable, as in the case of no structure the means need to increase at the rate $\sqrt{\log n}$ to ensure detection, while the presence of structure allows detection even if the means decrease at a polynomial rate.

Keywords: Block structure; heterogeneous mixture detection; sparse signal detection; structured higher criticism; structured Berk-Jones statistic; structured $\phi$ -divergence; tail bound for supremum of standardized Brownian Bridge; tail bound for supremum of binomial log likelihood ratio process; tail bound for higher criticism statistic and Berk-Jones statistic

MSC 2000 subject classifications. Primary 62G10; secondary 62G32.

1 Introduction

The problem of detecting a signal, such as an elevated mean, in a high-dimensional vector of Gaussian observations has been of considerable interest as it serves as the statistical model for the multiple testing of a large number of hypotheses. This problem has been studied in detail for the important setting where the signal is sparse and weak, see Section 1.1. An important result of this research is that detection of the signal is impossible unless the signal mean is at least of the order $\sqrt{\log n}$, where $n$ is the sample size. This is a somewhat discouraging result in the context of typical statistical inference problems, where a larger sample size usually allows one to detect a smaller mean. In fact, there is an earlier body of research that considers the above detection problem in the case where the signal is aligned consecutively in an interval rather than scattered at random. It can be shown that in this “block signal detection problem” it is possible to detect much smaller means: scan statistics can detect means that are sparse and weak and yet may decrease at a rate that is polynomial in $n$, see Section 5 below. The stark contrast between these two results suggests that it may be possible to perform statistical inference in the sparse and weak setting that is more powerful in a fundamental and relevant way, provided there is some kind of structure in the signal that can be exploited.

This paper develops methodology that is adaptive to such structure, i.e. it automatically exploits structure that may be present in the data. We consider a model where the signal is comprised of potentially many small blocks that are scattered at random in the sequence, the “multiple blocks detection problem”. Two examples of such data are:

(i) (Epidemic) Each location represents a one kilometer by one kilometer square and the proportion of citizens that is diseased is measured in each location. When there is a disease outbreak in the city, many independent “areas” will have unusually high values, where “area” is defined as a two-dimensional block of locations. The task is to detect whether there is a disease outbreak or not; see, for example, Kulldorff, (1999); Gangnon and Clayton, (2001); Chan, (2009); Walther, (2010). In this example, the structure is “spatial”.

(ii) (Financial) At each timestamp, we measure the predictive power of a particular technical indicator for the S&P 500. Many periods with unusually high predictive power indicate the potential usefulness of the technical indicator for future trading, where “period” is defined as a block of timestamps. The task is to detect whether the technical indicator is useful or not. In this example, the structure is “temporal”.

In this paper we analyze the general setting of the multiple blocks detection problem where both the number of blocks and the block length can grow polynomially with the sample size. Note that this model contains both the sparse signal detection problem as well as the block signal detection problem as special cases. We establish the detection boundary in this setting and introduce methodology that allows optimal adaptive detection. That is, the methodology introduced below will automatically utilize such structure if it is present and provide optimal detection both when structure is present and when it is not. Therefore this methodology is preferable whether prior information about the number of blocks or the block length is available or not.

1.1 Review of sparse signal detection

Consider an $n$ -dimensional Gaussian vector with components

$$X_{i}=\mu\,\mathbf{1}_{\{\ell_{1},\ldots,\ell_{m}\}}(i)+Z_{i},\quad i=1,\ldots,n,\tag{1}$$

where the $Z_{i}$ are i.i.d. standard normal random variables and the $m=n^{1-\beta}$, $0<\beta\leq 1$, signal locations $\ell_{1},\ldots,\ell_{m}$ are randomly drawn from $\{1,2,\ldots,n\}$ without replacement. We test $H_{0}:\mu=0$ vs. $H_{1}:\mu=\mu(n)>0$.

This sparse signal model (or the closely related sparse heterogeneous mixture model where $m\sim{\rm Bin}(n,n^{-\beta})$ ) has been investigated by Ingster, (1997, 1998); Ingster and Suslina, (2002); Donoho and Jin, (2004); Cai et al., (2011). The extension to the case of dependent observations was investigated in Delaigle and Hall, (2009); Hall and Jin, (2010); Zhong et al., (2013). In a related context, Ingster et al., (2009) consider the problem of classifying a high-dimensional vector from (1) as null or alternative, based on a training sample of several i.i.d. vectors from the alternative, and they derive a sharp classification boundary and classifiers attaining this boundary. Verzelen and Arias-Castro, (2017) consider the setting where one observes several i.i.d. copies of a high-dimensional vector from a mixture of two Gaussians, assuming that the difference vector of the two Gaussian means is sparse. They consider the problems of testing whether the difference in mean is zero and estimating which coordinates of the difference are non-zero, and they derive minimax lower bounds and propose a number of methods which attain these bounds.

In the testing problem for model (1) it turns out that there is a threshold effect for the likelihood ratio test. In the sparse regime where $\frac{1}{2}<\beta\leq 1$, one calibrates $\mu=\mu(n)=\sqrt{2r\log n}$, where $0<r\leq 1$. By Ingster, (1997, 1998); Donoho and Jin, (2004), the *detection boundary* is defined as:

$$\rho^{*}(\beta)=\begin{cases}\beta-\frac{1}{2}, & \frac{1}{2}<\beta\leq\frac{3}{4},\\ \left(1-\sqrt{1-\beta}\right)^{2}, & \frac{3}{4}<\beta\leq 1.\end{cases}\tag{2}$$

If $r>\rho^{*}(\beta)$, then $H_{0}$ and $H_{1}$ separate asymptotically, i.e. the elevated mean can be detected with asymptotic probability one, while if $r<\rho^{*}(\beta)$, then $H_{0}$ and $H_{1}$ merge asymptotically, i.e. it is impossible to detect the elevated mean with power larger than the significance level. Unfortunately, the likelihood ratio test requires a precise specification of $r$ and $\beta$, so one would like to have a method which is adaptive to the unknown $r$ and $\beta$ and performs as well as the likelihood ratio test.
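The boundary in the sparse regime is easy to evaluate numerically. The following is a minimal sketch, assuming the piecewise form $\rho^{*}(\beta)=\beta-\frac{1}{2}$ for $\frac{1}{2}<\beta\leq\frac{3}{4}$ and $(1-\sqrt{1-\beta})^{2}$ for $\frac{3}{4}<\beta\leq 1$; the function names `rho_star_sparse` and `separates_asymptotically` are hypothetical and chosen here for illustration only.

```python
import math


def rho_star_sparse(beta: float) -> float:
    """Detection boundary (2) in the sparse regime 1/2 < beta <= 1."""
    if not 0.5 < beta <= 1.0:
        raise ValueError("boundary (2) is stated for 1/2 < beta <= 1")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2


def separates_asymptotically(r: float, beta: float) -> bool:
    """True if r lies strictly above the boundary for mu = sqrt(2 r log n)."""
    return r > rho_star_sparse(beta)


if __name__ == "__main__":
    for beta in (0.6, 0.75, 0.9, 1.0):
        print(beta, rho_star_sparse(beta))
```

The two branches agree at $\beta=\frac{3}{4}$, where both give $\frac{1}{4}$, so the boundary is continuous there.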

Ingster and Suslina, (2002) introduced such an adaptive method by combining three different procedures, while Donoho and Jin, (2004) proposed the following higher criticism (HC):

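$$HC_{n}=\max_{1\leq i\leq\frac{n}{2}}\sqrt{n}\,\frac{\frac{i}{n}-p_{(i)}}{\sqrt{p_{(i)}(1-p_{(i)})}},\tag{3}$$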

where $p_{i}:=\bar{\Phi}(X_{i})$ is the p-value for $X_{i}$ and the $p_{(i)}$ denote the p-values sorted in increasing order. It can be shown that HC can attain the detection boundary so it is optimal for sparse signal detection: HC will separate the two hypotheses asymptotically whenever the likelihood ratio test can asymptotically separate the two hypotheses.
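As an illustration, the one-sided higher criticism can be computed directly from the sorted p-values. This is a sketch using only the Python standard library; it assumes the standardized form $\sqrt{n}\,(i/n-p_{(i)})/\sqrt{p_{(i)}(1-p_{(i)})}$ maximized over $1\leq i\leq n/2$. The helper names `normal_sf` and `higher_criticism`, and the clipping of p-values away from 0 and 1, are choices made here and not part of the source.

```python
import math


def normal_sf(z: float) -> float:
    """Upper tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def higher_criticism(x) -> float:
    """One-sided HC statistic of the observations x, maximized over i <= n/2."""
    n = len(x)
    p = sorted(normal_sf(v) for v in x)  # p_(1) <= ... <= p_(n)
    best = -math.inf
    for i in range(1, n // 2 + 1):
        # Clip to avoid division by zero for extreme observations.
        p_i = min(max(p[i - 1], 1e-300), 1.0 - 1e-16)
        value = math.sqrt(n) * (i / n - p_i) / math.sqrt(p_i * (1.0 - p_i))
        best = max(best, value)
    return best


if __name__ == "__main__":
    print(higher_criticism([0.0] * 9 + [5.0]))
```

Sorting dominates the cost, so the statistic takes $O(n\log n)$ time.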

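Another popular choice is the Berk-Jones statistic (BJ), which is defined as follows:

$$BJ_{n}=\max_{1\leq i\leq\frac{n}{2}}\left(i\log\frac{i}{np_{(i)}}+(n-i)\log\frac{1-\frac{i}{n}}{1-p_{(i)}}\right)\mathbf{1}\left(p_{(i)}<\frac{i}{n}\right),\tag{4}$$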

BJ is also optimal for sparse signal detection, see Donoho and Jin, (2004), and its finite sample performance appears to be better than that of HC, see Walther, (2013); Li and Siegmund, (2015).
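The Berk-Jones statistic admits an equally short implementation. The sketch below assumes the binomial log likelihood ratio form $i\log\frac{i}{np_{(i)}}+(n-i)\log\frac{1-i/n}{1-p_{(i)}}$, counted only when $p_{(i)}<i/n$ and maximized over $1\leq i\leq n/2$; the names `normal_sf` and `berk_jones` and the numerical clipping are illustrative assumptions.

```python
import math


def normal_sf(z: float) -> float:
    """Upper tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def berk_jones(x) -> float:
    """One-sided Berk-Jones statistic of the observations x."""
    n = len(x)
    p = sorted(normal_sf(v) for v in x)
    best = 0.0  # the indicator makes every term nonnegative
    for i in range(1, n // 2 + 1):
        p_i = min(max(p[i - 1], 1e-300), 1.0 - 1e-16)
        if p_i < i / n:
            value = i * math.log(i / (n * p_i)) + (n - i) * math.log(
                (1.0 - i / n) / (1.0 - p_i)
            )
            best = max(best, value)
    return best


if __name__ == "__main__":
    print(berk_jones([0.0] * 9 + [5.0]))
```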

Some alternative tests were studied in Jager and Wellner, (2007); Zhong et al., (2013); Walther, (2013). In particular, Jager and Wellner, (2007) have shown that all members of the $\phi$ -divergence family $S_{n}^{+}(s),s\in[-1,2]$ , attain the detection boundary (2), where $S_{n}^{+}(s)=n\,\max_{1\leq i\leq\frac{n}{2}}K_{s}\left(\frac{i}{n},p_{(i)}\right)\mathbf{1}\left(p_{(i)}<\frac{i}{n}\right)$ and $K_{s}(\cdot,\cdot)$ is given in Jager and Wellner, (2007). This family contains as special cases the Berk-Jones statistic ( $s=1$ ) and for $s=2$ a statistic that is equivalent to the higher criticism: $S_{n}^{+}(2)=\frac{1}{2}(HC_{n}^{+})^{2}$ .

Ingster and Suslina, (2002); Cai et al., (2011) extend the detection boundary to the case $0<\beta<\frac{1}{2}$ which Cai et al., (2011) call the *dense regime* (the designation *moderately sparse* in Ingster et al., (2009) is perhaps more apt), and they show that there is also a threshold effect for the likelihood ratio test. In the dense case one needs to calibrate $\mu=\mu(n)=n^{r}$. Then the detection boundary is defined as:

$$\rho^{*}(\beta)=\beta-\frac{1}{2}.\tag{5}$$

If $r>\rho^{*}(\beta)$ , $H_{0}$ and $H_{1}$ separate asymptotically and if $r<\rho^{*}(\beta)$ , $H_{0}$ and $H_{1}$ merge asymptotically. It is shown in Cai et al., (2011) that HC is also optimal for detection in the dense case. Note that the dense case is much less challenging since even a simple z-test will do very well, see Ingster and Suslina, (2002).

1.2 Organization of the paper and notation

In Section 2 we introduce the definition of the multiple blocks model and derive the detection boundary for this model. In Section 3 we propose procedures for detection in this model, namely the structured higher criticism and structured Berk-Jones statistics, and more generally the family of structured $\phi$ -divergences, and we evaluate their properties under the null distribution. This section also derives tail bounds for the higher criticism and Berk-Jones statistics which may be of independent interest. In Section 4, we establish the optimality of these statistics for the multiple blocks model. In Section 5, we compare the performance of structured higher criticism and structured Berk-Jones statistics with other methods. In Section 6, a simulation study is carried out to illustrate our results. Section 7 treats the multivariate case, and Section 8 deals with clusters in a network. Section 9 addresses composite alternatives. In Section 10, we discuss some possible extensions and future research topics. All proofs of the main theorems and propositions are put in Section 11. Some technical arguments are deferred to the Appendix.

We denote the number of design points contained in a set $I$ by $|I|$ . For the half-open intervals and rectangles we consider here this will typically be equal to the Lebesgue measure of $I$ . $L_{n}$ denotes terms satisfying $\log L_{n}=o(\log n)$ , which may vary from place to place. Note that for all fixed $\epsilon>0$ , $L_{n}n^{\epsilon}\rightarrow\infty$ and $L_{n}n^{-\epsilon}\rightarrow 0$ as $n\rightarrow\infty$ . We employ the usual $O_{p}$ and $o_{p}$ notation for a sequence of random variables $X_{n}$ and in addition write $X_{n}=\Omega_{p}(a_{n})$ if for every $\epsilon\in(0,1)$ there exists a finite $M>0$ such that $P(|X_{n}/a_{n}|<M)<\epsilon$ for all $n$ that are large enough. In this paper, $\log n$ is used for the natural logarithm while $\log_{2}n$ is used for logarithm to the base 2.

2 The multiple blocks model

The sparse signal model (1) posits that there is no structure in the signal. However, it turns out that if some structure does exist, then the detection problem becomes easier in a fundamental way and a much better result is attainable. Specifically, in this paper, we consider the situation where the signal is clustered into multiple blocks with unknown length. We call this the multiple blocks model:

$$X_{i}=\mu\,\mathbf{1}_{\bigcup_{g=1}^{m}I_{g}}(i)+Z_{i},\quad i=1,\ldots,n,\tag{6}$$

where the $I_{g}$ are mutually disjoint intervals at random locations and the $Z_{i}$ are i.i.d. standard normal random variables. The difficulty of this detection problem depends on the size of $\mu$, the number $m$ of blocks, and the minimum block length $\min_{g}|I_{g}|$. In order to derive a succinct theoretical result about the detection boundary we let the number of blocks $m=n^{1-\alpha-\beta}$ and assume that each of the unknown blocks $I_{g}$ has equal length $|I_{g}|=n^{\alpha}$, $g=1,\ldots,m$, where $0\leq\alpha<1$ and $0<\alpha+\beta\leq 1$. The task is to test $H_{0}:\mu=0$ vs. $H_{1}:\mu=\mu(n)>0$. All of the following results can be reformulated for unequal block lengths in terms of $\min_{g}|I_{g}|$ by using minimax statements for the lower bounds.

If $\alpha=0$ , then $|I_{g}|=1$ for all $g=1,\ldots,m$ and we obtain the sparse signal model (1). If $\alpha=1-\beta$ , then we only have $m=1$ block, and our problem reduces to the block signal detection model (11) discussed below. Thus our multiple blocks model is a generalization of both the sparse signal detection problem and the block signal detection problem.
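To make the model concrete, here is a small simulation sketch. It assumes the sparse-case calibration $\mu=\sqrt{2r\log n/n^{\alpha}}$, rounds $n^{\alpha}$ and $n^{1-\alpha-\beta}$ to integers, and places the disjoint blocks by a standard "sample compressed start positions, then spread them out" device. The function name `simulate_multiple_blocks`, the rounding, and the placement scheme are assumptions made here; the source only requires disjoint intervals at random locations.

```python
import math
import random


def simulate_multiple_blocks(n, alpha, beta, r, seed=0):
    """Draw X_1..X_n with m blocks of length L carrying mean mu plus N(0,1) noise.

    Returns (x, blocks, mu) where blocks are half-open index ranges [a, b).
    """
    rng = random.Random(seed)
    length = max(1, round(n ** alpha))              # block length ~ n^alpha
    m = max(1, round(n ** (1.0 - alpha - beta)))    # number of blocks
    mu = math.sqrt(2.0 * r * math.log(n) / length)  # sparse-case calibration
    free = n - m * (length - 1)
    if free < m:
        raise ValueError("blocks do not fit into n positions")
    # Sample m distinct compressed positions, then shift the g-th one by
    # g * (length - 1) so that consecutive blocks cannot overlap.
    starts = sorted(rng.sample(range(free), m))
    blocks = [(s + g * (length - 1), s + g * (length - 1) + length)
              for g, s in enumerate(starts)]
    signal = [0.0] * n
    for a, b in blocks:
        for i in range(a, b):
            signal[i] = mu
    x = [signal[i] + rng.gauss(0.0, 1.0) for i in range(n)]
    return x, blocks, mu


if __name__ == "__main__":
    x, blocks, mu = simulate_multiple_blocks(1024, 0.3, 0.4, r=1.0, seed=1)
    print(len(blocks), blocks[:3], round(mu, 3))
```

Setting `alpha=0` gives single-point signals as in the sparse signal model, and choosing `alpha = 1 - beta` gives a single block.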

Another special case of the multiple blocks model is investigated by Jeng et al., (2010). They use a likelihood ratio selection procedure for detecting very sparse and very short segments of elevated means, i.e. both the number and the lengths of segments grow at most logarithmically with sample size. Their results suggest that this likelihood procedure will not be able to attain the detection boundary for the more general model considered here.

The multiple blocks model describes a situation where the signal arises in many locations in the form of small clusters. While this model can be analyzed with HC or BJ, the results in Sections 1.1 and 5 suggest that such an analysis would be quite suboptimal: In the sparse case $\beta>\frac{1}{2}$ , HC and BJ require that each of the $n^{1-\beta}$ signal means is of size at least $\sqrt{c(\beta)\log n}$ for some constant $c(\beta)$ . In contrast, if the $n^{1-\beta}$ signal means are aligned in one single interval, then a certain scan statistic will detect signal means as small as $\frac{\sqrt{2\beta\log n}}{n^{(1-\beta)/2}}$ , which is a drastically smaller threshold, see Section 5. This is due to the square root law which the scan exploits in this situation. These results suggest that likewise in the multiple blocks model it should be possible to drastically improve upon the power of HC and BJ by exploiting the structure of the signal. It will be shown below how this can be done by introducing the structured HC and BJ statistics. To this end, we first derive the detection boundary for this problem.

2.1 The detection boundary for the multiple blocks model

As in the sparse signal detection problem, the calibration of the detection boundary differs in the sparse and in the dense case, but the sparse case is now defined by the condition $\frac{\beta}{1-\alpha}>\frac{1}{2}$ .

Theorem 2.1.

Consider the multiple blocks model (6).

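(i) (Sparse case) If $\frac{\beta}{1-\alpha}>\frac{1}{2}$ set

$$\mu=\mu(n)=\sqrt{2r\log n}/\sqrt{n^{\alpha}}\tag{7}$$

with $r>0$ and

$$\rho^{*}(\alpha,\beta)=\begin{cases}\beta-(1-\alpha)/2 & \text{if }\beta/(1-\alpha)<\frac{3}{4},\\ \left(\sqrt{1-\alpha}-\sqrt{1-\alpha-\beta}\right)^{2} & \text{if }\beta/(1-\alpha)\geq\frac{3}{4}.\end{cases}\tag{8}$$

(ii) (Dense case) If $\frac{\beta}{1-\alpha}<\frac{1}{2}$ set

$$\mu=\mu(n)=n^{r}/\sqrt{n^{\alpha}}\tag{9}$$

and

$$\rho^{*}(\alpha,\beta)=\beta-\frac{1-\alpha}{2}.\tag{10}$$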

If $r<\rho^{*}(\alpha,\beta)$ , then $H_{0}$ and $H_{1}$ merge asymptotically, i.e. the sum of Type I and Type II errors tends to 1 for any test.

The proof of the theorem can be derived from (2) and (5) by applying the square root law. The theorem shows that $\rho^{*}(\alpha,\beta)$ is a lower bound for model (6): If $r<\rho^{*}(\alpha,\beta)$ , then detection is not possible. In Sections 3 and 4 we will derive and investigate procedures which attain this lower bound when both the sparsity level and the block length are unknown, i.e. these procedures are adaptive to both $\alpha$ and $\beta$ . Hence $\rho^{*}(\alpha,\beta)$ does in fact describe the detection boundary for the model (6).
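The square root law reduction mentioned above can be sketched as follows. This is a heuristic, not the authors' proof: it makes the simplifying assumption that the blocks are aligned with a known partition of $\{1,\ldots,n\}$ into $n'=n^{1-\alpha}$ cells of length $n^{\alpha}$. The symbols $n'$, $\beta'$, $r'$ and $Y_{j}$ are introduced here for illustration only.

```latex
\begin{align*}
% Standardized cell sums: a signal cell has mean \mu\sqrt{n^{\alpha}} (square root law)
Y_j &= n^{-\alpha/2}\sum_{i \in \text{cell } j} X_i
     \sim N\bigl(\mu\sqrt{n^{\alpha}}\,\mathbf{1}(\text{cell } j \text{ is a signal block}),\,1\bigr),
     \qquad j=1,\ldots,n'=n^{1-\alpha}, \\
% Sparsity level on the scale of cells
m &= n^{1-\alpha-\beta} = (n')^{1-\beta'}, \qquad \beta' = \frac{\beta}{1-\alpha}, \\
% Sparse case, \beta' > 1/2: rewrite the calibration in terms of n'
\mu\sqrt{n^{\alpha}} &= \sqrt{2r\log n} = \sqrt{2r'\log n'}, \qquad r' = \frac{r}{1-\alpha}, \\
r' > \rho^{*}(\beta') &\iff r > (1-\alpha)\,\rho^{*}\!\Bigl(\frac{\beta}{1-\alpha}\Bigr)
   = \begin{cases}
       \beta-\frac{1-\alpha}{2}, & \frac{1}{2}<\frac{\beta}{1-\alpha}\leq\frac{3}{4},\\[2pt]
       \bigl(\sqrt{1-\alpha}-\sqrt{1-\alpha-\beta}\bigr)^{2}, & \frac{\beta}{1-\alpha}>\frac{3}{4},
     \end{cases} \\
% Dense case, \beta' < 1/2
\mu\sqrt{n^{\alpha}} &= n^{r} = (n')^{r'}, \qquad
r' > \beta'-\tfrac{1}{2} \iff r > \beta-\frac{1-\alpha}{2}.
\end{align*}
```

Under this assumption the cell sums form a sparse signal problem of size $n'$ with sparsity exponent $\beta'$, so the unstructured boundaries for the sparse and dense regimes translate into the two cases of the theorem.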

Note that the calibration of $\mu$ in Theorem 2.1 has the divisor $\sqrt{n^{\alpha}}$ which does not appear in the sparse signal detection problem. This shows that in the sparse case the multiple blocks model allows the detection of much smaller means. Even if the blocks are very short, say of length 2 or 3, this will improve upon the detection boundary (2). Longer blocks, e.g. of order $\log n$ or $n^{\alpha}$ , have an even more dramatic effect by changing the scaling of the detection boundary. It is interesting to note that in the dense case $\mu^{*}:=n^{\rho^{*}(\alpha,\beta)}/\sqrt{n^{\alpha}}=n^{\beta-\frac{1}{2}}$ does not depend on $\alpha$ , which suggests that the block structure may not be important anymore in the dense case. We discuss this issue further in Section 5.1.
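The effect of the block length on the smallest detectable mean can be explored numerically. The sketch below assumes the boundary stated in Theorem 2.1, namely $\beta-\frac{1-\alpha}{2}$ when $\frac{\beta}{1-\alpha}<\frac{3}{4}$ and $(\sqrt{1-\alpha}-\sqrt{1-\alpha-\beta})^{2}$ otherwise, together with the calibrations $\mu=\sqrt{2r\log n}/\sqrt{n^{\alpha}}$ (sparse) and $\mu=n^{r}/\sqrt{n^{\alpha}}$ (dense). The function names `rho_star` and `boundary_mean` are hypothetical.

```python
import math


def rho_star(alpha: float, beta: float) -> float:
    """Boundary rho*(alpha, beta); the case beta/(1-alpha) = 1/2 is not covered."""
    if not (0.0 <= alpha < 1.0 and beta > 0.0 and alpha + beta <= 1.0 + 1e-12):
        raise ValueError("need 0 <= alpha < 1 and 0 < alpha + beta <= 1")
    ratio = beta / (1.0 - alpha)
    if ratio >= 0.75:
        rest = max(0.0, 1.0 - alpha - beta)  # guard against rounding below zero
        return (math.sqrt(1.0 - alpha) - math.sqrt(rest)) ** 2
    # Same formula in the sparse range (1/2, 3/4) and in the dense range (< 1/2);
    # only the calibration of mu differs between the two.
    return beta - (1.0 - alpha) / 2.0


def boundary_mean(n: int, alpha: float, beta: float) -> float:
    """Mean mu obtained by plugging r = rho*(alpha, beta) into the calibration."""
    ratio = beta / (1.0 - alpha)
    if ratio == 0.5:
        raise ValueError("the theorem does not cover beta/(1-alpha) = 1/2")
    rho = rho_star(alpha, beta)
    if ratio > 0.5:  # sparse case
        return math.sqrt(2.0 * rho * math.log(n) / n ** alpha)
    return n ** rho / math.sqrt(n ** alpha)  # dense case


if __name__ == "__main__":
    n = 10 ** 6
    print(boundary_mean(n, 0.0, 0.9))  # no block structure
    print(boundary_mean(n, 0.5, 0.5))  # a single block of length n^0.5
    print(boundary_mean(n, 0.2, 0.2))  # dense case, equals n^(beta - 1/2)
```

In the dense case the returned value reduces to $n^{\beta-\frac{1}{2}}$ regardless of $\alpha$, in line with the remark above.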

3 The structured higher criticism and Berk-Jones statistics

In order to motivate our approach we note that detection in the multiple blocks model requires aggregating the evidence in the data in two ways: For a given candidate interval the evidence must be combined within that interval, e.g. by averaging the observations. Then this evidence must be aggregated across intervals by a multiple testing procedure such as HC. However, a straightforward implementation of this idea is not promising: The detection boundary (2) in the unstructured case is due to the multiple testing of $n$ p-values. If one were to compute a p-value for each candidate interval, then the ensuing massive multiple testing problem results in about $n^{2}$ p-values and HC may not attain the detection boundary (8). Moreover, many of these p-values will be highly correlated and so the usual critical values for HC are not applicable.

We circumvent these problems by considering an appropriate approximating set of intervals that possesses the following three properties: First, each of the about $n^{2}/2$ intervals with endpoints in $\{1,\ldots,n\}$ can be approximated sufficiently well by an interval in the approximating set so that the resulting approximation error to the signal does not detract from the detection boundary. Second, there are only $O(n\log n)$ intervals in the approximating set. As a consequence, the multiple testing does not become noticeably more difficult as HC still has to assess only of the order $n$ p-values rather than $n^{2}$ . Third, the approximating set is sparse enough to allow an analysis of the null distribution of HC in the context of independent p-values, as will be explained below.

These criteria are satisfied by the approximating set used in Walther, (2010); Rivera and Walther, (2013):

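For each level $\ell=0,\ldots,\ell_{\mathrm{max}}$, where $\ell_{\mathrm{max}}=\lceil\log_{2}\frac{n}{8}\rceil$:

$$\mathcal{I}_{\mathrm{app}}(\ell):=\Bigl\{(j,k]\subset(0,n]:\ j,k\in\{id_{\ell},\,i=0,1,\ldots\}\text{ and }2^{\ell-1}<k-j\leq 2^{\ell}\Bigr\}$$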

where $d_{\ell}=\lceil\epsilon_{\ell}2^{\ell-1}\rceil$ for $\epsilon_{\ell}=\frac{1}{6\sqrt{\log_{2}\frac{n}{2^{\ell-1}}}}=\frac{1}{6\sqrt{\ell_{\mathrm{max}}-\ell+4}}$ .

That is, the collection $\mathcal{I}_{\mathrm{app}}(\ell)$ approximates intervals with lengths in $(2^{\ell-1},2^{\ell}]$ via endpoints on a grid whose spacing is a fraction $\epsilon_{\ell}$ of the approximate interval length $2^{\ell-1}$ , where the precision parameter $\epsilon_{\ell}$ changes with the length of the intervals such that it produces a finer approximation for smaller intervals. The approximating set $\bigcup_{\ell}\mathcal{I}_{\mathrm{app}}(\ell)$ has cardinality $O(n\log n)$ but approximates all intervals sufficiently well to allow optimal inference, see Proposition 11.2 in Section 11 for a more precise statement of its properties.
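The construction can be enumerated directly. The following sketch assumes the definitions above: grid spacing $d_{\ell}=\lceil\epsilon_{\ell}2^{\ell-1}\rceil$ with $\epsilon_{\ell}=1/(6\sqrt{\log_{2}(n/2^{\ell-1})})$, endpoints on multiples of $d_{\ell}$, and lengths in $(2^{\ell-1},2^{\ell}]$. The function names are hypothetical, and the code requires $n\geq 8$ so that $\ell_{\mathrm{max}}\geq 0$.

```python
import math


def grid_spacing(n: int, level: int) -> int:
    """d_l = ceil(eps_l * 2^(l-1)), eps_l = 1 / (6 sqrt(log2(n / 2^(l-1))))."""
    half = 2.0 ** (level - 1)
    eps = 1.0 / (6.0 * math.sqrt(math.log2(n / half)))
    return math.ceil(eps * half)


def approximating_set(n: int, level: int):
    """All intervals (j, k] with grid endpoints and 2^(l-1) < k - j <= 2^l."""
    d = grid_spacing(n, level)
    low, high = 2.0 ** (level - 1), 2 ** level
    first = int(low // d) + 1  # smallest multiple of d strictly above 2^(l-1)
    intervals = []
    for j in range(0, n, d):
        t = first
        while j + t * d <= n and t * d <= high:
            intervals.append((j, j + t * d))
            t += 1
    return intervals


def approximating_sets(n: int):
    """Dictionary level -> I_app(level) for level = 0, ..., l_max."""
    l_max = math.ceil(math.log2(n / 8))
    return {level: approximating_set(n, level) for level in range(l_max + 1)}


if __name__ == "__main__":
    n = 1024
    sets = approximating_sets(n)
    total = sum(len(v) for v in sets.values())
    print(total, n * (n + 1) // 2)  # approximating set vs. all intervals
```

At level 0 the set consists of the $n$ single points, so the unstructured p-values are always part of the collection.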

Now we define structured higher criticism $sHC_{n}$ and structured Berk-Jones statistic $sBJ_{n}$ as follows:

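$$sHC_{n}=\max_{\ell=0,\ldots,\ell_{\mathrm{max}}}\sqrt{\frac{n}{2^{\ell}n_{\ell}}}\,HC_{n_{\ell}}(\ell),$$

$$sBJ_{n}=\max_{\ell=0,\ldots,\ell_{\mathrm{max}}}\frac{n}{2^{\ell}n_{\ell}}\,BJ_{n_{\ell}}(\ell),$$

where $HC_{n_{\ell}}(\ell)$ and $BJ_{n_{\ell}}(\ell)$ denote the one-sided higher criticism (3) and Berk-Jones statistic (4) evaluated on the $n_{\ell}:=\#\mathcal{I}_{\mathrm{app}}(\ell)$ p-values pertaining to $\mathcal{I}_{\mathrm{app}}(\ell)$, i.e. the p-values $\{\bar{\Phi}({\boldsymbol{X}}(I)),\,I\in\mathcal{I}_{\mathrm{app}}(\ell)\}$, where ${\boldsymbol{X}}(I):=\sum_{i\in I}X_{i}/\sqrt{|I|}$ is the standardized average over the interval $I$.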

More generally, we define for $s\in[-1,2]$ the structured $\phi$ -divergence

$$sS_{n}(s)=\max_{\ell=0,\ldots,\ell_{\mathrm{max}}}\frac{n}{2^{\ell}n_{\ell}}\,S_{n_{\ell}}^{+}(s,\ell)$$

where likewise $S_{n_{\ell}}^{+}(s,\ell)$ denotes the $\phi$ -divergence $S_{n_{\ell}}^{+}(s)$ defined in Section 1.1, evaluated on the p-values that pertain to $\mathcal{I}_{\mathrm{app}}(\ell)$ . In particular, $sS_{n}(1)=sBJ_{n}$ and $sS_{n}(2)$ is equivalent to $sHC_{n}$ .

The difficulty in analyzing the null distributions of $BJ_{n_{\ell}}(\ell)$ and $HC_{n_{\ell}}(\ell)$ lies in the fact that the underlying p-values are no longer independent because they are based on data pertaining to intervals that may overlap. The key to controlling those null distributions is the sparse construction of the approximating set $\mathcal{I}_{\mathrm{app}}(\ell)$: It is shown in Lemma 11.1 that the intervals in $\mathcal{I}_{\mathrm{app}}(\ell)$ can be grouped into a small number of groups such that each group contains about $\frac{n}{2^{\ell}}$ intervals that are disjoint and whose corresponding p-values are therefore independent. Hence the empirical measure of the p-values can be written as an average of a small number of empirical measures, each of which is based on independent p-values. This allows us to use Jensen’s inequality to bound $BJ_{n_{\ell}}(\ell)$ and $HC_{n_{\ell}}(\ell)$ by the maximum of a small number of such statistics, each of which is based on $\frac{n}{2^{\ell}}$ independent p-values. This maximum can then be controlled via tail bounds for these statistics. Furthermore, this explanation shows that the scaling in $HC_{n_{\ell}}(\ell)$ should be $\sqrt{\frac{n}{2^{\ell}}}$ rather than $\sqrt{n_{\ell}}$, hence the rescaling factor $\sqrt{\frac{n}{2^{\ell}n_{\ell}}}$ for $HC_{n_{\ell}}(\ell)$, and analogously for $BJ_{n_{\ell}}$ and the structured $\phi$-divergence.
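Putting the pieces together, the structured statistics can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' code: it uses prefix sums for the interval sums, the rescaling $\sqrt{n/(2^{\ell}n_{\ell})}$ for HC, and the factor $n/(2^{\ell}n_{\ell})$ without square root for BJ (our reading of "analogously", since BJ scales linearly in the number of p-values). All function names are hypothetical and $n\geq 8$ is required.

```python
import math
from itertools import accumulate


def normal_sf(z):
    """Upper tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def hc_and_bj(pvalues):
    """One-sided HC and BJ of a list of p-values, maximized over i <= n/2."""
    n = len(pvalues)
    p = sorted(pvalues)
    hc, bj = -math.inf, 0.0
    for i in range(1, n // 2 + 1):
        p_i = min(max(p[i - 1], 1e-300), 1.0 - 1e-16)
        hc = max(hc, math.sqrt(n) * (i / n - p_i) / math.sqrt(p_i * (1.0 - p_i)))
        if p_i < i / n:
            bj = max(bj, i * math.log(i / (n * p_i))
                     + (n - i) * math.log((1.0 - i / n) / (1.0 - p_i)))
    return hc, bj


def approximating_set(n, level):
    """Intervals (j, k] on the grid d_l with 2^(l-1) < k - j <= 2^l."""
    half = 2.0 ** (level - 1)
    d = math.ceil(half / (6.0 * math.sqrt(math.log2(n / half))))
    first = int(half // d) + 1
    out = []
    for j in range(0, n, d):
        t = first
        while j + t * d <= n and t * d <= 2 ** level:
            out.append((j, j + t * d))
            t += 1
    return out


def structured_statistics(x):
    """Return (sHC_n, sBJ_n) for the observations x."""
    n = len(x)
    prefix = [0.0] + list(accumulate(x))  # prefix[k] = x[0] + ... + x[k-1]
    l_max = math.ceil(math.log2(n / 8))
    shc, sbj = -math.inf, 0.0
    for level in range(l_max + 1):
        intervals = approximating_set(n, level)
        n_l = len(intervals)
        if n_l < 2:
            continue
        # p-value of the standardized sum over each interval (j, k]
        pvals = [normal_sf((prefix[k] - prefix[j]) / math.sqrt(k - j))
                 for j, k in intervals]
        hc, bj = hc_and_bj(pvals)
        scale = n / (2 ** level * n_l)
        shc = max(shc, math.sqrt(scale) * hc)
        sbj = max(sbj, scale * bj)
    return shc, sbj


if __name__ == "__main__":
    x = [0.0] * 64
    x[16:24] = [3.0] * 8
    print(structured_statistics(x))
```

Critical values are not included here; in practice they would have to come from the null results of this section or from simulation under $\mu=0$.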

Theorem 3.1.

Under the null hypothesis $\mu=0$ ,

[TABLE]

Note that under the null distribution $\frac{BJ_{n}}{\log\log n}\stackrel{{\scriptstyle p}}{{\rightarrow}}1$ and $\frac{HC_{n}}{\sqrt{2\log\log n}}\stackrel{{\scriptstyle p}}{{\rightarrow}}1$ , see Jager and Wellner, (2007). Thus the penalty for additionally examining structure in the data is at most a factor of 3 for $sBJ_{n}$ . In particular, the more general $sBJ_{n}$ is still optimal in the special case (1) when there is no structure in the signal, and likewise for $sHC_{n}$ .

As an aside, it is not clear that the result of Theorem 3.1 for $sHC_{n}$ can be improved as the smallest p-values have heavy tails, see Walther, (2013). While that can be controlled in the case of a single HC statistic, see e.g. pages 601-603 of Shorack and Wellner, (1986), $sHC_{n}$ is the maximum of $\sim\log n$ terms that involve HC statistics. In the context of a sparse Gaussian graphical model Fan et al., (2013) give a result about how to group correlated p-values to guarantee independence within each group, but this result is not applicable here.

For the proof of the theorem we will need the following tail bounds which may be of independent interest:

Proposition 3.2.

Let $F_{n}$ be the empirical cdf of $U_{1},\ldots,U_{n}$ i.i.d. $U(0,1)$ and let $U(\cdot)$ be a standard Brownian Bridge. For $0<a<b<1$ and $\eta>0$ :

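(i)

$$P\left(\sup_{t\in[a,b]}\frac{U(t)}{\sqrt{t(1-t)}}>\eta\right)\leq\frac{\frac{2}{\eta}+\eta\log\frac{b(1-a)}{a(1-b)}}{\sqrt{2\pi}}\,e^{-\eta^{2}/2}$$

(ii)

$$P\left(\sup_{t\in[a,b]}n\left(F_{n}(t)\log\frac{F_{n}(t)}{t}+(1-F_{n}(t))\log\frac{1-F_{n}(t)}{1-t}\right)>\eta\right)\leq 2e\left(\eta\log\frac{b(1-a)}{a(1-b)}+1\right)\exp(-\eta)$$

(iii)

For every $K>1$:

$$P_{H_{0}}(BJ_{n}>\eta)\leq 22K(\log n)(\eta+1)\exp(-\eta)+2n^{1-K}$$

(iv)

$$P_{H_{0}}(HC_{n}>\eta)\leq P\left(\sup_{t\in(0,1)}\sqrt{n}\,\frac{F_{n}(t)-t}{\sqrt{t(1-t)}}>\eta\right)\leq\frac{C}{\eta}$$

for $\eta\geq\sqrt{D\log\log n}$ with $D>2$, where the constant $C$ depends only on $D$.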

Miller and Siegmund, (1982) give a two-sided bound corresponding to (i) which holds asymptotically. (ii) improves the exponential bound provided in Duembgen and Wellner, (2014). As for (iv), there exists no exponential inequality for $HC_{n}$ due to the heavy algebraic tails of the smallest p-values, see Walther, (2013).
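As a small illustration of how bound (iii) might be used, the sketch below evaluates the right-hand side $22K(\log n)(\eta+1)e^{-\eta}+2n^{1-K}$ and searches for the smallest $\eta$ at which it drops below a given level, which yields a conservative critical value for $BJ_{n}$. The constants are taken as they appear in the equation list of this page; the function names and the bisection search are assumptions made here.

```python
import math


def bj_null_tail_bound(eta: float, n: int, K: float = 2.0) -> float:
    """Right-hand side of bound (iii) on P_H0(BJ_n > eta), for K > 1 and eta > 0."""
    if K <= 1.0 or eta <= 0.0:
        raise ValueError("need K > 1 and eta > 0")
    return 22.0 * K * math.log(n) * (eta + 1.0) * math.exp(-eta) + 2.0 * n ** (1.0 - K)


def conservative_critical_value(level: float, n: int, K: float = 2.0) -> float:
    """Smallest eta (up to 1e-9) with bound <= level, found by bisection.

    The bound is decreasing in eta because (eta + 1) * exp(-eta) is decreasing.
    """
    if 2.0 * n ** (1.0 - K) >= level:
        raise ValueError("level not attainable for this K; increase K")
    lo, hi = 1e-9, 1.0
    while bj_null_tail_bound(hi, n, K) > level:
        hi *= 2.0
    while hi - lo > 1e-9:
        mid = 0.5 * (lo + hi)
        if bj_null_tail_bound(mid, n, K) > level:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    print(conservative_critical_value(0.05, n=1000, K=3.0))
```

Because the bound is not sharp, a critical value obtained this way will typically be larger than one calibrated by simulation.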

4 Optimality of the structured higher criticism and structured Berk-Jones statistic for the multiple blocks model

The following theorem shows that every structured $\phi$ -divergence, and in particular the structured higher criticism and the structured Berk-Jones statistic, attain the lower bound established in Theorem 2.1 for the sparse case. The theorem also shows that the structured higher criticism statistic attains the lower bound in the dense case. Thus these procedures are optimal for detection in the multiple blocks model and are adaptive to both the unknown block length and the unknown sparsity level.

Theorem 4.1.

Consider the multiple blocks model (6).

(i)

In the sparse case $\frac{\beta}{1-\alpha}>\frac{1}{2}$ with the calibration (7) for the mean of the signal, let $r>\rho^{*}(\alpha,\beta)$ in (8). Then every member of the family of structured $\phi$ -divergences $sS_{n}(s)$ , $s\in[-1,2]$ , has asymptotic power 1.

(ii)

In the dense case $\frac{\beta}{1-\alpha}<\frac{1}{2}$ with the calibration (9) for the mean of the signal, let $r>\rho^{*}(\alpha,\beta)$ in (10). Then $sHC_{n}$ has asymptotic power 1.

We note that while there are $O(n^{2})$ possible intervals, the use of the approximation set makes it possible to compute these structured statistics in $O(n\log^{2}n)$ time, almost linear in the number of observations.

As a corollary to the above theorem we note that $sHC_{n}$ and $sBJ_{n}$ are optimal for sparse signal detection and for block signal detection, which are special cases of the model (6):

Corollary 4.2.

$sHC_{n}$ and $sBJ_{n}$ achieve the optimal detection boundary (2) in the sparse signal model (1).

The corollary follows upon observing that the sparse signal model (1) obtains as the special case $\alpha=0$ . By Theorem 4.1(i), $sHC_{n}$ and $sBJ_{n}$ can reliably detect the alternative if $r>\rho^{*}(0,\beta)$ in (8), which equals the detection boundary (2).

In the block signal detection problem the signal is aligned in an interval $I_{n}$ , i.e.

[TABLE]

Corollary 4.3.

$sHC_{n}$ and $sBJ_{n}$ achieve the optimal detection boundary for the block signal detection problem (11).

The block signal detection problem corresponds to $\alpha=1-\beta$ . By Theorem 4.1(i), $sBJ_{n}$ and $sHC_{n}$ can reliably detect the alternative if $r>\rho^{*}(1-\beta,\beta)$ in (8), where

[TABLE]

Thus, when writing the alternative in terms of $\mu$ , we can reliably detect the alternative if

[TABLE]

for any $\epsilon>0$ , which matches the optimal detection boundary for block signal detection given in Section 5 in terms of rate and constant. (The more refined result in Section 5 even allows $\epsilon_{n}\downarrow 0$ at a certain rate for the penalized scan, and it is not clear whether $sBJ_{n}$ or $sHC_{n}$ can attain that behavior near the boundary.)

5 Comparison with other methods

In this section we compare structured BJ and HC with relevant other methodology in terms of their theoretical performance. Section 6 will complement this comparison with a simulation study.

Perhaps the most obvious approach to the multiple blocks model is to directly use HC or BJ. Note that this approach ignores the block structure in the data.

In the sparse unstructured case $\frac{1}{2}<\beta\leq 1$ , if we use the calibration (7), then the detection boundary (2) for HC becomes

[TABLE]

In the dense unstructured case $0<\beta<\frac{1}{2}$ , if we use the calibration (9), then the detection boundary (5) for HC becomes

[TABLE]

While the above detection boundaries are for the unstructured case, it follows that HC and BJ cannot improve on these boundaries in the multiple blocks model because they are invariant under permutations of the observations and hence the block structure has no effect on the inference. Therefore HC and sHC compare as follows:

1.

When $\beta>\frac{1}{2}$ (and so $\frac{\beta}{1-\alpha}>\frac{1}{2}$), then both HC and sHC are in the sparse regime. Compared to sHC, the detection boundary for HC is increased by a factor of $\sqrt{n^{\alpha}}$. Unless $\alpha=0$ (i.e. the length of the block is 1), the loss of power of HC is significant.

2.

When $\frac{1-\alpha}{2}<\beta<\frac{1}{2}$ , then HC is in the dense regime and sHC is in the sparse regime. Nevertheless, sHC has a more favorable detection boundary: Compared to sHC, the detection boundary for HC is increased by a factor (up to a $\log n$ term) of $\frac{\sqrt{n^{\alpha}}}{n^{\frac{1}{2}-\beta}}=n^{\beta-\frac{1-\alpha}{2}}$ , which grows polynomially with $n$ . Therefore the loss of power of HC is also significant.

3.

When $\beta<\frac{1-\alpha}{2}$ (and so $\beta<\frac{1}{2}$), then both HC and sHC are in the dense regime. The detection boundaries are the same for both methods and thus both HC and sHC are optimal for the multiple blocks model. The reason for this is that now the fraction of elevated means is so large that the block structure does not provide a noticeable benefit any more.
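The three cases above can be summarized in a few lines of Python. This is a minimal sketch with hypothetical function names. It only encodes the case distinction and the polynomial exponent of the factor by which the detection boundary of HC exceeds that of sHC, ignoring logarithmic terms. Parameter values exactly on a regime boundary, which the text excludes, are assigned to the dense branch here as a convention of ours.

```python
def regimes(alpha, beta):
    """Return (regime of HC, regime of sHC) for block-length exponent alpha
    and sparsity exponent beta. HC ignores the blocks, so its regime depends
    on beta alone; sHC sees the effective sparsity beta / (1 - alpha)."""
    hc = "sparse" if beta > 0.5 else "dense"
    shc = "sparse" if beta / (1 - alpha) > 0.5 else "dense"
    return hc, shc


def hc_loss_exponent(alpha, beta):
    """Exponent e such that HC's detection boundary exceeds that of sHC by a
    factor n**e (log terms ignored), following cases 1-3 in the text."""
    hc, shc = regimes(alpha, beta)
    if hc == "sparse":
        return alpha / 2                 # case 1: factor sqrt(n**alpha)
    if shc == "sparse":
        return beta - (1 - alpha) / 2    # case 2: factor n**(beta - (1-alpha)/2)
    return 0.0                           # case 3: identical boundaries


# The three parameter settings used in the simulation study of Section 6.
for alpha, beta in [(0.2, 0.65), (0.2, 0.48), (0.3, 0.25)]:
    print(alpha, beta, regimes(alpha, beta), hc_loss_exponent(alpha, beta))
```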

In light of the block structure in the data, another approach would be to use a scan statistic. Note that a scan statistic is designed to detect a signal on an interval but not to aggregate the evidence across multiple intervals. It is shown in Chan and Walther (2013) that the scan with scale-dependent critical values, such as the penalized scan

[TABLE]

dominates the regular scan, so we will only discuss the former. Moreover, it is shown in Chan and Walther (2013) that evaluating the penalized scan on an approximating set:

[TABLE]

will not detract from its performance, while reducing the computational effort from $O(n^{2})$ to $O(n\log n)$ .

For the block signal detection problem (11) where the signal is aligned in an interval $I_{n}$, it is shown in Chan and Walther (2013) that $P_{n}^{\mathrm{app}}$ has asymptotic power one if $\mu=\mu(n)\geq(\sqrt{2}+\epsilon_{n})\sqrt{\log\frac{en}{|I_{n}|}}/\sqrt{|I_{n}|}$ with $\epsilon_{n}\sqrt{\log\frac{en}{|I_{n}|}}\rightarrow\infty$, while no consistent test exists if $\mu=\mu(n)\leq(\sqrt{2}-\epsilon_{n})\sqrt{\log\frac{en}{|I_{n}|}}/\sqrt{|I_{n}|}$ with $\epsilon_{n}\sqrt{\log\frac{en}{|I_{n}|}}\rightarrow\infty$. Thus $P_{n}$ and $P_{n}^{\mathrm{app}}$ are optimal tests if the signal is aligned in a single interval. See Walther (2010) for a corresponding result in the multivariate case and Arias-Castro et al. (2005) for earlier work deriving the threshold $\sqrt{2\log n}/\sqrt{|I_{n}|}$ for the regular (unpenalized) scan, which is optimal for very short interval lengths $|I_{n}|$ up to about $\log n$.

If we consider instead the multiple blocks model (6), then we obtain the following result:

Theorem 5.1.

In the multiple blocks model (6) the detection boundary for the penalized scans $P_{n}$ and $P_{n}^{\mathrm{app}}$ given in (12) and (13) is

[TABLE]

with calibration (9) in the first case and calibration (7) in the second.

Thus the penalized scan attains the optimal detection boundary except in the case $\frac{1-\alpha}{2}<\beta<\frac{3(1-\alpha)}{4}$ , where $\rho_{\mathrm{pen}}^{*}(\alpha,\beta)$ is larger than $\rho^{*}(\alpha,\beta)$ given in (8).

5.1 Discussion: What matters for good inference?

Efficient inference in the multiple blocks model requires combining the evidence in two different ways: the evidence within a block needs to be combined in order to make use of the square root law, and then this evidence needs to be aggregated across blocks.

In the very sparse case $\frac{\beta}{1-\alpha}\geq\frac{3}{4}$ , the block structure is the most important aspect. In order to aggregate the information across blocks it is sufficient to simply scan for the maximum of the within-block statistics. For this reason, the penalized scan and sHC/sBJ perform well, whereas HC and BJ exhibit a severe loss of power because they do not make use of the structure in the signal and therefore forego the considerable advantage that derives from the square root law.

In the sparse case $\frac{1}{2}<\frac{\beta}{1-\alpha}<\frac{3}{4}$ , the block structure is still very important. However, optimally aggregating the information across blocks requires an approach that is more sophisticated than simply scanning for the maximum of the within-block statistics. This explains why sHC/sBJ are optimal while HC and BJ still exhibit a severe loss of power as they do not make use of the structure in the signal.

The dense case $\frac{\beta}{1-\alpha}<\frac{1}{2}$ turns out to be the regime where the structure in the signal is of no help for inference any more. The reason for this perhaps surprising fact is that the fraction of elevated means is now so large that asymptotic optimality obtains via the square root law by simply averaging all observations, i.e. performing a z-test. While HC and BJ are geared towards the sparse case, they do attain the detection boundary in this dense case also, and so do the structured versions sHC and sBJ and the penalized scan.

6 Simulation study

This section provides a simulation study that compares the performance of sHC, sBJ, HC, BJ, and the penalized scan. The sample size is $n=10000$ and power is with respect to a significance level of $5\%$ . Critical values for this significance level were simulated with 10000 simulations and power was estimated with 2000 simulations.
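For readers who wish to set up a comparable experiment, the following is a minimal sketch of how data from the multiple blocks model could be generated in the sparse calibration. It is not the code used for the study. As assumptions of ours, the block length $n^{\alpha}$ and the number of blocks $n^{1-\alpha-\beta}$ are rounded to integers, the blocks are placed on a grid of multiples of the block length (as in the submodel (A) used in the proof of Theorem 2.1), and the signal mean is taken as $\mu=\sqrt{2r\log n/n^{\alpha}}$, which is the relation $\mu\sqrt{n^{\alpha}}=\sqrt{2r\log n}$ appearing in that proof. The function name is hypothetical.

```python
import math
import random


def simulate_blocks(n, alpha, beta, r, rng):
    """Draw one sample of length n with m disjoint elevated blocks.

    Assumptions (ours): rounding of the block length and block count, and
    block starts restricted to multiples of the block length.
    """
    block_len = max(1, round(n ** alpha))
    m = max(1, round(n ** (1 - alpha - beta)))
    # Sparse calibration: mu * sqrt(block length) = sqrt(2 r log n).
    mu = math.sqrt(2 * r * math.log(n) / block_len)
    # Sampling grid cells without replacement keeps the blocks disjoint.
    starts = rng.sample(range(0, n - block_len + 1, block_len), m)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for s in starts:
        for i in range(s, s + block_len):
            x[i] += mu
    return x, sorted(starts), mu


# Very sparse setting of Section 6.1; the value r = 0.5 is purely illustrative.
x, starts, mu = simulate_blocks(10_000, 0.2, 0.65, r=0.5, rng=random.Random(0))
print(len(x), starts, mu)
```

Critical values and power would then be estimated by repeating this draw under $\mu=0$ and under the alternative, respectively.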

6.1 Simulation results for the very sparse case

We set $\alpha=0.2$ and $\beta=0.65$, so $\beta/(1-\alpha)\approx 0.813$. Power for the various methods is plotted in Figure 1 as a function of $r$ in the calibration (7). The plot shows that the penalized scan has the highest power, followed by sHC and sBJ. HC and BJ are nearly powerless even for large values of $r$. This simulation result confirms our conclusions from Section 5. sBJ has less power than sHC partly because the first few p-values in the appropriate level contain the most information in the very sparse regime and sHC effectively puts more weight on those than sBJ does; see Walther (2013) for an explanation of this phenomenon in the setting without structure.

6.2 Simulation results for the sparse case

We set $\alpha=0.2$ and $\beta=0.48$ , so $\beta/(1-\alpha)=0.6$ . Figure 2 shows that sHC and sBJ have much higher power than HC and BJ, as predicted by our theory. The penalized scan still does very well.

6.3 Simulation results for the dense case

We set $\alpha=0.3$ and $\beta=0.25$, so $\beta/(1-\alpha)\approx 0.357$. Since we are now in the dense regime, the scale for $r$ is with respect to the calibration (9). While all five methods are asymptotically optimal in this situation, Figure 3 shows that there is quite some spread in the performance in this finite sample setting. This reflects the observation in Walther (2013) that for these types of problems the asymptotics set in only slowly and that performance should be assessed by simulations. sBJ is the clear winner in this case. HC and sHC have the worst performance, which is the flip side of the effect described in Section 6.1, as the relevant information is now contained away from the smallest p-values. Moreover, we can see that the structured versions of HC and BJ are more powerful than their original counterparts, which indicates that sHC and sBJ can take some advantage of the structure in the signal even in the dense case.

7 The multivariate case

All of the previous results can be readily extended to a multivariate setting. We will use the superscript $(d)$ to denote the dimension. In order to keep the notation simple we will focus on the bivariate case which already contains all the relevant ideas. The model (6) then becomes

[TABLE]

where the $Z_{ij}$ are i.i.d. standard normal and $\boldsymbol{1}_{\cup_{g=1}^{m}I_{g}}(i,j)=1$ iff the grid point $(i,j)$ is contained in an axis-parallel rectangle $I_{g}$ for some $g\in\{1,\ldots,m\}$. Analogously to the univariate case we assume that the rectangles $I_{g}$ are mutually disjoint and randomly located on the Cartesian grid $\{1,\ldots,n\}^{2}$. The number of axis-parallel rectangles (blocks) is now parametrized by $m=n^{2(1-\alpha-\beta)}$ and each unknown rectangle $I_{g}$ contains $|I_{g}|=n^{2\alpha}$ grid points, where $0\leq\alpha<1$ and $0<\alpha+\beta\leq 1$.

The task is to test $H_{0}:\mu=0$ vs. $H_{1}:\mu=\mu(n)>0$. It was seen in the univariate case that the construction of an appropriate approximating set is critical for optimally aggregating the information within and across blocks. This univariate approximating set can be easily extended to the multivariate situation by taking cross-products: Recall that in the univariate case the approximation set $\mathcal{I}_{\mathrm{app}}(\ell)$ depends on a precision parameter $\epsilon_{\ell}$. We now make this dependence explicit by writing $\mathcal{I}_{\mathrm{app}}(\ell,\epsilon_{\ell})$ for this univariate collection. Now we construct a multivariate approximation set for axis-parallel rectangles in $\{1,\ldots,n\}^{d}$ via the cross-product of univariate approximation sets $\operatorname*{\scalerel*{\times}{\sum}}_{i=1}^{d}\mathcal{I}_{\mathrm{app}}(\ell_{i},\epsilon_{\ell})$, where the precision parameter $\epsilon_{\ell}$ depends on the volume of the rectangle but the $\ell_{i}$ may vary to allow various aspect ratios:

For each level $\ell=0,\ldots,\ell_{\mathrm{max}}:=\lceil\log_{2}(\frac{n}{8})^{d}\rceil$ we set

[TABLE]

with $\epsilon_{\ell}=\frac{1}{6\sqrt{\log_{2}\frac{n^{d}}{2^{\ell-1}}}}$. While this construction is somewhat different from that given in Walther (2010) for the density case, it enjoys similar properties; see Proposition 11.2 in Section 11. In particular, the cardinality of $\bigcup_{\ell}\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ is $O\Bigl(n^{d}(\log n)^{d}\Bigr)$, so relevant computation can be done in time that is almost linear in the number of observations $n^{d}$.

Now we can construct our test statistics exactly as in the univariate case: The structured higher criticism $sHC_{n}^{(d)}$ and structured Berk-Jones statistic $sBJ_{n}^{(d)}$ are defined as follows:

[TABLE]

where $HC_{n_{\ell}}(\ell)$ and $BJ_{n_{\ell}}(\ell)$ denote the one-sided higher criticism (3) and Berk-Jones statistic (4) evaluated on the $n_{\ell}:=\#\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ p-values pertaining to $\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$, i.e. the p-values $\{\bar{\Phi}(\sum_{(i,j)\in I}X_{ij}/\sqrt{|I|}),I\in\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)\}$.
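The rectangle p-values $\bar{\Phi}(\sum_{(i,j)\in I}X_{ij}/\sqrt{|I|})$ can each be evaluated in constant time after one pass over the data. The sketch below illustrates this with a table of two-dimensional cumulative sums; this device is our choice for illustration and is not prescribed by the text. Rectangles are indexed as half-open, 0-based ranges, and all function names are hypothetical.

```python
import math


def normal_sf(z):
    """Standard normal survival function, written via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def prefix_sums(x):
    """S[i][j] = sum of x[a][b] over all a < i and b < j."""
    rows, cols = len(x), len(x[0])
    S = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            S[i + 1][j + 1] = x[i][j] + S[i][j + 1] + S[i + 1][j] - S[i][j]
    return S


def rectangle_pvalue(S, i0, i1, j0, j1):
    """p-value of the rectangle with rows i0..i1-1 and columns j0..j1-1."""
    total = S[i1][j1] - S[i0][j1] - S[i1][j0] + S[i0][j0]
    area = (i1 - i0) * (j1 - j0)
    return normal_sf(total / math.sqrt(area))


# A 4 x 4 array of ones: the 2 x 2 corner has sum 4 and area 4, so z = 2.
S = prefix_sums([[1.0] * 4 for _ in range(4)])
print(rectangle_pvalue(S, 0, 2, 0, 2))
```

With the table in place, the cost of evaluating all p-values at a given level is proportional to the number of rectangles in the approximating set at that level.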

Note that the definition of these structured statistics differs from the univariate case only in the rescaling factor $\frac{n^{d}}{2^{\ell}}$ in place of $\frac{n}{2^{\ell}}$ . This is due to the fact that we now have an array of $n^{d}$ observations rather than $n$ . Thus there are now about $\frac{n^{d}}{2^{\ell}}$ disjoint intervals in $\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ , hence about $\frac{n^{d}}{2^{\ell}}$ of the p-values are independent.

We now focus on the bivariate case and establish the null distribution of these statistics:

Theorem 7.1.

Under the null hypothesis $\mu=0$ ,

[TABLE]

The lower bound for detection in the sparse case $\frac{\beta}{1-\alpha}>\frac{1}{2}$ is the same as in the univariate setting after accounting for the sample size $n^{2}$ in place of $n$ , and $sHC_{n}^{(2)}$ and $sBJ_{n}^{(2)}$ are asymptotically optimal for detection:

Theorem 7.2.

The conclusions of Theorems 2.1(i) and 4.1(i) continue to hold for the model (14) with $\frac{\beta}{1-\alpha}>\frac{1}{2}$ and the calibration

[TABLE]

That is, if $r<\rho^{*}(\alpha,\beta)$ , where $\rho^{*}$ is given in (8), then $H_{0}$ and $H_{1}$ merge asymptotically, i.e. the sum of Type I and Type II errors tends to 1 for any test. If $r>\rho^{*}(\alpha,\beta)$ , then $sHC_{n}^{(2)}$ and $sBJ_{n}^{(2)}$ have asymptotic power 1.

8 Clusters in a network on the square lattice

This section concerns the problem of detecting whether in a given network, e.g. in a network of sensors, there are clusters of nodes that exhibit an “unusual behavior”. This setting is important for a number of applications, e.g. in surveillance, environmental monitoring and disease outbreak detection; see Arias-Castro et al. (2011), who treat the case of detecting a single (or a small number of) clusters in a network.

Here we show how the evidence of such unusual behavior can be aggregated over many such clusters. We follow Arias-Castro et al. (2011) and model the network with the $d$-dimensional square lattice. For simplicity we will derive our results for the case $d=2$, which already contains all the essential ideas. We are interested in the case where the signal is present on graph neighborhoods of vertices, which we model as open balls $B_{r}(x)$ with center $x\in\{1,\ldots,n\}^{2}$ and radius $r$. The results in this section hold for balls with respect to the $\ell^{1}$-norm, which corresponds to the shortest-path distance in a graph, as well as the Euclidean norm. We derive our results for the latter as this is the technically more demanding case, see Lemma 11.5. Our model is therefore

[TABLE]

where the $Z_{ij}$ are i.i.d. standard normal and each graph neighborhood $N_{g}$ is a ball with respect to the $\ell^{2}$ -norm (or the $\ell^{1}$ -norm) that contains $|N_{g}|=n^{2\alpha}$ grid points, where $0\leq\alpha<1$ . As before, we assume that the $N_{g}$ are mutually disjoint and randomly located on the Cartesian grid $\{1,\ldots,n\}^{2}$ and the number of balls is parametrized by $m=n^{2(1-\alpha-\beta)}$ .

The task is to test $H_{0}:\ \mu=0$ vs. $H_{1}:\ \mu=\mu(n)>0$ . In order to apply the general recipe of this paper for optimally aggregating the information within and across neighborhoods, we need to construct an appropriate approximating set for the neighborhoods. The idea for this construction can be adapted from the previous settings, which shows the generality of this approach:

We approximate balls with volume in $(\pi 2^{\ell-1},\pi 2^{\ell}]$ , where $\ell=0,\ldots,\ell_{\mathrm{max}}=\lceil\log_{2}\frac{n^{2}}{8}\rceil$ , with the collection

[TABLE]

where $\epsilon_{\ell}:=\frac{1}{\sqrt{\log_{2}\frac{n^{2}}{2^{\ell-1}}}}$ , $d_{\ell}=\lceil\epsilon_{\ell}2^{\frac{\ell-1}{2}}\rceil$ . That is, we approximate the centers with a grid whose spacing is a fraction $\epsilon_{\ell}$ of the square root of the approximate volume of the ball, $2^{\ell-1}$ , and we approximate the square radius with a geometric progression. Proposition 8.1 shows that the balls in $\bigcup_{\ell}\mathcal{C}_{\mathrm{app}}(\ell)$ can approximate every ball with small relative error, while the cardinality of $\bigcup_{\ell}\mathcal{C}_{\mathrm{app}}(\ell)$ is almost linear in the sample size $n^{2}$ :

Proposition 8.1.

(i)

$\#\bigcup_{\ell}\mathcal{C}_{\mathrm{app}}(\ell)\ =\ O\Bigl(n^{2}(\log n)^{\frac{3}{2}}\Bigr)$

(ii)

For every ball $B_{R}(s,t)$ with $R^{2}\in[1,\frac{n^{2}}{8}]$ and $s,t\in[R,n-R+1]$ there exists $B_{r}(j,k)\in\bigcup_{\ell}\mathcal{C}_{\mathrm{app}}(\ell)$ such that

[TABLE]

Furthermore, it will be shown in the proof of Theorem 8.2 that the balls in $\mathcal{C}_{\mathrm{app}}(\ell)$ can be grouped into at most $8(\log n)^{1/2}$ groups such that each group contains $\sim\frac{n^{2}}{2^{\ell+2}}$ mutually disjoint balls. This allows us to define the structured higher criticism and Berk-Jones statistics as in Section 7, where as before $HC_{n_{\ell}}(\ell)$ and $BJ_{n_{\ell}}(\ell)$ denote the one-sided higher criticism (3) and Berk-Jones statistic (4) evaluated on the $n_{\ell}:=\#\mathcal{C}_{\mathrm{app}}(\ell)$ p-values pertaining to $\mathcal{C}_{\mathrm{app}}(\ell)$, i.e. the p-values $\{\bar{\Phi}(\sum_{(i,j)\in I}X_{ij}/\sqrt{|I|}),I\in\mathcal{C}_{\mathrm{app}}(\ell)\}$. As a consequence we obtain results for the null distributions of these statistics and optimality properties that are analogous to those for univariate and multivariate rectangles:

Theorem 8.2.

Under the null hypothesis $\mu=0$ there exists $C>0$ such that

[TABLE]

Moreover, the conclusions of Theorem 7.2 continue to hold for model (15).

9 Composite alternatives

We developed the above theory for the one-sided alternative $H_{1}:\mu=\mu(n)>0$ in the models (6), (14) and (15). A more general alternative obtains by allowing the sign of $\mu$ to change from block to block. Then model (6) becomes

[TABLE]

where $s_{g}\in\{1,2\}$. Such alternatives can be tested by applying the structured test statistic to the p-values from two-sided z-tests, as suggested by Delaigle and Hall (2009) in the context of the higher criticism statistic. That is, one employs the two-sided p-values $2\bar{\Phi}(|{\boldsymbol{X}}(I)|)$ in place of $\bar{\Phi}({\boldsymbol{X}}(I))$. This test procedure satisfies the same optimality results that we derived for the one-sided case since the proofs of these results depend on establishing certain polynomial growth rates up to logarithmic factors, while the use of two-sided p-values affects these rates only by a factor of 2. Likewise, the optimality results for the multivariate and network settings (14) and (15) continue to hold for two-sided p-values.
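The switch from one-sided to two-sided p-values amounts to a one-line change, as the following minimal sketch shows. Here `z` stands for the standardized statistic ${\boldsymbol{X}}(I)$ of an interval, and the function names are hypothetical.

```python
import math


def one_sided_pvalue(z):
    """bar-Phi(z): small only when z is large and positive."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def two_sided_pvalue(z):
    """2 * bar-Phi(|z|): small when z is large in absolute value."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# A block with a negative mean is invisible to the one-sided p-value
# but is picked up by the two-sided one.
print(one_sided_pvalue(-3.0), two_sided_pvalue(-3.0))
```

The remaining steps of the structured statistics are unchanged; only the p-values fed into them differ.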

An alternative approach is to base the analysis not on scaled averages ${\boldsymbol{X}}(I)$ but on mean squares. The technical analysis of the resulting structured test statistic will be somewhat different since it involves the tails of noncentral chi-squared distributions rather than Gaussians. In the context of the sparse signal model, Donoho and Jin (2004) find that the higher criticism statistic based on data from a chi-squared distribution achieves the same optimal detection region as in the Gaussian case.

10 Discussion

In this paper, we established the lower bound for detection in the multiple blocks model. An asymptotically optimal method is also proposed which is adaptive to the unknown number of blocks and to the unknown block length. It was shown how this methodology can be readily extended to the multivariate situation and to detecting clusters in a network.

Another interesting problem for future research is the identification version of this problem, in which we not only want to detect whether a signal is present, but we also want to approximately find the location of all blocks of signals. In Kou (2021) it is shown that when there is only one block of signals (corresponding to $\alpha+\beta=1$), then the identification and the detection problem are of the same difficulty. However, in the more general case where $\alpha+\beta<1$ some calculations show that the identification problem is necessarily more difficult than the detection problem, in the sense that the lower bound for the former is larger. To the best of our knowledge, an adaptively optimal method is not yet known for the corresponding multiple blocks identification problem. We leave this as an open problem for future research.

11 Proofs

11.1 Some basic results

It is helpful to analyze the statistical behavior of the test statistics via tail probabilities. To this end, note that since $\bar{\Phi}$ is strictly decreasing we have the following representation for distinct real numbers $X_{1},\ldots,X_{n}$ and $p_{i}:=\bar{\Phi}(X_{i})$ :

[TABLE]

The following Lemma summarizes important properties of the univariate approximating set $\mathcal{I}_{\mathrm{app}}(\ell)$ while Proposition 11.2 gives some relevant results for the multivariate version $\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ in dimensions $d\geq 1$ . Further results for the bivariate case can be found in Lemma 11.4.

Lemma 11.1.

The intervals in $\mathcal{I}_{\mathrm{app}}(\ell)$ can be grouped into at most $\min(2^{2\ell},4\epsilon_{\ell}^{-2})\leq 144\log_{2}n$ groups such that each group consists of either $\lfloor\frac{n}{L_{\ell}}\rfloor$ or $\lfloor\frac{n}{L_{\ell}}\rfloor-1\geq\lfloor\frac{n}{2^{\ell}}\rfloor-1$ disjoint intervals, where $L_{\ell}$ is the largest multiple of $d_{\ell}$ that is not larger than $2^{\ell}$ . Further, $\#\mathcal{I}_{\mathrm{app}}(\ell)\leq n2^{-\ell}\min(2^{2\ell},4\epsilon_{\ell}^{-2})\leq 144n2^{-\ell}\log_{2}n$ .

Proof of Lemma 11.1.

Let $S_{\ell}$ be the collection of all intervals in $\mathcal{I}_{\mathrm{app}}(\ell)$ whose left endpoint is smaller than $L_{\ell}$ , where $L_{\ell}$ is the largest multiple of $d_{\ell}$ that is not larger than $2^{\ell}$ . For a given $I\in S_{\ell}$ consider the collection of shifts of $I$ by multiples of $L_{\ell}$ : $\mbox{shift}_{\ell}(I):=\Bigl{\{}J\subset(0,n]:\,J=kL_{\ell}+I,\ k=0,1,2,\ldots\Bigr{\}}$ . Since $I\in\mathcal{I}_{\mathrm{app}}(\ell)$ implies $|I|\leq L_{\ell}$ , the intervals in $\mbox{shift}_{\ell}(I)$ are disjoint and there are either $\lfloor\frac{n}{L_{\ell}}\rfloor$ or $\lfloor\frac{n}{L_{\ell}}\rfloor-1\geq\lfloor\frac{n}{2^{\ell}}\rfloor-1$ intervals in $\mbox{shift}_{\ell}(I)$ . One readily observes that each interval $I\in\mathcal{I}_{\mathrm{app}}(\ell)$ can be generated by such a shift:

[TABLE]

Finally, there are exactly $\frac{L_{\ell}}{d_{\ell}}$ different starting points for intervals $I\in S_{\ell}$ . Since each such interval $I$ satisfies $2^{\ell-1}<|I|\leq 2^{\ell}$ we obtain for $\ell\geq 1$ :

[TABLE]

and the same bound holds for $\ell=0$ . As for the upper bound on $\#\mathcal{I}_{\mathrm{app}}(\ell)$ , an analogous counting argument shows that there are not more than $\frac{n}{d_{\ell}}$ starting points, each having not more than $\lceil\frac{2^{\ell}-2^{\ell-1}}{d_{\ell}}\rceil\leq\frac{2^{\ell}}{d_{\ell}}$ endpoints if $\ell\geq 1$ . Hence $\#\mathcal{I}_{\mathrm{app}}(\ell)\leq\frac{n2^{\ell}}{d_{\ell}^{2}}$ and the claimed bound follows from the above inequality; the same bound clearly also holds for $\ell=0$ . ∎
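The grouping argument of Lemma 11.1 can be checked numerically on a small example. The definition of $\mathcal{I}_{\mathrm{app}}(\ell)$ is given earlier in the paper and is not reproduced in this section, so the sketch below rests on an assumed form: intervals $(j,k]$ whose endpoints are multiples of $d_{\ell}=\lceil\epsilon_{\ell}2^{\ell-1}\rceil$ with $2^{\ell-1}<k-j\leq 2^{\ell}$ and $\epsilon_{\ell}=1/(6\sqrt{\log_{2}(n/2^{\ell-1})})$. Under this assumption, intervals are grouped by their starting point modulo $L_{\ell}$ together with their length, which mirrors the shifts by multiples of $L_{\ell}$ used in the proof. Function names are hypothetical.

```python
import math
from collections import defaultdict


def approx_set(n, level):
    """Assumed form of I_app(level): intervals (j, k] on a grid of spacing d."""
    eps = 1.0 / (6.0 * math.sqrt(math.log2(n / 2 ** (level - 1))))
    d = max(1, math.ceil(eps * 2 ** (level - 1)))
    lo, hi = 2 ** (level - 1), 2 ** level
    intervals = [(j, k)
                 for j in range(0, n, d)
                 for k in range(j + d, min(n, j + hi) + 1, d)
                 if k - j > lo]
    return intervals, d


def group_by_shift(intervals, d, level):
    """Group intervals that are shifts of one another by multiples of L."""
    L = (2 ** level // d) * d  # largest multiple of d not exceeding 2**level
    groups = defaultdict(list)
    for j, k in intervals:
        groups[(j % L, k - j)].append((j, k))
    return groups, L


n, level = 1024, 5
intervals, d = approx_set(n, level)
groups, L = group_by_shift(intervals, d, level)
sizes = sorted({len(g) for g in groups.values()})
print(d, L, len(groups), sizes, n // L)
```

In this example every group consists of disjoint intervals, since two members of a group have the same length, which is at most $L_{\ell}$, and their starting points differ by a multiple of $L_{\ell}$.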

Proposition 11.2.

(i)

$\#\bigcup_{\ell}\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)\ =\ O\Bigl(n^{d}(\log n)^{d}\Bigr)$

(ii)

For every axis-parallel rectangle $R\subset\{1,\ldots,n\}^{d}$ with sides not longer than $\frac{n}{8}$ there exists $\tilde{R}\in\bigcup_{\ell}\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ such that $|R\triangle\tilde{R}|\leq C_{d}\frac{|R|}{\sqrt{\log_{2}\frac{n^{d}}{|R|}}}$ for some universal constant $C_{d}$ .

(iii)

The definition of $\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ implies the constraint $\ell\leq\sum_{i=1}^{d}\ell_{i}\leq\ell+d-1$ for the marginal levels $\ell_{i}$ .

Employing the latter constraint is helpful for efficiently enumerating the rectangles in $\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ , e.g. in simulations. As an aside, if one modifies the definition of $\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ to let the $\ell_{i}$ be as large as $\lceil\log_{2}n\rceil$ and $\ell$ as large as $\log_{2}(n^{d}/8)$ , then $\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ will also contain approximating rectangles for all marginal distributions.

Proof of Proposition 11.2.

Lemma 11.1 gives $\#\mathcal{I}_{\mathrm{app}}(\ell_{i},\epsilon_{\ell})\leq n2^{-\ell_{i}}\left(4\epsilon_{\ell}^{-2}\right)\leq 144n2^{-\ell_{i}}\log_{2}n^{d}$. Hence $\#\bigcup_{\ell_{i}=0}^{\lceil\log_{2}\frac{n}{8}\rceil}\mathcal{I}_{\mathrm{app}}(\ell_{i},\epsilon_{\ell})\leq 288dn\log_{2}n$ and so

[TABLE]

proving (i).

As for (ii), let $R=I_{1}\times\ldots\times I_{d}$ be an axis-parallel rectangle, so each $I_{i}$ is an interval of the form $(j_{i},k_{i}]\subset(0,n]$ with length at most $n/8$. Hence there exists $\ell_{i}\in\{0,\ldots,\lceil\log_{2}\frac{n}{8}\rceil\}$ such that $2^{\ell_{i}-1}<|I_{i}|\leq 2^{\ell_{i}}$, and there exists $\ell\in\{0,\ldots,\lceil\log_{2}(\frac{n}{8})^{d}\rceil\}$ such that $2^{\ell-1}<|R|\leq 2^{\ell}$. So by the definition of $\mathcal{I}_{\mathrm{app}}(\ell_{i},\epsilon_{\ell})$, there exists $\tilde{I}_{i}\in\mathcal{I}_{\mathrm{app}}(\ell_{i},\epsilon_{\ell})$ with $|I_{i}\triangle\tilde{I}_{i}|\leq 2d_{\ell_{i}}\leq 2\epsilon_{\ell}|I_{i}|$ for $i=1,\ldots,d$. Set $\tilde{R}:=\tilde{I}_{1}\times\ldots\times\tilde{I}_{d}$. Then, by decomposing $R\triangle\tilde{R}$ and collecting terms, we get $|R\triangle\tilde{R}|\leq C_{d}\epsilon_{\ell}|R|$ for a constant $C_{d}$. Finally, $\epsilon_{\ell}=\frac{1}{6\sqrt{\log_{2}\frac{n^{d}}{2^{\ell-1}}}}\leq\frac{1}{6\sqrt{\log_{2}\frac{n^{d}}{|R|}}}$. (If we arrange $\tilde{I}_{i}\subset I_{i}$ for all $i$ by modifying the definition of $\mathcal{I}_{\mathrm{app}}$ somewhat, then clearly $|R\triangle\tilde{R}|\leq 2d\epsilon_{\ell}|R|$.)

Concerning (iii), $R\in\mathcal{I}_{\mathrm{app}}^{(d)}(\ell)$ implies $R=I_{1}\times\ldots\times I_{d}\in\operatorname*{\scalerel*{\times}{\sum}}_{i=1}^{d}\mathcal{I}_{\mathrm{app}}(\ell_{i},\epsilon_{\ell})$ . So $2^{\ell_{i}-1}<|I_{i}|\leq 2^{\ell_{i}}$ and $2^{\ell-1}<\prod_{i=1}^{d}|I_{i}|\leq 2^{\ell}$ , hence $\ell-1<\sum_{i}\ell_{i}$ and $\sum_{i}(\ell_{i}-1)<\ell$ . ∎

11.2 Proofs for Section 2

Proof of Theorem 2.1.

(i) We may assume without loss of generality that $\frac{n}{n^{\alpha}}$ is an integer. Denote by (A) the submodel where the signals can only start and end on a grid given by $\{in^{\alpha}+1,\ldots,(i+1)n^{\alpha}\}$ for $i=0,\ldots,\frac{n}{n^{\alpha}}-1$. It is enough to show that Theorem 2.1 holds for this submodel (A) since detection in the submodel is not more difficult than in the original model and hence the detection boundary for the submodel cannot be larger than for the original model.

Let $S_{i}:=\sum_{j=1}^{n^{\alpha}}X_{(i-1)n^{\alpha}+j}/\sqrt{n^{\alpha}}=:s_{i}+Z_{i}^{\prime}$ for $i=1,\ldots,n^{\prime}$, where $n^{\prime}=\frac{n}{n^{\alpha}}=n^{1-\alpha}$. Then $Z_{i}^{\prime}\stackrel{iid}{\sim}N(0,1)$, and $s_{i}=0$ for all but $n^{1-\alpha-\beta}$ locations, while at these locations $s_{i}=\sqrt{2r\log n}=\sqrt{2r^{\prime}\log n^{\prime}}$, where $r^{\prime}=\frac{r}{1-\alpha}$. The locations of these elevated means are a random sample without replacement from $\{1,\ldots,n^{\prime}\}$. We denote this model by (B). (B) is in fact a sparse signal model (1) with sample size $n^{\prime}=n^{1-\alpha}$ and sparsity level $\beta^{\prime}=\frac{\beta}{1-\alpha}>\frac{1}{2}$. It was proved in Ingster (1997, 1998) and Ingster and Suslina (2002), see also Section 1.1 in Donoho and Jin (2004), that the lower bound for model (B) is given by (2) with $\beta^{\prime}$ in place of $\beta$. (In some of these proofs the number of elevated means follows a binomial distribution, but the result is regularly referenced for the setting (1) where that number is fixed, see e.g. Hall and Jin (2010) and Zhong et al. (2013). Theorem 4 in Ingster (1997) shows that the result does indeed carry over to the setting (1), and the proof in Ingster and Suslina (2002) is explicitly for this setting.)

Writing $\mu^{\prime}:=\sqrt{2r\log n}=\mu\sqrt{n^{\alpha}}$ and observing

[TABLE]

shows that the likelihood ratio test has the same test result on model (A) and (B). Therefore, written in our original notation $\alpha,\beta,r$ , the lower bound for model (A) gives (8).

The proof of (ii) is analogous to (i), but now $\beta^{\prime}=\frac{\beta}{1-\alpha}<\frac{1}{2}$ and the elevated means are $s_{i}=n^{-r}=(n^{\prime})^{-r^{\prime}}$. The lower bound (5) established in Cai et al. (2011, Theorem 3) translates into (10). ∎
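The reduction in part (i) of the proof can be turned into a small calculator for the sparse detection boundary. Model (B) is a sparse signal model with parameters $\beta^{\prime}=\frac{\beta}{1-\alpha}$ and $r^{\prime}=\frac{r}{1-\alpha}$, so the condition $r^{\prime}>\rho^{*}(\beta^{\prime})$ translates into $r>(1-\alpha)\rho^{*}(\frac{\beta}{1-\alpha})$. The sketch below evaluates this expression. As an assumption, it uses the classical form of the boundary (2), namely $\beta-\frac{1}{2}$ for $\frac{1}{2}<\beta\leq\frac{3}{4}$ and $(1-\sqrt{1-\beta})^{2}$ for $\frac{3}{4}<\beta\leq 1$. The function names are hypothetical and the output should be checked against (8).

```python
import math


def rho_star_sparse(beta):
    """Assumed classical form of the boundary (2), valid for 1/2 < beta <= 1."""
    if not 0.5 < beta <= 1.0:
        raise ValueError("sparse regime requires 1/2 < beta <= 1")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2


def rho_star_blocks(alpha, beta):
    """Boundary for the multiple blocks model via r' = r / (1 - alpha)."""
    beta_prime = min(1.0, beta / (1.0 - alpha))  # guard against rounding above 1
    return (1.0 - alpha) * rho_star_sparse(beta_prime)


print(rho_star_blocks(0.0, 0.6))    # alpha = 0: the sparse signal model (1)
print(rho_star_blocks(0.2, 0.48))   # sparse setting of Section 6.2
print(rho_star_blocks(0.4, 0.6))    # single block: alpha = 1 - beta
```

For $\alpha=0$ the function reduces to the boundary (2), and for $\alpha=1-\beta$ it returns $\beta$, which is consistent with the two special cases treated in Corollaries 4.2 and 4.3.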

11.3 Proofs for Section 3

Proof of Theorem 3.1.

To prove the theorem, we fix $\ell$ and apply Lemma 11.1. Write $G_{i}$ , $i=1,\ldots,i_{\mathrm{max}}$ , for the set of p-values pertaining to the intervals in the $i$ th group given by Lemma 11.1. As those intervals are disjoint, these p-values are i.i.d. $U[0,1]$ under $H_{0}$ . Further $\sum_{i=1}^{i_{\mathrm{max}}}\#G_{i}=n_{\ell}$ . Denote by $F^{(i)}$ the empirical cdf of the p-values in $G_{i}$ . Then we can write the empirical cdf of all $n_{\ell}$ p-values as $F_{n_{\ell}}=\sum_{i=1}^{i_{\mathrm{max}}}\frac{\#G_{i}}{n_{\ell}}F^{(i)}$ . Recall that $BJ_{n_{\ell}}(\ell)$ is defined by (4) evaluated at these $n_{\ell}$ p-values. Hence

[TABLE]

Since the function $(s,t)\rightarrow s\log\frac{s}{t}+(1-s)\log\frac{1-s}{1-t}$ is convex on $(0,1)^{2}$ , Jensen’s inequality gives

[TABLE]

The last inequality is conservative as we bound the weighted average of $i_{\mathrm{max}}$ Berk-Jones statistics by the worst case; obtaining a better bound is not straightforward as the Berk-Jones statistics are dependent. Setting $A:=p_{(1)}$ and $B:=p_{(n_{\ell})}$ in the proof of the third inequality of Proposition 3.2 shows that for every $\eta>0$, $K>1$, and for every group $i$:

[TABLE]

since Lemma 11.1 gives $\#G_{i}\leq n$ and $n_{\ell}\leq 144n\log_{2}n$ . Further Lemma 11.1 gives $\lfloor\frac{n}{2^{\ell}}\rfloor\leq\#G_{i}+1$ for all $i$ . For simplicity of exposition we will use $\frac{n}{2^{\ell}}\leq\#G_{i}$ (the remainder of the proof can be readily adapted to the weaker condition with standard arguments). Applying the union bound first over $i\leq i_{\mathrm{max}}$ (and noting $i_{\mathrm{max}}\leq 144\log_{2}n$ by Lemma 11.1) and then over $\ell\leq\ell_{\mathrm{max}}$ gives for $\eta=c\log\log n$ :

[TABLE]

Since $\ell_{\mathrm{max}}\leq\log_{2}n$ this bound will converge to 0 for $c>3$ and $K>1$ , proving the claim for $sBJ_{n}$ . Concerning $sHC_{n}$ , as in (19) we get

[TABLE]

Using $\frac{n}{2^{\ell}}\leq\#G_{i}$ for all $i$ , the last inequality of Proposition 3.2, and applying the union bound over $i\leq i_{\mathrm{max}}$ (noting $i_{\mathrm{max}}\leq 144\log_{2}n$ ) and $\ell\leq\ell_{\mathrm{max}}$ then gives for $\eta=B\log^{2}n$ with $B\geq 1$ :

[TABLE]

for some $C$ not depending on $B$ . The claim follows as $\ell_{\mathrm{max}}\leq\log_{2}n$ . ∎
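The two structural steps of this proof, the mixture representation $F_{n_{\ell}}=\sum_{i}\frac{\#G_{i}}{n_{\ell}}F^{(i)}$ and the Jensen bound by the worst group, can be checked numerically. The following is a minimal sketch only: the group sizes, the evaluation points and the helper names `kl_bernoulli` and `ecdf` are hypothetical choices made for illustration, and the groups are simulated as independent uniform p-values.

```python
import math
import random


def kl_bernoulli(s, t):
    """K(s, t) = s log(s/t) + (1-s) log((1-s)/(1-t)), with 0 log 0 := 0."""
    out = 0.0
    if s > 0:
        out += s * math.log(s / t)
    if s < 1:
        out += (1 - s) * math.log((1 - s) / (1 - t))
    return out


def ecdf(sample, t):
    """Empirical cdf of `sample` evaluated at t."""
    return sum(1 for p in sample if p <= t) / len(sample)


random.seed(0)
# Hypothetical groups G_1, ..., G_4 of null p-values with unequal sizes.
groups = [[random.random() for _ in range(size)] for size in (50, 60, 55, 45)]
pooled = [p for g in groups for p in g]
n_l = len(pooled)
weights = [len(g) / n_l for g in groups]

checks = []
for t in (0.05, 0.1, 0.25, 0.5, 0.75, 0.9):
    group_vals = [ecdf(g, t) for g in groups]
    mixture = sum(w * f for w, f in zip(weights, group_vals))
    pooled_kl = kl_bernoulli(ecdf(pooled, t), t)
    avg_kl = sum(w * kl_bernoulli(f, t) for w, f in zip(weights, group_vals))
    max_kl = max(kl_bernoulli(f, t) for f in group_vals)
    checks.append((
        abs(mixture - ecdf(pooled, t)) < 1e-12,  # mixture representation
        pooled_kl <= avg_kl + 1e-12,             # Jensen (convexity of K)
        avg_kl <= max_kl + 1e-12,                # weighted average <= worst case
    ))
```

Each entry of `checks` records whether the three relations hold at one value of $t$; the small tolerance only absorbs floating-point rounding.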

Proof of Proposition 3.2.

It is a well-known fact that $B(t)=(1+t)U(t/(1+t))$ is a standard Brownian motion, for which Itô and McKean (1965, p. 34) establish the following inequality:

[TABLE]

for $0<a<b\leq 1$ and $f(t)$ increasing on $(0,b]$ . Setting $f(t)=\eta\sqrt{t}$ we obtain

[TABLE]

where we used Mills' ratio to bound the normal tail in the fourth line.
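The time change $B(t)=(1+t)U(t/(1+t))$ used at the start of this proof can be sanity-checked through covariances: $B$ is a centered Gaussian process, so it suffices that $\mathrm{Cov}(B(s),B(t))=\min(s,t)$. The sketch below assumes that $U$ is a standard Brownian bridge with covariance $\min(u,v)-uv$ and verifies the identity exactly in rational arithmetic; the function names are illustrative.

```python
from fractions import Fraction


def bridge_cov(u, v):
    """Covariance of a standard Brownian bridge U at times u, v in [0, 1]."""
    return min(u, v) - u * v


def transformed_cov(s, t):
    """Covariance of B(s) and B(t) for B(t) = (1 + t) U(t / (1 + t))."""
    return (1 + s) * (1 + t) * bridge_cov(s / (1 + s), t / (1 + t))


# Exact check on a rational grid of time points in [0, 29/7].
times = [Fraction(k, 7) for k in range(30)]
all_match = all(
    transformed_cov(s, t) == min(s, t) for s in times for t in times
)
```

For $s\leq t$ the computation reduces to $(1+s)(1+t)\cdot\frac{s}{(1+s)(1+t)}=s$, which is what the grid check confirms.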

As for the second inequality, Lemma 3.1 in Duembgen and Wellner (2014) gives for real $u$ and $c>0$:

[TABLE]

where $l(u):=\frac{e^{u}}{1+e^{u}}$ . Set $u=\log\frac{a}{1-a}$ and $c=\log\frac{b(1-a)}{a(1-b)}>0$ , so $a=l(u)$ , $b=l(u+c)$ . Hence for any positive integer $K$ :

[TABLE]

With a view towards minimizing this expression we set $K:=\lceil c\eta\rceil$ . Then the above expression is not larger than

[TABLE]
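The reparametrization used for the second inequality maps the endpoints $a<b$ to the logit scale via $u=\log\frac{a}{1-a}$ and $c=\log\frac{b(1-a)}{a(1-b)}$, and the number of pieces is then taken as $K=\lceil c\eta\rceil$. A small sketch of this bookkeeping follows; the numerical values of $a$, $b$ and $\eta$ are arbitrary illustrative choices.

```python
import math


def logistic(u):
    """l(u) = e^u / (1 + e^u)."""
    return math.exp(u) / (1.0 + math.exp(u))


def logit_params(a, b):
    """Return (u, c) with a = l(u) and b = l(u + c), for 0 < a < b < 1."""
    u = math.log(a / (1.0 - a))
    c = math.log(b * (1.0 - a) / (a * (1.0 - b)))
    return u, c


a, b, eta = 0.01, 0.3, 5.0        # illustrative values only
u, c = logit_params(a, b)
recovered = (logistic(u), logistic(u + c))
K = math.ceil(c * eta)            # number of pieces chosen in the proof
```

Here `recovered` reproduces $(a,b)$ up to rounding, which confirms $a=l(u)$ and $b=l(u+c)$.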

As for the third inequality, elementary considerations show

[TABLE]

where $A=U_{(1)},B=U_{(n)}$ . For later reference it is convenient to prove the inequality for the latter statistic, i.e. the two-sided version of the Berk-Jones statistic that is based on all $n$ p-values rather than a fraction of them, and with general random limits $0\leq A<B\leq 1$ for $t$ . For ease of notation let $K>1$ be such that $K\log_{2}n$ is an integer. We will use the partition $[\frac{1}{n^{K}},\frac{1}{2}]=\bigcup_{i=1}^{K\log_{2}n-1}[(\frac{1}{2})^{i+1},(\frac{1}{2})^{i}]$ . Note that for each set in this partition we can apply the second inequality of the Proposition with the same exponential tail bound as the ratio of the right to the left endpoint is 2, hence $\log\frac{b(1-a)}{a(1-b)}\leq\log 4$ as $b\leq\frac{1}{2}$ . We can proceed analogously on $[\frac{1}{2},1-\frac{1}{n^{K}}]$ as the distribution of the statistic is symmetric about $\frac{1}{2}$ . Applying the union bound to the resulting partition of $[\frac{1}{n^{K}},1-\frac{1}{n^{K}}]$ gives

[TABLE]

For $A=U_{(1)},B=U_{(n)}$ the latter probability is not larger than $2n^{1-K}$ , proving the claim for $BJ_{n}$ .
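The dyadic partition used above can be written out explicitly to confirm that every cell has log odds ratio at most $\log 4$ and that the left end of the last cell is $n^{-K}$. This is a sketch under the assumption that $K\log_{2}n$ is an integer, as in the proof; the values $n=1024$ and $K=2$ are illustrative.

```python
import math


def dyadic_cells(n, K):
    """Cells [(1/2)^(i+1), (1/2)^i], i = 1, ..., K*log2(n) - 1, covering [n^-K, 1/2]."""
    m = K * math.log2(n)
    if abs(m - round(m)) > 1e-9:
        raise ValueError("K * log2(n) must be an integer")
    m = int(round(m))
    return [(0.5 ** (i + 1), 0.5 ** i) for i in range(1, m)]


def log_odds_ratio(a, b):
    """log( b(1-a) / (a(1-b)) ) for 0 < a < b < 1."""
    return math.log(b * (1 - a) / (a * (1 - b)))


n, K = 1024, 2
cells = dyadic_cells(n, K)
worst = max(log_odds_ratio(a, b) for a, b in cells)
# The union bound runs over the cells on [n^-K, 1/2] and their reflections.
num_union_terms = 2 * len(cells)
```

The largest log odds ratio is attained on the cell $[\frac{1}{4},\frac{1}{2}]$, so `worst` stays below $\log 4$.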

Finally, elementary considerations show

[TABLE]

Shorack and Wellner (1986, pp. 601–603) analyze $\sup_{t\in(0,1)}Z_{n}(t)$, where $Z_{n}(t)=\sqrt{n}\frac{F_{n}(t)-t}{\sqrt{t(1-t)}}$, by splitting $(0,1)$ into $[0,\frac{1}{n}]$, $[\frac{1}{n},d_{n}]$, $[d_{n},\frac{1}{2}]$ (and their reflections about $\frac{1}{2}$), where $d_{n}=\frac{\log^{5}n}{n}$. The inequality they use for the first interval gives

[TABLE]

while the Shorack and Wellner inequality gives for the second interval

[TABLE]

On the interval $[d_{n},\frac{1}{2}]$ one can use exponential inequalities for the Hungarian construction, see (Shorack and Wellner, 1986, ch. 12.1), as well as for $\frac{U(t)}{\sqrt{t(1-t)}}$, see above. The first shows that $\sup_{t\in[d_{n},\frac{1}{2}]}|Z_{n}(t)-\frac{U(t)}{\sqrt{t(1-t)}}|$ satisfies the claimed tail bound whenever $\eta$ exceeds a certain constant, while the second gives the tail bound

[TABLE]

∎

11.4 Proofs for Section 4

Proof of Theorem 4.1.

(i) We first prove optimality for $sHC_{n}$ and then derive the conclusion for the other statistics from this result.

We will show that if $r>\rho^{*}(\alpha,\beta)$ , then $sHC_{n}=\Omega_{p}(n^{\xi})$ for some $\xi>0$ . Then the claim about $sHC_{n}$ follows with the result about the null distribution given in Theorem 3.1.

Let $\ell^{*}$ be the level that corresponds to the true length of the signal, i.e. $\ell^{*}$ satisfies $2^{\ell^{*}-1}<n^{\alpha}\leq 2^{\ell^{*}}$ . Note that $0\leq\alpha<1$ implies $\ell_{\mathrm{max}}-\ell^{*}=\Theta(\log n)$ . Further, Lemma 11.1 shows that $n_{\ell^{*}}:=\#\mathcal{I}_{\mathrm{app}}(\ell^{*})$ satisfies

[TABLE]
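The level $\ell^{*}$ defined by $2^{\ell^{*}-1}<n^{\alpha}\leq 2^{\ell^{*}}$ is simply $\lceil\alpha\log_{2}n\rceil$, so that for $\alpha<1$ its distance to $\log_{2}n$ grows like $(1-\alpha)\log_{2}n$. A small sketch of this computation, with a guard against floating-point rounding at exact powers of two; the function name `signal_level` is hypothetical.

```python
import math


def signal_level(n, alpha):
    """Integer l >= 0 with 2**(l-1) < n**alpha <= 2**l."""
    target = n ** alpha
    level = max(0, math.ceil(math.log2(target)))
    # Guard against rounding errors in log2 near exact powers of two.
    while 2 ** level < target:
        level += 1
    while level > 0 and 2 ** (level - 1) >= target:
        level -= 1
    return level


n = 10 ** 6
levels = {alpha: signal_level(n, alpha) for alpha in (0.0, 0.25, 0.5, 0.75)}
```

For example, with $n=10^{6}$ and $\alpha=\frac{1}{2}$ the block length is $1000$ and the sketch returns the level with $512<1000\leq 1024$.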

Below we will consider the two disjoint situations $r/(1-\alpha)<\frac{1}{4}$ and $r/(1-\alpha)\geq\frac{1}{4}$ . We define $t^{*}$ such that

[TABLE]

Solving for $t^{*}$ , we have

[TABLE]

By (17)

[TABLE]

On the event $\{p_{(n_{\ell^{*}}/2)}<t^{*}\}$ we set $t:=p_{(n_{\ell^{*}}/2)}$ to see that the sup is not smaller than

[TABLE]

(21), (20) and $t^{*}=o(1)$ show that $sHC_{n}\geq n^{\frac{1-\alpha}{2}}$ for $n$ large enough.

Now we consider the event $\{p_{(n_{\ell^{*}}/2)}\geq t^{*}\}$ . We will show below

[TABLE]

But if $p_{(1)}\leq\frac{\log^{3/2}n_{\ell^{*}}}{n_{\ell^{*}}}$ , then we have $t^{*}\geq p_{(1)}$ for $n$ large enough by (20). On the event $\{p_{(1)}\leq t^{*}\leq p_{(n_{\ell^{*}}/2)}\}$ we obtain from (21)

[TABLE]

We will show that ${\mathbb{E}}\,T_{n}(\ell^{*})=\Omega(n^{\xi})$ for some $\xi>0$ and $\sqrt{\mathrm{Var}\,T_{n}(\ell^{*})}=o({\mathbb{E}}\,T_{n}(\ell^{*}))$ . Then Chebychev’s inequality will yield the desired conclusion

[TABLE]

Recall the notation ${\boldsymbol{X}}(I):=\sum_{i\in I}X_{i}/\sqrt{|I|}$ , so ${\boldsymbol{X}}(I)\sim\mathcal{N}({\mathbb{E}}{\boldsymbol{X}}(I),1)$ . Denote $\mu^{\prime}:=\sqrt{2r\log n}(1-\frac{1}{3\sqrt{\ell_{\mathrm{max}}-\ell^{*}+4}})$ . By the construction of $\mathcal{I}_{\mathrm{app}}(\ell^{*})$ (see also Proposition 11.2(ii)), there are at least $n^{1-\alpha-\beta}$ intervals $I\in\mathcal{I}_{\mathrm{app}}(\ell^{*})$ satisfying

[TABLE]

Situation 1: If $r/(1-\alpha)<\frac{1}{4}$ , then we have:

[TABLE]

by (20) and Mills' ratio. The condition $\rho^{*}(\alpha,\beta)<r<\frac{1-\alpha}{4}$ implies $\frac{1-\alpha}{2}-\beta+r>0$, so we can take $0<\xi<\frac{1-\alpha}{2}-\beta+r$ to conclude ${\mathbb{E}}\,T_{n}(\ell^{*})=\Omega(n^{\xi})$.

In order to compute the variance of $T_{n}(\ell^{*})$ note that by Lemma 11.1 the intervals in $\mathcal{I}_{\mathrm{app}}(\ell^{*})$ can be grouped into $i_{\mathrm{max}}\leq 144\log_{2}n$ groups $\mathcal{J}_{i}(\ell^{*})$ , $i=1,\ldots,i_{\mathrm{max}}$ , each of which contains not more than $\#\mathcal{I}_{\mathrm{app}}(\ell^{*})=n_{\ell^{*}}$ disjoint intervals. Thus within each group $\mathcal{J}_{i}(\ell^{*})$ the ${\boldsymbol{X}}(I)$ are independent and therefore

[TABLE]

by (20) and since the number of $I\in\mathcal{J}_{i}(\ell^{*})$ that intersect with one of the $m=n^{1-\alpha-\beta}$ intervals that have an elevated mean cannot be larger than $2m$, and an overlap results in ${\mathbb{E}}{\boldsymbol{X}}(I)\leq\sqrt{2r\log n}$. Applying Cauchy-Schwarz to the covariances between the $i_{\mathrm{max}}\leq 144\log_{2}n$ groups gives

[TABLE]

by (20). Since $\rho^{*}(\alpha,\beta)<r<\frac{1-\alpha}{4}$ implies $\max(0,\frac{3r-\beta}{2})<\frac{1-\alpha}{2}-\beta+r$ , we conclude $\sqrt{\mathrm{Var}\,T_{n}(\ell^{*})}=o({\mathbb{E}}\,T_{n}(\ell^{*}))$ , and thus Chebychev’s inequality gives (23).

Situation 2: When $r\geq\frac{1-\alpha}{4}$, by a very similar calculation as above we obtain ${\mathbb{E}}\,T_{n}(\ell^{*})\geq L_{n}n^{1-\alpha-\beta}n^{-(\sqrt{1-\alpha}-\sqrt{r})^{2}}$ and $\sqrt{\mathrm{Var}\,T_{n}(\ell^{*})}=o({\mathbb{E}}\,T_{n}(\ell^{*}))$. Since we assume $r>\max(\frac{1-\alpha}{4},\rho^{*}(\alpha,\beta))$, we can find $\xi$ with $0<\xi<1-\alpha-\beta-(\sqrt{1-\alpha}-\sqrt{r})^{2}$, and hence (23) also follows in this situation.
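The two situations give two formulas for the exponent that bounds $\xi$ from above: $\frac{1-\alpha}{2}-\beta+r$ when $r<\frac{1-\alpha}{4}$ and $1-\alpha-\beta-(\sqrt{1-\alpha}-\sqrt{r})^{2}$ when $r\geq\frac{1-\alpha}{4}$. The sketch below evaluates this exponent and illustrates that the two formulas agree at $r=\frac{1-\alpha}{4}$. It assumes $0<r<1-\alpha$; the function name and the parameter values are illustrative only.

```python
import math


def growth_exponent(r, alpha, beta):
    """Upper limit for xi in E T_n = Omega(n^xi), assuming 0 < r < 1 - alpha."""
    if r < (1 - alpha) / 4:
        # Situation 1
        return (1 - alpha) / 2 - beta + r
    # Situation 2
    return 1 - alpha - beta - (math.sqrt(1 - alpha) - math.sqrt(r)) ** 2


alpha, beta = 0.2, 0.5
above = growth_exponent(0.15, alpha, beta)    # r slightly above beta - (1-alpha)/2
below = growth_exponent(0.05, alpha, beta)    # r below that threshold
at_switch = growth_exponent((1 - alpha) / 4, alpha, beta)
```

With these values the exponent is positive at $r=0.15$ and negative at $r=0.05$, in line with the requirement $r>\beta-\frac{1-\alpha}{2}$ in Situation 1.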

Thus we have shown that $r>\rho^{*}(\alpha,\beta)$ and $p_{(1)}\leq\frac{\log^{3/2}n_{\ell^{*}}}{n_{\ell^{*}}}$ imply (23) for some $\xi>0$ , and the proof for $sHC_{n}$ will be complete once we show (22):

[TABLE]

since $\#\mathcal{J}_{i}(\ell^{*})\geq n2^{-\ell^{*}}-2$ by Lemma 11.1. Now (22) follows since $\frac{n2^{-\ell^{*}}\log^{3/2}n_{\ell^{*}}}{n_{\ell^{*}}}\geq\frac{\log^{3/2}n_{\ell^{*}}}{144\log_{2}n}\rightarrow+\infty$ by (20), completing the proof for $sHC_{n}$ .

As for $sS_{n}(s)$, Lemma 7.2 in Jager and Wellner (2007) shows that $K_{s}(u,v)1(v<u\leq\frac{1}{2})\leq K_{2}(u,v)1(v<u\leq\frac{1}{2})$ for all $s\in[-1,2]$. Thus $S_{n}^{+}(s)\leq S_{n}^{+}(2)=\frac{1}{2}(HC_{n}^{+})^{2}$ and therefore $sS_{n}(s)\leq\frac{1}{2}\left(sHC_{n}\right)^{2}$. Hence it follows from Theorem 3.1 that under the null distribution

[TABLE]

for all $-1\leq s\leq 2$ . (That theorem also provides a better bound for the special case $s=1$ .)

Now we examine the performance of $sS_{n}(s)$ when $r>\rho^{*}(\alpha,\beta)$. As in Donoho and Jin (2004), we need to consider two cases: $\rho^{*}(\alpha,\beta)<r<\beta/3$ and $r>(\sqrt{1-\alpha}-\sqrt{1-\alpha-\beta})^{2}$. These two cases overlap and together cover the full region $r>\rho^{*}(\alpha,\beta)$.

In the first case where $0<\rho^{*}(\alpha,\beta)<r<\beta/3$, we must have $\beta<\frac{3}{4}(1-\alpha)$ and hence $r<\frac{1-\alpha}{4}$, and so we can choose a positive $r_{0}<r<\frac{1-\alpha}{4}$. Let $\ell^{*}$ be the level that corresponds to the true length of the signal. Define

[TABLE]

where the p-values pertain to intervals in $\mathcal{I}_{\mathrm{app}}(\ell^{*})$ . We need the following lemma which is proved in the Appendix:

Lemma 11.3.

Let $p_{(i)}$ be the ordered p-values for intervals in $\mathcal{I}_{\mathrm{app}}(\ell^{*})$. Then $0<\rho^{*}(\alpha,\beta)<r<\beta/3$ implies $\sup_{n^{-4r}<p_{(i)}<n^{-4r_{0}}}|\frac{i}{n_{\ell^{*}}p_{(i)}}-1|\stackrel{p}{\rightarrow}0$.

Using the above lemma and Lemma 7.2 in Jager and Wellner (2007) we have

[TABLE]

Thus

[TABLE]

for some $\xi>0$ by the above proof about $sHC_{n}$ that localized the analysis to $t^{*}=L_{n}n^{-4r}$ .

For the second case, if $r>(\sqrt{1-\alpha}-\sqrt{1-\alpha-\beta})^{2}$ and $r<1-\alpha$, then $(r+\beta)/(2\sqrt{r})<\sqrt{1-\alpha}$. So we can pick $q\in(0,1)$ such that $\max((r+\beta)/(2\sqrt{r}),\sqrt{r})<\sqrt{q}<\sqrt{1-\alpha}$. As noted above, there are at least $n^{1-\alpha-\beta}$ intervals $I\in\mathcal{I}_{\mathrm{app}}(\ell^{*})$ satisfying (24). By Lemma 11.1 the intervals in $\mathcal{I}_{\mathrm{app}}(\ell^{*})$ can be grouped into at most $144\log_{2}n$ groups such that each group consists of disjoint intervals. By the pigeonhole principle, at least one group contains at least $\frac{n^{1-\alpha-\beta}}{144\log_{2}n}$ intervals satisfying (24). Since the ${\boldsymbol{X}}(I)$ in that group are independent we have

[TABLE]

by Chebychev’s inequality, since $\xi:=1-\alpha-\beta-(\sqrt{q}-\sqrt{r})^{2}>1-\alpha-q>0$ . Setting $t:=\sqrt{2q\log n}$ we get

[TABLE]

since $n_{\ell^{*}}\bar{\Phi}(t)=L_{n}n^{1-\alpha-q}$ by (20) and $\xi>1-\alpha-q$ .
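The choice of $q$ in this second case can be made concrete. The sketch below computes the admissible range for $q$ and checks the chain $\xi>1-\alpha-q>0$ for one parameter triple satisfying $(\sqrt{1-\alpha}-\sqrt{1-\alpha-\beta})^{2}<r<1-\alpha$. The function names and the numerical values are illustrative assumptions.

```python
import math


def admissible_q_interval(r, alpha, beta):
    """Return (lo, hi): any q with lo < q < hi satisfies
    max((r + beta) / (2 sqrt(r)), sqrt(r)) < sqrt(q) < sqrt(1 - alpha)."""
    lower = max((r + beta) / (2 * math.sqrt(r)), math.sqrt(r)) ** 2
    upper = 1 - alpha
    return lower, upper


def xi(q, r, alpha, beta):
    """xi := 1 - alpha - beta - (sqrt(q) - sqrt(r))^2 as defined in the proof."""
    return 1 - alpha - beta - (math.sqrt(q) - math.sqrt(r)) ** 2


alpha, beta, r = 0.2, 0.7, 0.45
threshold = (math.sqrt(1 - alpha) - math.sqrt(1 - alpha - beta)) ** 2
lo, hi = admissible_q_interval(r, alpha, beta)
q = (lo + hi) / 2                 # any point of the open interval works
xi_value = xi(q, r, alpha, beta)
```

The inequality $\xi>1-\alpha-q$ is equivalent to $\sqrt{q}>(r+\beta)/(2\sqrt{r})$, which is why the lower end of the interval has this form.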

Together with Lemma 7.2 in Jager and Wellner (2007) and (17) we obtain

[TABLE]

It follows with (27) and (20) that

[TABLE]

From equations (25), (26) and (28) it follows that for all $-1\leq s\leq 2$ , $sS_{n}(s)$ has asymptotic power 1 under the alternative $r>\rho^{*}(\alpha,\beta)$ . ∎

Proof of Theorem 4.1.

(ii) The null case was discussed in Theorem 3.1 which showed $sHC_{n}=O_{p}(\log^{2}n)$ . As in the proof of part (i), we only need to show that $sHC_{n}=\Omega_{p}(n^{\xi})$ for some $\xi>0$ when $r>\rho^{*}(\alpha,\beta)$ . Again, let $\ell^{*}$ be the level that corresponds to the true length of the signal, i.e. $\ell^{*}$ satisfies $2^{\ell^{*}-1}<n^{\alpha}\leq 2^{\ell^{*}}$ . $0\leq\alpha<1$ implies $\ell_{\mathrm{max}}-\ell^{*}=\Theta(\log n)$ . By the construction of $\mathcal{I}_{\mathrm{app}}(\ell^{*})$ (see also Proposition 11.2(ii)) there are at least $n^{1-\alpha-\beta}$ intervals $I\in\mathcal{I}_{\mathrm{app}}(\ell^{*})$ satisfying ${\mathbb{E}}({\boldsymbol{X}}(I))\geq\ n^{r}(1-\frac{1}{3\sqrt{\ell_{\mathrm{max}}-\ell^{*}+4}})\geq n^{r}/2$ . Hence

[TABLE]

since $\bar{\Phi}^{\prime}\leq-\frac{1}{4}$ on $\Bigl{(}\bar{\Phi}^{-1}(\frac{1}{4})-\frac{1}{2},\bar{\Phi}^{-1}(\frac{1}{4})\Bigr{)}$ and we may w.l.o.g. assume that $n^{r}<1$ since $\rho^{*}<0$ .

By Lemma 11.1, the intervals in $\mathcal{I}_{\mathrm{app}}(\ell^{*})$ can be grouped into at most $144\log_{2}n$ groups, each of which contains not more than $\#\mathcal{I}_{\mathrm{app}}(\ell^{*})=n_{\ell^{*}}$ disjoint intervals. Thus within each group the ${\boldsymbol{X}}(I)$ are independent, and applying Cauchy-Schwarz to the covariances between groups gives

[TABLE]

Together with (29) and (20) this shows that

[TABLE]

satisfies ${\mathbb{E}}\,T_{n}(\ell^{*})\geq L_{n}n^{\frac{1-\alpha}{2}-\beta+r}$ and $\mathrm{Var}\,T_{n}(\ell^{*})\leq L_{n}$, hence Chebychev's inequality and $r>\beta-\frac{1-\alpha}{2}$ yield

[TABLE]

for some $\xi>0$ .

Now we partition the sample space into three events:

[TABLE]

by (30), (20) and ${\rm P}(\frac{1}{4}<p_{(1)})\leq(\frac{3}{4})^{n_{\ell^{*}}}\rightarrow 0$ . ∎

11.5 Proofs for Section 5

Proof of Theorem 5.1.

In the dense case let $r:=\beta-\frac{1-\alpha}{2}+\epsilon$ for some $\epsilon>0$ . Then $\sum_{i=1}^{n}X_{i}/\sqrt{n}$ is normal with variance one and mean $n^{1-\beta}\mu/\sqrt{n}=n^{1-\beta+r-\frac{1+\alpha}{2}}=n^{\epsilon}$ . Thus $P_{n}\geq\sum_{i=1}^{n}X_{i}/\sqrt{n}-\sqrt{2}\stackrel{{\scriptstyle p}}{{\rightarrow}}\infty$ . Hence $P_{n}$ has asymptotic power one since $P_{n}=O_{p}(1)$ under $H_{0}$ .
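The exponent arithmetic in the dense case, $1-\beta+r-\frac{1+\alpha}{2}=\epsilon$ for $r=\beta-\frac{1-\alpha}{2}+\epsilon$, can be verified exactly with rational numbers. The sketch below does so for a few arbitrary parameter triples; the function name is illustrative.

```python
from fractions import Fraction


def dense_mean_exponent(alpha, beta, eps):
    """Exponent of n in the mean of sum(X_i)/sqrt(n) when r = beta - (1-alpha)/2 + eps."""
    r = beta - (1 - alpha) / 2 + eps
    return 1 - beta + r - (1 + alpha) / 2


triples = [
    (Fraction(1, 5), Fraction(3, 10), Fraction(1, 100)),
    (Fraction(0), Fraction(1, 4), Fraction(1, 20)),
    (Fraction(1, 2), Fraction(1, 10), Fraction(1, 7)),
]
exponents = [dense_mean_exponent(a, b, e) for a, b, e in triples]
```

In each case the exponent equals $\epsilon$, so the mean of the normalized sum is $n^{\epsilon}\rightarrow\infty$.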

In the sparse case, if $r>\rho_{\mathrm{pen}}^{*}(\alpha,\beta)$ then we can pick a constant $\epsilon>0$ depending only on $(r,\alpha,\beta)$ such that $1-\alpha-\beta>((1+\epsilon)\sqrt{1-\alpha}-\sqrt{r})^{2}$ . For a block $I_{g}=:(j,j+n^{\alpha}]$ in (6) write $Z_{g}:=\sum_{i=j+1}^{j+n^{\alpha}}X_{i}/\sqrt{n^{\alpha}}$ . In order to show that $P_{n}$ has asymptotic power one, it is enough to show that

[TABLE]

because $P_{n}=O_{p}(1)$ under $H_{0}$ while $\epsilon\sqrt{2\log\frac{n}{n^{\alpha}}}=\epsilon\sqrt{2(1-\alpha)\log n}\rightarrow\infty$ .

Note that the $Z_{g}$ are independent normal with mean $\sqrt{n^{\alpha}}\mu=\sqrt{2r\log n}$ and variance one. Therefore

[TABLE]

by Mills' ratio. Hence the probability in (31) equals

[TABLE]

and $mp_{n}\geq L_{n}n^{1-\alpha-\beta-(\sqrt{r}-(1+\epsilon)\sqrt{1-\alpha})^{2}}\rightarrow\infty$ .

The claim for $P_{n}^{\mathrm{app}}$ obtains in the same way, by taking account of the approximation error incurred by using the approximating set, see (Rivera and Walther, 2013, Theorem 2) and (Kou, 2017, Theorem 11).

Proceeding as in the proof of Theorem 1.4 in Donoho and Jin (2004), it can be shown that $P_{n}$ and $P_{n}^{\mathrm{app}}$ are powerless if $r<\rho_{\mathrm{pen}}^{*}(\alpha,\beta)$. ∎

11.6 Proofs for Section 7

The following lemma is the bivariate analogue of Lemma 11.1:

Lemma 11.4.

The rectangles in $\mathcal{I}_{\mathrm{app}}^{(2)}(\ell)$ can be grouped into at most $12\epsilon_{\ell}^{-4}(\ell+1)\leq 8\cdot 6^{5}(\log_{2}n)^{2}(\ell+1)$ groups such that each group consists of at least $\frac{9}{16}\frac{n^{2}}{2^{\ell}}$ and at most $2\frac{n^{2}}{2^{\ell}}$ disjoint rectangles. Hence $\#\mathcal{I}_{\mathrm{app}}^{(2)}(\ell)\leq 16\cdot 6^{5}(\log_{2}n)^{2}n^{2}\frac{\ell+1}{2^{\ell}}$ .

Proof of Lemma 11.4.

We will use the following refinement of Lemma 11.1 for the univariate setting:

Claim 1.

The intervals in $\mathcal{I}_{\mathrm{app}}(\ell)$ that have a given length $L$ (which hence is a multiple of $d_{\ell}$ ) can be grouped into $\frac{L}{d_{\ell}}$ groups such that each group consists of either $\lfloor\frac{n}{L}\rfloor$ or $\lfloor\frac{n}{L}\rfloor-1$ disjoint intervals.

To see this, set $I_{j}:=(jd_{\ell},jd_{\ell}+L]$ for $j=0,\ldots,\frac{L}{d_{\ell}}-1$ , and consider all possible shifts of $I_{j}$ by multiples of $L$ :

[TABLE]

One readily checks that $\bigcup_{j=0}^{\frac{L}{d_{\ell}}-1}C(j)$ equals the collection of all intervals in $\mathcal{I}_{\mathrm{app}}(\ell)$ that have length $L$ . Further, each $C(j)$ consists of $\lfloor\frac{n}{L}\rfloor$ or $\lfloor\frac{n}{L}\rfloor-1$ intervals that are disjoint, proving Claim 1.
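Claim 1 can be illustrated by constructing the groups explicitly. The sketch below assumes that the intervals of length $L$ in question are all $(s,s+L]\subset(0,n]$ whose left endpoint $s$ is a multiple of $d_{\ell}$, and that $C(j)$ collects the shifts of $I_{j}$ by multiples of $L$ that stay inside $(0,n]$; the function name and the values of $n$, $d$ and $L$ are hypothetical.

```python
def shift_groups(n, d, L):
    """Group intervals (s, s + L] within (0, n], s a multiple of d, by s mod L.

    Assumes d divides L. Returns a list of L // d groups of (left, right) pairs.
    """
    if L % d != 0:
        raise ValueError("L must be a multiple of d")
    groups = []
    for j in range(L // d):
        # Shifts of I_j = (j*d, j*d + L] by multiples of L that fit in (0, n].
        starts = range(j * d, n - L + 1, L)
        groups.append([(s, s + L) for s in starts])
    return groups


n, d, L = 100, 2, 8
groups = shift_groups(n, d, L)
sizes = [len(g) for g in groups]
all_starts = sorted(s for g in groups for s, _ in g)
```

Intervals within one group are disjoint because consecutive left endpoints differ by exactly $L$, and the group sizes differ by at most one.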

Now we consider the rectangles in $\mathcal{I}_{\mathrm{app}}^{(2)}(\ell)$ that have given sidelengths $L_{1}$ and $L_{2}$ :

[TABLE]

where $\ell_{i}=\lceil\log_{2}L_{i}\rceil$ .

Claim 2.

The rectangles in $C(\ell,L_{1},L_{2})$ can be grouped into at most $4\epsilon_{\ell}^{-2}\leq 4\cdot 6^{2}\log_{2}n^{2}$ groups such that each group consists of at least $(\lfloor\frac{n}{L_{1}}\rfloor-1)(\lfloor\frac{n}{L_{2}}\rfloor-1)\geq\frac{9}{16}\frac{n^{2}}{2^{\ell}}$ and at most $\lfloor\frac{n}{L_{1}}\rfloor\lfloor\frac{n}{L_{2}}\rfloor\leq 2\frac{n^{2}}{2^{\ell}}$ disjoint rectangles.

In order to prove Claim 2, note that Claim 1 implies that the rectangles in $C(\ell,L_{1},L_{2})$ can be grouped into $\frac{L_{1}}{d_{\ell_{1}}}\times\frac{L_{2}}{d_{\ell_{2}}}\leq\frac{4}{\epsilon_{\ell}^{2}}\leq 4\cdot 6^{2}\log_{2}n^{2}$ groups such that each group contains between $(\lfloor\frac{n}{L_{1}}-1\rfloor)(\lfloor\frac{n}{L_{2}}-1\rfloor)$ and $\lfloor\frac{n}{L_{1}}\rfloor\lfloor\frac{n}{L_{2}}\rfloor$ rectangles that are disjoint (since the Cartesian product of two collections of disjoint intervals yields a collection of disjoint rectangles). Since the area of the rectangles satisfies $L_{1}L_{2}\in(2^{\ell-1},2^{\ell}]$ we get $\lfloor\frac{n}{L_{1}}\rfloor\lfloor\frac{n}{L_{2}}\rfloor\leq 2\frac{n^{2}}{2^{\ell}}$ . Finally, $L_{i}\leq n/8$ implies $(\lfloor\frac{n}{L_{1}}-1\rfloor)(\lfloor\frac{n}{L_{2}}-1\rfloor)\geq(\frac{3}{4}\frac{n}{L_{1}})(\frac{3}{4}\frac{n}{L_{2}})\geq\frac{9}{16}\frac{n^{2}}{2^{\ell}}$ , establishing Claim 2.

The lemma now obtains as follows: Clearly, $\mathcal{I}_{\mathrm{app}}^{(2)}(\ell)=\bigcup_{\{\text{all possible } L_{1},L_{2}\}}C(\ell,L_{1},L_{2})$. Since the level $\ell_{1}$ of $L_{1}$ must satisfy $\ell_{1}\leq\ell$ and each $\mathcal{I}_{\mathrm{app}}(\tilde{\ell},\epsilon_{\ell})$ admits at most $\lceil 2^{\tilde{\ell}-1}/d_{\tilde{\ell}}\rceil\leq\lceil 1/\epsilon_{\ell}\rceil$ different interval lengths, there are at most $\lceil 1/\epsilon_{\ell}\rceil(\ell+1)$ different choices for $L_{1}$. The constraint $\ell\leq\ell_{1}+\ell_{2}\leq\ell+1$ from Proposition 11.2 implies that given $L_{1}$, the level $\ell_{2}$ of $L_{2}$ must be either $\ell-\ell_{1}$ or $\ell-\ell_{1}+1$, hence there are at most $\lceil 2/\epsilon_{\ell}\rceil$ different choices for $L_{2}$. So there are at most $\frac{3}{\epsilon_{\ell}^{2}}(\ell+1)\leq 3\cdot 6^{2}(\log_{2}n^{2})(\ell+1)$ different choices for $(L_{1},L_{2})$. Lemma 11.4 now follows with Claim 2. We note that the statement of the lemma can be sharpened somewhat as the factor $\frac{9}{16}$ is due to large rectangles which allow a better bound on $12\epsilon_{\ell}^{-4}(\ell+1)$. ∎
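The constants in Lemma 11.4 can be checked by multiplying the bound of Claim 2 with the number of choices for $(L_{1},L_{2})$. The sketch below reads $\log_{2}n^{2}$ as $\log_{2}(n^{2})=2\log_{2}n$, which is an interpretation but the one under which the stated constants match exactly; the function name is hypothetical and $n$ is taken to be a power of two so that all quantities are integers.

```python
def lemma_11_4_bounds(log2_n, level):
    """Combine Claim 2 with the count of (L1, L2) pairs, for n = 2**log2_n."""
    inv_eps_sq = 36 * 2 * log2_n              # bound 6^2 * log_2(n^2) on eps^-2
    groups_per_shape = 4 * inv_eps_sq         # Claim 2: at most 4 eps^-2 groups
    shapes = 3 * inv_eps_sq * (level + 1)     # at most 3 eps^-2 (l+1) pairs (L1, L2)
    total_groups = groups_per_shape * shapes  # bound on 12 eps^-4 (l+1)
    stated = 8 * 6 ** 5 * log2_n ** 2 * (level + 1)
    return total_groups, stated


total_groups, stated = lemma_11_4_bounds(log2_n=10, level=7)
```

The agreement rests on the identity $12\cdot(6^{2}\cdot 2)^{2}=8\cdot 6^{5}$; multiplying by the upper bound $2\frac{n^{2}}{2^{\ell}}$ on the group size then gives the factor $16\cdot 6^{5}$ in the bound on $\#\mathcal{I}_{\mathrm{app}}^{(2)}(\ell)$.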

Proof of Theorem 7.1.

The proof follows that of Theorem 3.1 using the inequalities from Lemma 11.4 in place of Lemma 11.1. That is, for a fixed $\ell$ we now have $n_{\ell}\leq 16\cdot 6^{5}(\log_{2}n)^{2}n^{2}$, $\frac{9}{16}\frac{n^{2}}{2^{\ell}}\leq\#G_{i}\leq n^{2}$, $i_{\mathrm{max}}\leq 16\cdot 6^{5}(\log_{2}n)^{3}$ and $\ell_{\mathrm{max}}+1\leq 2\log_{2}n$. As for $sBJ_{n}^{(2)}$, the two additional factors of $\log_{2}n$ in $i_{\mathrm{max}}$ and the factor $\frac{9}{16}$ in the lower bound for $\#G_{i}$ necessitate replacing the condition $c>3$ by $c>(3+2)\frac{16}{9}$ in order to obtain the desired convergence to 0. This bound on $c$ can be improved somewhat by refining the bounds in Lemma 11.4 as explained at the end of its proof. Concerning $sHC_{n}^{(2)}$, the convergence rate needs to account for the two additional factors of $\log_{2}n$ in $i_{\mathrm{max}}$. ∎

Proof of Theorem 7.2.

The proof of the lower bound is analogous to that of Theorem 2.1 by considering the submodel obtained by partitioning the $n\times n$ grid into $n^{\prime}=n^{2-2\alpha}$ blocks of size $|I|=n^{2\alpha}$ . The claim about $sHC_{n}^{(2)}$ and $sBJ_{n}^{(2)}$ follows as in Theorem 4.1 by using $n^{2}$ in place of $n$ . ∎

11.7 Proofs for Section 8

Proof of Proposition 8.1.

There are at most $\frac{n}{d_{\ell}}\leq n2^{\frac{-\ell+1}{2}}\sqrt{\log_{2}n^{2}}$ indices $j$ in $\mathcal{C}_{\mathrm{app}}(\ell)$ and likewise for $k$ , while there are at most $\frac{1}{\epsilon_{\ell}}+1\leq\sqrt{\log_{2}n^{2}}+1$ indices $i$ . Hence $\#\mathcal{C}_{\mathrm{app}}(\ell)\leq 2n^{2}2^{-\ell}(\sqrt{\log_{2}n^{2}}+1)^{3}$ and (i) follows.

As for (ii), by the assumption on $R^{2}$ there exists $\ell\in\{0,\ldots,\lceil\log_{2}\frac{n^{2}}{8}\rceil\}$ such that $2^{\ell-1}<R^{2}\leq 2^{\ell}$ . We can now find a $B_{r_{i}}(j,k)\in\mathcal{C}_{\mathrm{app}}(\ell)$ with the desired property: Let $i$ be the largest integer such that $r_{i}\leq R^{2}$ . Then by the construction of $r_{i}$ we have $r_{i}^{2}/R^{2}\geq 2^{-\epsilon_{\ell}}\geq 1-\epsilon_{\ell}$ . Let $j$ and $k$ be the elements in $\{m\,d_{\ell},m\in{\mathbb{N}}\}\cap[r_{i},n-r_{i}+1]$ that are closest to $s$ and $t$ , respectively. Then $|j-s|\leq d_{\ell}$ , $|k-t|\leq d_{\ell}$ , and therefore the Euclidean distance between $(j,k)$ and $(s,t)$ is not larger than $\sqrt{2}d_{\ell}$ . Thus it follows from Lemma 11.5 below that

[TABLE]

∎

Lemma 11.5.

Let $0<r\leq R$ and $d\in\mathbb{R}^{2}$ . Then

[TABLE]

Proof of Lemma 11.5.

[TABLE]

If $|d|\leq 2R$ , then $B_{R}(d)\cap B_{R}(0)$ is the union of two circular segments with equal area. The formula for a circular segment gives

[TABLE]

Hence (32) is not larger than $R^{2}\pi-r^{2}\pi+2\pi|d|R$ . The lemma follows as it trivially also holds in the case $|d|>2R$ . ∎
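The circular-segment computation can be illustrated numerically. The sketch below uses the standard formula for the area of the intersection of two discs of equal radius $R$ whose centres are $|d|$ apart, and checks that the part of $B_{R}(d)$ not covered by $B_{R}(0)$ has area at most $\pi|d|R$. Treating this as the quantity behind the $|d|R$ term is an assumption made for illustration; the function names are hypothetical.

```python
import math


def lens_area(R, dist):
    """Area of the intersection of two discs of radius R with centres `dist` apart."""
    if dist >= 2 * R:
        return 0.0
    return (2 * R ** 2 * math.acos(dist / (2 * R))
            - 0.5 * dist * math.sqrt(4 * R ** 2 - dist ** 2))


def uncovered_area(R, dist):
    """Area of B_R(d) minus B_R(0), with |d| = dist."""
    return math.pi * R ** 2 - lens_area(R, dist)


# Check the bound uncovered_area <= pi * dist * R on a grid of radii and shifts.
bound_holds = all(
    uncovered_area(R, dist) <= math.pi * dist * R + 1e-12
    for R in (0.5, 1.0, 3.0, 10.0)
    for dist in [R * k / 20 for k in range(0, 61)]
)
```

For $|d|<2R$ the uncovered area equals $2R^{2}(\arcsin x+x\sqrt{1-x^{2}})$ with $x=\frac{|d|}{2R}$, which is at most $(\frac{\pi}{2}+1)|d|R\leq\pi|d|R$.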

Proof of Theorem 8.2.

The claims about the null distribution follow as in the case of univariate intervals (Theorem 3.1) and multivariate rectangles (Theorem 7.1). The key argument is again to show that the balls in $\mathcal{C}_{\mathrm{app}}(\ell)$ can be grouped into a small number of groups each consisting of $\sim\frac{n^{2}}{2^{\ell+2}}$ disjoint balls. To this end, define $L_{\ell}$ to be the smallest multiple of $d_{\ell}$ not smaller than $\max_{i}2r_{i}$ , so $L_{\ell}\sim 2\sqrt{2^{\ell}}$ . Define

[TABLE]

By construction, the balls in $\mbox{shift}_{\ell}(j,k,r_{i})$ are mutually disjoint. One readily checks

[TABLE]

There are $\sim\Bigl{(}\frac{n}{L_{\ell}}\Bigr{)}^{2}\sim\frac{n^{2}}{2^{\ell+2}}$ balls in $\mbox{shift}_{\ell}(j,k,r_{i})$ , and the number of groups is $\sim\Bigl{(}\frac{L_{\ell}}{d_{\ell}}\Bigr{)}^{2}\frac{1}{\epsilon_{\ell}}\sim 8\epsilon_{\ell}^{-3}\leq 8(\log n)^{\frac{3}{2}}$ . The latter number has an additional factor $(\log n)^{\frac{1}{2}}$ compared to the case of univariate intervals, which likewise affects the convergence rate of $sHC_{n}^{(2)}$ as is clear from the proof of Theorem 3.1. The proof of the optimality properties follows that of Theorem 7.2. ∎

Appendix

Proof of Lemma 11.3.

Note that using the same considerations as in (17) we obtain

[TABLE]

By Lemma 11.1 the intervals in $\mathcal{I}_{\mathrm{app}}(\ell^{*})$ can be grouped into $i_{\mathrm{max}}\leq 144\log_{2}n$ groups, each of which consists of the same (up to $\pm 1$ ) number $N_{\ell^{*}}=L_{n}n^{1-\alpha}$ of disjoint intervals as the first group. Let $I_{1},\ldots,I_{N_{\ell^{*}}}$ denote the intervals in the first group. Then for $\epsilon\in(0,1)$

[TABLE]

since there are not more than $n$ p-values in $(n^{-4r},n^{-4r_{0}})$. The $I_{i}$ being disjoint implies that the ${\boldsymbol{X}}(I_{i})$ are independent and that at most $2n^{1-\alpha-\beta}$ of the $I_{i}$ can intersect with one of the $n^{1-\alpha-\beta}$ intervals that have an elevated mean. Such an overlap results in ${\mathbb{E}}{\boldsymbol{X}}(I)\leq\sqrt{2r\log n}$. Thus under $H_{1}$

[TABLE]

Note that the function $\frac{\bar{\Phi}(t)}{\bar{\Phi}(t-\sqrt{2r\log n})}$ is decreasing in $t$ as can be seen by differentiating and employing the increasing hazard rate property of the normal distribution. So if we define $t^{*}$ via $\bar{\Phi}(t^{*})=n^{-4r}$ , then $t^{*}=(1+o(1))\sqrt{2(4r)\log n}$ and

[TABLE]

Hence for $n\geq n_{0}$ the above inf is larger than $4/\epsilon$ and so together with (34) we get

[TABLE]

Now we use Bennett’s inequality, which gives

[TABLE]

Thus (35) is not larger than

[TABLE]

for some $\kappa>0$ as $r<\beta/3$ requires $\beta<\frac{3}{4}(1-\alpha)$ and hence $r<(1-\alpha)/4$ . The left tail probability in (33) is easily bounded analogously using the left inequality in (34). Hence (33) is not larger than

[TABLE]

∎
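The monotonicity of $t\mapsto\frac{\bar{\Phi}(t)}{\bar{\Phi}(t-\mu)}$ for $\mu>0$, which the proof above uses with $\mu=\sqrt{2r\log n}$, can be checked on a grid. The sketch below evaluates the standard normal survival function through the complementary error function; the value $\mu=2$ and the grid are illustrative choices.

```python
import math


def normal_sf(t):
    """Standard normal survival function: P(Z > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def tail_ratio(t, mu):
    """Ratio of normal tails sf(t) / sf(t - mu)."""
    return normal_sf(t) / normal_sf(t - mu)


mu = 2.0
grid = [-3.0 + 0.1 * k for k in range(111)]       # t from -3 to 8
ratios = [tail_ratio(t, mu) for t in grid]
is_decreasing = all(x > y for x, y in zip(ratios, ratios[1:]))
```

A grid check of this kind is of course no proof; it only illustrates the behaviour that follows from the increasing hazard rate of the normal distribution.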

Acknowledgements

The authors were supported by NSF grants DMS-1220311 and DMS-1501767.

Bibliography

Arias-Castro, E., Candes, E., and Durand, A. (2011). Detection of an anomalous cluster in a network. The Annals of Statistics, 39(1):278–304.
Arias-Castro, E., Donoho, D. L., and Huo, X. (2005). Near-optimal detection of geometric objects by fast multiscale methods. IEEE Transactions on Information Theory, 51(7):2402–2425.
Cai, T. T., Jeng, X. J., and Jin, J. (2011). Optimal detection of heterogeneous and heteroscedastic mixtures. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):629–662.
Chan, H. P. (2009). Detection of spatial clustering with average likelihood ratio test statistics. The Annals of Statistics, 37(6B):3985–4010.
Chan, H. P. and Walther, G. (2013). Detection with the scan and the average likelihood ratio. Statistica Sinica, 23:409–428.
Delaigle, A. and Hall, P. (2009). Higher criticism in the context of unknown distribution, non-independence and classification. Perspectives in Mathematical Sciences I: Probability and Statistics, pages 109–138.
Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics, 32:962–994.
Duembgen, L. and Wellner, J. A. (2014). Confidence bands for distribution functions: A new look at the law of the iterated logarithm. arXiv:1402.2918.
