Covariate powered cross-weighted multiple testing

Nikolaos Ignatiadis; Wolfgang Huber

arXiv:1701.05179·stat.ME·September 1, 2021

TL;DR

This paper introduces a covariate-powered method called IHW that enhances multiple testing procedures by leveraging covariate information to increase detection power while maintaining false discovery rate control.

Contribution

It develops a data-driven weighting framework for multiple testing that incorporates covariates, providing finite-sample FDR guarantees through a cross-weighting approach.

Findings

1. IHW outperforms traditional methods lacking covariate information.

2. The approach maintains FDR control under dependence within folds.

3. Covariate-weighted p-values improve hypothesis ranking for rejection.

Equations

$$\text{Reject hypothesis } i \iff P_{i} \leq \hat{t}$$

$$\text{Reject hypothesis } i \iff P_{i} \leq \hat{t} \cdot \widehat{W}^{-\ell}(X_{i}), \quad \text{where } i \in I_{\ell}$$

$$\text{Reject hypothesis } i \iff P_{i} \leq \min\{w_{i} \cdot \hat{t},\, \tau\}$$

$$\hat{t} = \frac{\alpha \hat{k}}{m}, \quad \hat{k} = \max\left\{k \in \mathbb{N}_{\geq 0} \;\middle|\; P_{i} \leq \left(\frac{\alpha w_{i} k}{m}\right) \land \tau \text{ for at least } k \text{ p-values}\right\}$$

$$\widehat{\pi}_{0}(g) := \frac{1 + \sum_{i : X_{i} = g} \mathbf{1}(P_{i} > \tau)}{\lvert \{i : X_{i} = g\} \rvert\, (1 - \tau)} \land 1$$

$$W_{i} := \frac{1-\widehat{\pi}_{0}(X_{i})}{\widehat{\pi}_{0}(X_{i})} \bigg/ \sum_{i=1}^{m} \frac{1-\widehat{\pi}_{0}(X_{i})}{m \cdot \widehat{\pi}_{0}(X_{i})}$$

$$\widehat{\pi}_{0,W}' := \frac{\max_{i=1,\dots,m} W_{i} + \sum_{i=1}^{m} W_{i}\, \mathbf{1}(P_{i} > \tau)}{m (1 - \tau)}$$

$$\widehat{\pi}_{0}^{-\ell}(g) := \frac{1 + \sum_{i \notin I_{\ell} : X_{i} = g} \mathbf{1}(P_{i} > \tau)}{\lvert \{i \notin I_{\ell} : X_{i} = g\} \rvert\, (1 - \tau)} \land 1$$

$$W_{i} := \frac{1-\widehat{\pi}_{0}^{-\ell}(X_{i})}{\widehat{\pi}_{0}^{-\ell}(X_{i})} \bigg/ \sum_{i \in I_{\ell}} \frac{1-\widehat{\pi}_{0}^{-\ell}(X_{i})}{\lvert I_{\ell} \rvert \cdot \widehat{\pi}_{0}^{-\ell}(X_{i})}$$

$$\widehat{\pi}_{0,W,\ell}' := \frac{\max_{i \in I_{\ell}} W_{i} + \sum_{i \in I_{\ell}} W_{i}\, \mathbf{1}(P_{i} > \tau)}{\lvert I_{\ell} \rvert\, (1 - \tau)}$$

$$W_{i} := \frac{\lvert I_{\ell} \rvert\, \widehat{W}^{-\ell}(X_{i})}{\sum_{i \in I_{\ell}} \widehat{W}^{-\ell}(X_{i})} \;\text{ if } \sum_{i \in I_{\ell}} \widehat{W}^{-\ell}(X_{i}) > 0, \;\text{ else } W_{i} := 1$$

$$\widehat{\pi}_{0,W,\ell}' = \frac{\left(\max_{i \in I_{\ell}} W_{i}\right) + \sum_{i \in I_{\ell}} W_{i}\, \mathbf{1}(P_{i} > \tau')}{\lvert I_{\ell} \rvert\, (1 - \tau')} \quad \text{with } \tau' \in [\tau, 1)$$

$$X_{i} \sim \mathbb{P}^{X}, \quad H_{i} \mid (X_{i} = x) \sim \text{Bernoulli}(1 - \pi_{0}(x))$$

$$P_{i} \mid (H_{i} = 0, X_{i} = x) \sim U[0, 1], \quad P_{i} \mid (H_{i} = 1, X_{i} = x) \sim F_{\text{alt}}(\cdot \mid X_{i} = x)$$

$$\int W^{(I)}(x)^{2}\, d\mathbb{P}^{X}(x) \leq \Gamma \cdot \left(\int W^{(I)}(x)\, d\mathbb{P}^{X}(x)\right)^{2} \quad \text{for all subsets } I \subset \mathbb{N}$$

$$\left\lVert W^{([m])}(\cdot) - W^{*}(\cdot) \right\rVert_{\infty} \xrightarrow{\;\mathbb{P}\;} 0 \text{ as } m \to \infty, \quad \int W^{*}(x)\, d\mathbb{P}^{X}(x) = 1, \quad \int W^{*}(x)^{2}\, d\mathbb{P}^{X}(x) < \infty$$

$$k\text{-FWER}$$

$$= \frac{1}{k} \sum_{i \in \mathscr{H}_{0}} \mathbb{E}\left[\mathbb{P}\left[P_{i} \leq \frac{k \alpha W_{i}}{m} \;\middle|\; W_{i}\right]\right] \overset{(*)}{\leq} \frac{1}{k} \sum_{i \in \mathscr{H}_{0}} \mathbb{E}\left[\frac{k \alpha W_{i}}{m}\right] = \frac{\alpha}{m}\, \mathbb{E}\left[\sum_{i \in \mathscr{H}_{0}} W_{i}\right] \leq \alpha$$

$$(W_{i})_{i \in I_{\ell}} \in \operatorname*{argmax}_{w \in [0, \infty)^{\lvert I_{\ell} \rvert}} \left\{\sum_{i \in I_{\ell}} \widehat{F}^{-\ell}\left(k \alpha / m \cdot w_{i} \mid X_{i}\right) \;\middle|\; w_{i} \geq 0,\; \sum_{i \in I_{\ell}} w_{i} = \lvert I_{\ell} \rvert\right\}$$

$$\pi_{0}(x) = \operatorname{expit}(a_{0} + a^{\top} x), \quad \text{where } \operatorname{expit}(u) = \exp(u) / (1 + \exp(u))$$

$$F_{\text{alt}}(\cdot \mid X_{i} = x) = \text{Beta}(\beta(x), 1), \quad \beta(x) = b_{0} + b^{\top} x$$

$$\mathbf{t} = (t_{i})_{i \in I_{\ell}} \in \operatorname*{argmax}_{\mathbf{t} \in [0,1]^{\lvert I_{\ell} \rvert}} \left\{\sum_{i \in I_{\ell}} \widehat{F}^{-\ell}(t_{i} \mid X_{i}) \;\middle|\; t_{i} \geq 0,\; \sum_{i \in I_{\ell}} \widehat{\pi}_{0}^{-\ell}(X_{i})\, t_{i} \leq \alpha \sum_{i \in I_{\ell}} \widehat{F}^{-\ell}(t_{i} \mid X_{i})\right\}$$

$$\text{Power} := \mathbb{E}\left[\frac{\sum_{i \notin \mathscr{H}_{0}} \mathbf{1}(i \text{ rejected})}{\max\{1,\, m - \lvert \mathscr{H}_{0} \rvert\}}\right]$$

$$\tilde{X}_{i} = \lfloor 40 \cdot (i - 1) / m \rfloor, \quad X_{i} = \lceil \tilde{X}_{i} / 40 \cdot G \rceil$$

$$H_{i} \mid \tilde{X}_{i} \sim \text{Bernoulli}(1 - \pi_{0}(\tilde{X}_{i})), \quad \pi_{0}(\tilde{X}_{i}) = (0.2 + 0.8\, \tilde{X}_{i} / 36) \cdot \mathbf{1}(\tilde{X}_{i} \equiv 0 \bmod 4) + \mathbf{1}(\tilde{X}_{i} \not\equiv 0 \bmod 4)$$

$$Z_{i} \mid H_{i}, \tilde{X}_{i} \sim \mathcal{N}(H_{i} \cdot \mu(\tilde{X}_{i}), 1), \quad \mu(\tilde{X}_{i}) = 2.5 - 2\, \tilde{X}_{i} / 36$$

$$P_{i} = 1 - \Phi(Z_{i}), \quad \Phi \text{ is the standard Normal CDF}$$

$$\mathbb{P}^{X} = U[0, 1]^{2}, \quad \pi_{0}(x) = 0.98 \cdot \mathbf{1}(x_{1}^{2} + x_{2}^{2} \leq 1) + 0.6 \cdot \mathbf{1}(x_{1}^{2} + x_{2}^{2} > 1), \quad (\mathbb{E}[\pi_{0}(X_{i})] \approx 0.9)$$

$$F_{\text{alt}}(\cdot \mid X_{i} = x) = \text{Beta}(\beta(x), 1), \quad \beta(x) = 1 \big/ \max\left\{1.3,\; \bar{\beta} \cdot (\sqrt{x_{1}} + \sqrt{x_{2}})\right\}$$

$$\text{FDP}_{j} = \frac{1 + \lvert \{i : P_{i} \geq 1 - s_{j}(X_{i})\} \rvert}{\lvert \{i : P_{i} \leq s_{j}(X_{i})\} \rvert}$$

$$Y_{i,1}, \dots, Y_{i,n} \sim \mathcal{N}(\mu_{Y,i}, \sigma_{i}^{2}) \quad \text{and} \quad V_{i,1}, \dots, V_{i,n} \sim \mathcal{N}(\mu_{V,i}, \sigma_{i}^{2})$$

$$\mu_{Y,i} = \begin{cases} 0.5, & i = 1, \dots, m_{1} \\ 0.25, & i = m_{1} + 1, \dots, 2 m_{1} \\ 0, & \text{otherwise} \end{cases} \qquad \mu_{V,i} = \begin{cases} 0, & i = 1, \dots, m_{1} \\ 0.25, & i = m_{1} + 1, \dots, 2 m_{1} \\ 0, & \text{otherwise} \end{cases}$$

$$t_{i} := \inf\left\{z \geq 0 : \widehat{s}_{\text{CARS}}^{-\ell}(z, X_{i}) \leq \hat{t}_{\text{CARS}}^{-\ell}\right\}$$

$$\text{wFDR}(a) := \mathbb{E}\left[\frac{\sum_{i \in \mathscr{H}_{0}} a_{i}\, \mathbf{1}(H_{i} \text{ rejected})}{\sum_{i=1}^{m} a_{i}\, \mathbf{1}(H_{i} \text{ rejected})}\, \mathbf{1}\left(\sum_{i=1}^{m} a_{i}\, \mathbf{1}(H_{i} \text{ rejected}) > 0\right)\right]$$

Full text

Covariate powered cross-weighted multiple testing

Nikolaos Ignatiadis¹, Wolfgang Huber²

¹ Department of Statistics, Stanford University, USA

² European Molecular Biology Laboratory, Heidelberg, Germany

Summary

A fundamental task in the analysis of datasets with many variables is screening for associations. This can be cast as a multiple testing task, where the objective is achieving high detection power while controlling type I error. We consider $m$ hypothesis tests represented by pairs $((P_{i},X_{i}))_{1\leq i\leq m}$ of p-values $P_{i}$ and covariates $X_{i}$ , such that $P_{i}\perp X_{i}$ if $H_{i}$ is null. Here, we show how to use information potentially available in the covariates about heterogeneities among hypotheses to increase power compared to conventional procedures that only use the $P_{i}$ . To this end, we upgrade existing weighted multiple testing procedures through the Independent Hypothesis Weighting (IHW) framework to use data-driven weights that are calculated as a function of the covariates. Finite sample guarantees, e.g., false discovery rate (FDR) control, are derived from cross-weighting, a data-splitting approach that enables learning the weight-covariate function without overfitting as long as the hypotheses can be partitioned into independent folds, with arbitrary within-fold dependence. IHW has increased power compared to methods that do not use covariate information. A key implication of IHW is that hypothesis rejection in common multiple testing setups should not proceed according to the ranking of the p-values, but by an alternative ranking implied by the covariate-weighted p-values.

Keywords: Benjamini-Hochberg, Empirical Bayes, False Discovery Rate, Independent Hypothesis Weighting, Multiple Testing, p-value weighting

1 Introduction

Screening large datasets for interesting associations is a basic operation in statistical data analysis. A frequently taken approach is to enumerate all potential associations, set up a hypothesis test for each of them, summarize the results by the p-values $P_{i}$ , and select as discoveries all hypotheses with a small enough p-value; typically, this is a small fraction of all hypotheses. More formally, for some cutoff $\hat{t}$ :

$$\text{Reject hypothesis } i \iff P_{i} \leq \hat{t} \tag{1}$$

The choice of the cutoff $\hat{t}$ may be data-driven and is determined by a multiple testing procedure, such as those proposed by Bonferroni (1935) or Benjamini and Hochberg (1995), which compute a $\hat{t}$ that provides a defined level of protection against spurious discoveries. Common objectives are control of the family-wise error rate (FWER) or the false discovery rate ( $\operatorname{FDR}$ ).

These procedures operate solely on the list of p-values. Here, we consider situations in which, beyond the p-value $P_{i}$, side information represented by a covariate $X_{i}$ is available for each hypothesis. Such side information reflects heterogeneity among the tests and may, more or less directly, carry information about their different power or about the different prior probabilities of their null hypotheses being true. Suitable covariates are often apparent to domain scientists or to statisticians. We will see that procedures that take such side information into account often have higher power, in the sense that they make more discoveries at the same level of type-I error.

To illustrate, we use a high-throughput genetics dataset by Grubert et al. (2015), who aimed to discover associations between genetic polymorphisms (SNPs) in the human genome and the activity of genomic regions (H3K27ac peaks). The main idea of the analysis of these data, which is presented in more detail in Section 6, is to carry out a hypothesis test for each pair of SNP and region on the same chromosome. On Chromosomes 1 and 2, $N_{1}=645452$ and $N_{2}=699343$ SNPs were recorded, and H3K27ac levels were measured in $K_{1}=12193$ and $K_{2}=11232$ regions, which amounts to nearly 16 billion ( $N_{1}K_{1}+N_{2}K_{2}$ ) tests. Figure 1 illustrates how the p-value distributions differ as a function of the genomic distance between SNP and region. These differences are consistent with biological domain knowledge: associations across shorter distances are a-priori more plausible and empirically more frequent. Methods that are able to take into account this heterogeneity among the tests should be able to discover more associations at the same $\operatorname{FDR}$ , compared to (1), which ignores such side information.

1.1 Independent Hypothesis Weighting

In this paper, we present Independent Hypothesis Weighting (IHW), a flexible framework that can leverage hypothesis heterogeneity to improve power, while retaining finite-sample type-I error control. To explain the method, consider testing $m$ hypotheses $H_{1},\dotsc,H_{m}$ based on p-values $P_{1},\dotsc,P_{m}$ in the situation where we also have access to covariates $X_{1},\dotsc,X_{m}$ such that each $X_{i}$ is independent of the p-value $P_{i}$ if $H_{i}$ is a null hypothesis; the codomain of the $X_{i}$ can be any space (the same for all $i$ ). We propose to use a decision rule of the following form in place of (1):

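$$\text{Reject hypothesis } i \iff P_{i} \leq \hat{t} \cdot \widehat{W}^{-\ell}(X_{i}), \quad \text{where } i \in I_{\ell}, \tag{2}$$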

and $I_{\ell},\;\ell=1,\dotsc,K$ is a partition of the hypotheses into $K$ disjoint folds, such that the $(P_{i},X_{i})$ pairs are independent across folds.

There are two salient features to this rule: first, the decision boundary of hypothesis $i$ does not only depend on its p-value $P_{i}$ and the overall cutoff $\hat{t}$ , but also on the weight function $\widehat{W}^{-\ell}:{\mathcal{X}}\to\mathbb{R}_{\geq 0}$ of the covariate $X_{i}$ , where ${\mathcal{X}}$ is the codomain of the $X_{i}$ , and there is one such function for each fold $I_{\ell}$ . Second, the notation $\widehat{W}^{-\ell}$ is used to denote that each of these functions is learned from the data with the proviso that only p-values from the $K-1$ folds excluding $I_{\ell}$ are used. We call this proviso cross-weighting.

Conceptually, cross-weighting is related to cross-fitting (Schick, 1986), a method that has been successful in the fields of causal inference (Nie and Wager, 2020; Chernozhukov et al., 2017) and empirical Bayes (Ignatiadis and Wager, 2019) for estimation with high-dimensional nuisance parameters. Analogous to findings in the cross-fitting literature, we will show that naively using plug-in estimators to obtain the weight function tends to overfit, but cross-weighting salvages this at essentially no cost.

1.2 Related work

Previous work has shed light on optimal discovery thresholds in heterogeneous multiple testing. Similar to (2), these thresholds may take the form $\{P_{i}\leq\hat{t}w_{i}\}$, parametrized in terms of weights $w_{i}$ that are optimal for controlling the family-wise error rate (FWER) (Roeder and Wasserman, 2009; Peña et al., 2011; Dobriban et al., 2015) or the false discovery rate ($\operatorname{FDR}$) (Roquain and Van De Wiel, 2009; Durand, 2019). Furthermore, in the case of $\operatorname{FDR}$ control, optimal decision thresholds are known to take the form of contours of equal local false discovery rate (Cai and Sun, 2009; Cai et al., 2019; Efron, 2010; Ferkingstad et al., 2008; Ochoa et al., 2015; Ploner et al., 2006; Scott et al., 2015). Nevertheless, none of these optimal procedures is implementable, as they all depend on unknown properties of the data-generating mechanism. Instead, it has been proposed to apply a plug-in principle: the thresholds are estimated from the data at hand.

Such plug-in approaches, however, either have no guarantees of type-I error control or have them only in an asymptotic limit, as the number of tested hypotheses goes to infinity (Cai and Sun, 2009; Cai et al., 2019; Durand, 2019; Ignatiadis et al., 2016). More importantly, with finite samples, these plug-in methods often exceed the claimed type-I error; we will demonstrate this in Sections 2.1 and 5. This has motivated the provision of case-by-case, ad-hoc modifications, which, however, still do not provide finite-sample guarantees. For example, Durand (2017) recommends conducting a global test first and only proceeding with multiple testing if the global null hypothesis can be rejected. Cai, Sun, and Wang (2019) use a conservative modification of the density estimator employed by their (asymptotically valid) plug-in approach and show that this controls $\operatorname{FDR}$ in simulations with sparse signals. Furthermore, they suggest using the global screen of Durand (2017) first. Ignatiadis, Klaus, Zaugg, and Huber (2016) use cross-weighting (described above) as a heuristic to maintain $\operatorname{FDR}$ control in finite samples.

Dispensing with heuristics, several authors have recently provided procedures that are formally justified under full independence of the hypotheses: Li and Barber (2019) propose SABHA, a data-driven, weighted procedure for $\operatorname{FDR}$ control which directly confronts potential overfitting. The authors prove finite sample $\operatorname{FDR}$ control at an elevated level compared to the nominal $\alpha$ ; i.e., at $(1+\varepsilon)\alpha$ for some $\varepsilon>0$ . However, their guarantee only applies for their specific weighting scheme, which furthermore is suboptimal even under knowledge of the data-generating process (Lei and Fithian, 2018). Zhang, Xia, Zou, and Tse (2017) and Zhang, Xia, and Zou (2019) use a variant of hypothesis splitting to guarantee high-probability bounds on the false discovery proportion, however their proposals require a minimum number of rejections; otherwise an empty list of discoveries is declared. Closer to our approach is AdaPT (Lei and Fithian, 2018), which uses covariate information to learn covariate-modulated decision boundaries and provides finite sample $\operatorname{FDR}$ guarantees. Its construction is based on a variant of the optimal stopping theorem developed by Barber and Candès (2015), which provides the analyst with considerable flexibility in learning these boundaries from the data, while masking information that could lead to overfitting. However, AdaPT has no theoretical guarantees outside of full p-value independence, is tied to $\operatorname{FDR}$ control and suffers from a large variance of the false discovery proportion (Korthauer et al., 2019).

Here we propose a general and flexible framework that goes beyond these previous approaches. We formalize hypothesis weighting with weights as a function of covariates $X_{i}$ and demonstrate that such weights can be learned from the data without overfitting (i.e., losing type-I error control) if we use cross-weighting as in (2). Hence we build upon the hypothesis-splitting idea of Ignatiadis et al. (2016) and demonstrate that it can be used not merely as a heuristic, but instead as a theoretically grounded and principled way of conducting multiple testing with side-information that has far reaching applications. The Independent Hypothesis Weighting method provides finite sample guarantees for multiple type-I measures, such as the $\operatorname{FDR}$ , the FWER and the $k$ -FWER, unlike previous proposals that are tied to the $\operatorname{FDR}$ . IHW provides a clean way to deal with dependent settings, as it allows arbitrary dependence within folds. Finally, IHW provides the researcher with flexibility in choosing any weighting scheme that would be appropriate for the data at hand, but we also recommend a default scheme and provide a software implementation in the form of an R package.

1.3 Outline

In Section 2, we provide an overview of weighted multiple testing and explain our proposal in the context of $\operatorname{FDR}$ control under full independence of hypothesis tests. Section 3 extends the results to dependence, and to control of the $k$ -FWER. Section 4 describes a framework for learning weighting rules. Section 5 provides simulation results, and Section 6 presents the high-throughput biology example from Figure 1. Section 7 discusses further relationships to previous work, and Section 8 concludes with a discussion.

2 Weighted and cross-weighted multiple testing

A multiple testing procedure operates on data for $m$ hypotheses $H_{1},\dotsc,H_{m}$ and declares $R$ hypotheses as rejections (“discoveries”). Among these, $V$ will be nulls, i.e., the procedure will commit $V$ type I errors. The goal is to make as many discoveries as possible while retaining (stochastic) guarantees that $V$ is acceptable. Concretely, one possible objective is to control the family-wise error rate, defined as FWER $\coloneqq\mathbb{P}[V\geq 1]$ , or the $k$ -FWER $\coloneqq\mathbb{P}[V\geq k]$ . In exploratory situations, a typically less stringent objective is to control the false discovery rate ( $\operatorname{FDR}$ ), i.e., the expectation of the false discovery proportion (FDP), namely $\operatorname{FDR}\coloneqq\mathbb{E}[\text{FDP}]\coloneqq\mathbb{E}\left[\frac{V}{R\lor 1}\right]$ (Benjamini and Hochberg, 1995).
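As a small illustration of these definitions, the following sketch computes the realized false discovery proportion $V/(R\lor 1)$ for one rejection set. The function name and the toy inputs are hypothetical and only serve the example; the $\operatorname{FDR}$ itself is the expectation of this quantity over repeated data sets.

```python
def false_discovery_proportion(rejected, is_null):
    """Realized FDP = V / max(R, 1) for a single rejection set."""
    r = sum(1 for rej in rejected if rej)  # R: number of rejections
    v = sum(1 for rej, null in zip(rejected, is_null) if rej and null)  # V: rejected nulls
    return v / max(r, 1)


# Toy example (made-up labels): three rejections, one of which is a true null.
rejected = [True, True, False, True, False]
is_null = [False, True, True, False, True]
fdp = false_discovery_proportion(rejected, is_null)
```

In practice the null labels are unknown, so this quantity can only be evaluated in simulations.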

Typically the data for each hypothesis are summarized into a single number, the p-value $P_{i}$ , and a rule of form (1) is applied. However, in the presence of heterogeneity across tests, it might be suboptimal to use such a decision rule that treats all hypotheses exchangeably. Weighted multiple testing (Genovese, Roeder, and Wasserman, 2006) is a flexible way of encoding prior information and differentially prioritizing the hypotheses. Multiple testing weights are defined as non-negative numbers $w_{i}$ such that $\sum_{i=1}^{m}w_{i}/m=1$ . Then, a weighted multiple testing decision rule takes the following form:

$$\text{Reject hypothesis } i \iff P_{i} \leq \min\{w_{i} \cdot \hat{t},\, \tau\} \tag{3}$$

Here $\tau\in(0,1]$ is a fixed number, of which more below, and as in (1), the cutoff $\hat{t}$ may be data-driven. A larger $w_{i}$ implies that it is easier to reject hypothesis $i$ . We first review two procedures for choosing $\hat{t}$ .

Definition 1 (Weighted $k$ -Bonferroni).

The $k$ -FWER can be controlled at level $\alpha\in(0,1)$ by applying the weighted $k$ -Bonferroni procedure (Romano and Wolf, 2010), which takes the form (3) with deterministic cutoff $\hat{t}=k\alpha/m$ and $\tau=1$ . The case $k=1$ is the weighted Bonferroni procedure proposed by Genovese et al. (2006).
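A minimal sketch of Definition 1, assuming fixed (deterministic) weights that average to one: each hypothesis is rejected when $P_{i}\leq\min\{w_{i}\cdot k\alpha/m,\,1\}$. The function name and the example numbers are hypothetical.

```python
def weighted_k_bonferroni(pvalues, weights, alpha, k=1):
    """Weighted k-Bonferroni: rule (3) with cutoff t_hat = k * alpha / m and tau = 1."""
    m = len(pvalues)
    if len(weights) != m:
        raise ValueError("pvalues and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - m) > 1e-9 * m:
        raise ValueError("weights must be non-negative and average to 1")
    t_hat = k * alpha / m  # deterministic cutoff
    return [p <= min(w * t_hat, 1.0) for p, w in zip(pvalues, weights)]


# Toy example: the first hypothesis receives twice the average weight.
pvalues = [0.001, 0.02, 0.04, 0.3]
weights = [2.0, 1.0, 0.5, 0.5]
rejected = weighted_k_bonferroni(pvalues, weights, alpha=0.1, k=1)
```

Here the per-hypothesis thresholds are 0.05, 0.025, 0.0125 and 0.0125, so only the first two hypotheses are rejected.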

Definition 2 ( $\tau$ -censored, weighted Benjamini-Hochberg).

The $\operatorname{FDR}$ can be controlled at level $\alpha\in(0,1)$ by applying the $\tau$ -censored, weighted Benjamini-Hochberg procedure, which takes the form (3) with $\tau\in(0,1]$ fixed and data-driven cutoff $\hat{t}$ specified as:

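$$\hat{t} = \frac{\alpha \hat{k}}{m}, \quad \hat{k} = \max\left\{k \in \mathbb{N}_{\geq 0} \;\middle|\; P_{i} \leq \left(\frac{\alpha w_{i} k}{m}\right) \land \tau \text{ for at least } k \text{ p-values}\right\}$$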

The weighted Benjamini-Hochberg (BH) procedure of Genovese, Roeder, and Wasserman (2006) is the special case $\tau=1$ . The more general form was proposed by Li and Barber (2019) and will be employed for our theoretical guarantees in the following. The number of rejections of $\tau$ -censored BH is non-decreasing in $\tau$ , so that a procedure with smaller $\tau$ will never make more discoveries. However, for large $\tau$ , say $\tau\geq 0.5$ , the discovery set will be equal to that with $\tau=1$ , as long as weighted BH with $\tau=1$ did not reject a p-value $\geq 0.5$ .
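The cutoff of Definition 2 can be transcribed almost literally. The sketch below scans all candidate values of $k$, which costs $O(m^{2})$ and is chosen for readability rather than speed; the function name and the toy p-values are hypothetical.

```python
def tau_censored_weighted_bh(pvalues, weights, alpha, tau=1.0):
    """Return (t_hat, rejections) for tau-censored, weighted Benjamini-Hochberg."""
    m = len(pvalues)

    def threshold(k, w):
        # Per-hypothesis threshold (alpha * w * k / m) capped at tau.
        return min(w * (alpha * k / m), tau)

    def count_at(k):
        return sum(p <= threshold(k, w) for p, w in zip(pvalues, weights))

    # k_hat: largest k such that at least k p-values fall below their threshold.
    # k = 0 always qualifies, so the maximum is well defined.
    k_hat = max(k for k in range(m + 1) if count_at(k) >= k)
    t_hat = alpha * k_hat / m
    rejected = [p <= threshold(k_hat, w) for p, w in zip(pvalues, weights)]
    return t_hat, rejected


pvalues = [0.001, 0.012, 0.03, 0.2, 0.6]
weights = [1.0] * 5  # uniform weights recover ordinary BH when tau = 1
t_hat, rejected = tau_censored_weighted_bh(pvalues, weights, alpha=0.1)
t_cens, rejected_cens = tau_censored_weighted_bh(pvalues, weights, alpha=0.1, tau=0.01)
```

With $\tau=1$ the first three hypotheses are rejected; with the (deliberately extreme) choice $\tau=0.01$ only the first one is, in line with the monotonicity in $\tau$ noted above.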

In decision rule (3), the weights $w_{i}$ are denoted by lower-case letters. This reflects the fact that existing results treat these weights as deterministic—as prior knowledge that a researcher has to specify before seeing the p-values (Genovese, Roeder, and Wasserman, 2006; Blanchard and Roquain, 2008; Habiger, 2017; Roquain and Van De Wiel, 2009; Ramdas, Barber, Wainwright, and Jordan, 2019). The main goal of this work is to let the weights depend on the data at hand—they are thus denoted as random variables $W_{i}$ —while providing finite-sample guarantees. Such data-dependent weighting has been recognized as an important open problem (Benjamini, 2008; Roquain and Van De Wiel, 2009) that is essential for dealing with large scale multiple testing. To the best of our knowledge, no solution has been provided so far. Existing proposals for data-driven weighting either explicitly account for overfitting by establishing $\operatorname{FDR}$ control at an elevated level compared to nominal (Li and Barber, 2019) or only provide guarantees in the asymptotic limit (Hu, Zhao, and Zhou, 2010; Ignatiadis, Klaus, Zaugg, and Huber, 2016; Durand, 2019; Zhao and Zhang, 2014; Wang, 2018; Roeder, Devlin, and Wasserman, 2007).

2.1 Example: Group Benjamini-Hochberg with cross-weighting

We first provide a rudimentary version of our method that is applicable to situations with categorical (or suitably categorized) covariates $X_{i}\in\left\{1,\dotsc,G\right\}$. This setting is called multiple testing with groups; each group consists of hypotheses whose covariate $X_{i}$ takes on the same value. Our method builds upon the Group Benjamini-Hochberg (GBH) method proposed by Hu et al. (2010) to improve power compared to BH by using the group structure. GBH consists of first estimating the proportion of null hypotheses $\pi_{0}(g)$ in each group by $\widehat{\pi}_{0}(g)$, weighting each hypothesis proportionally to $(1-\widehat{\pi}_{0}(g))/\widehat{\pi}_{0}(g)$ and finally applying the weighted BH procedure. Algorithm 1 describes the method in detail¹, using the estimator of Storey et al. (2004) applied to the grouped setting, analogous to Sankaran and Holmes (2014).

¹ A simplification is that in Algorithm 1, the weights are specified so that $\sum_{i}W_{i}=m$. In contrast, in the original GBH paper (Hu et al., 2010), the weights are less conservative and satisfy $\sum_{i}\widehat{\pi}_{0}(X_{i})W_{i}=m$. This inflation ensures that in the oracle case of known $\pi_{0}(\cdot)$, the $\operatorname{FDR}$ of GBH is exactly equal to $\alpha$. We return to the issue of null proportion adaptivity in Section 2.3 and Theorem 2; in the case of GBH it may be regained by employing the optional step in Algorithm 1, cf. Ramdas et al. (2019).

Hu, Zhao, and Zhou (2010) provide the following guarantees for GBH: in the oracle situation where the $\pi_{0}(g)$ are known, GBH controls the $\operatorname{FDR}$. In the asymptotic limit where the number of groups is fixed, the number of hypotheses in each group grows to infinity and $\operatorname{plim}_{m\to\infty}\widehat{\pi}_{0}(g)\geq\pi_{0}(g)$ for all $g$, GBH controls the $\operatorname{FDR}$. Furthermore, sufficient conditions are given so that asymptotically GBH is at least as powerful as BH. The asymptotics, however, do not necessarily apply for finite $m/G$, the number of hypotheses per group, as shown by simulations summarized in Figure 2. Intuitively, the reason is that some groups will randomly be enriched for smaller-than-expected p-values (and some for larger-than-expected ones), and the method further up-weights the former set of null p-values.
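The GBH weighting step can be sketched as follows, using the Storey-type estimate $\widehat{\pi}_{0}(g)=\frac{1+\sum_{i:X_{i}=g}\mathbf{1}(P_{i}>\tau)}{|\{i:X_{i}=g\}|(1-\tau)}\land 1$ and weights proportional to $(1-\widehat{\pi}_{0})/\widehat{\pi}_{0}$, normalized to sum to $m$. This is an illustration under assumptions, not a transcription of Algorithm 1: the function names are hypothetical, and the fallback to uniform weights when every estimate equals 1 is a choice made here. Note that each hypothesis's own p-value enters its own weight, which is the source of the overfitting described above.

```python
from collections import Counter


def storey_pi0_by_group(pvalues, groups, tau=0.5):
    """Storey-type null proportion estimate per group, capped at 1 (requires tau < 1)."""
    size = Counter(groups)
    above = Counter(g for p, g in zip(pvalues, groups) if p > tau)
    return {g: min(1.0, (1 + above[g]) / (size[g] * (1 - tau))) for g in size}


def gbh_weights(pvalues, groups, tau=0.5):
    """Weights proportional to (1 - pi0_hat) / pi0_hat, normalized to sum to m."""
    m = len(pvalues)
    pi0 = storey_pi0_by_group(pvalues, groups, tau)
    raw = [(1 - pi0[g]) / pi0[g] for g in groups]
    total = sum(raw)
    if total == 0:
        return [1.0] * m  # assumption: fall back to uniform weights (plain BH)
    return [m * r / total for r in raw]


# Toy data: group "a" is enriched for small p-values, group "b" looks uniform.
p_a = [0.001, 0.004, 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.7, 0.9]
p_b = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
pvalues = p_a + p_b
groups = ["a"] * 10 + ["b"] * 10
pi0_hat = storey_pi0_by_group(pvalues, groups)
weights = gbh_weights(pvalues, groups)
```

In this toy example $\widehat{\pi}_{0}(\text{a})=0.6$ and $\widehat{\pi}_{0}(\text{b})=1$, so the whole weight budget goes to group a.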

Our solution is to use cross-weighting. We assign each hypothesis to one of $K$ folds – randomly and independently of its p-value $P_{i}$ and covariate $X_{i}$ – and then calculate weights out-of-fold, as elaborated in Algorithm 2. With cross-weighting, a null p-value that is small by chance cannot lead to an upweighting of itself. $\operatorname{FDR}$ control is restored, as shown in Figure 2. On the other hand, if the weights are determined not just by noise, but by true signal, then IHW-GBH, just as GBH, has increased power compared to BH, as we show in a more comprehensive simulation study in Section 5.1. If $G$ furthermore remains fixed as $m\to\infty$ , then GBH and IHW-GBH are asymptotically equivalent (Corollary 2).
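A minimal sketch of the cross-weighted variant, under the same caveats: it is not a transcription of Algorithm 2. The out-of-fold estimate $\widehat{\pi}_{0}^{-\ell}(g)$ uses only hypotheses outside fold $I_{\ell}$, and weights are normalized within each fold. The helper names, the fallback $\widehat{\pi}_{0}^{-\ell}(g)=1$ for groups with no out-of-fold members, and the uniform fallback when a fold's raw weights sum to zero are assumptions made for this example.

```python
import random
from collections import Counter


def assign_folds(m, n_folds, seed=0):
    """Randomly assign m hypotheses to equally sized folds, ignoring the data."""
    folds = [i % n_folds for i in range(m)]
    random.Random(seed).shuffle(folds)
    return folds


def ihw_gbh_weights(pvalues, groups, folds, tau=0.5):
    """Out-of-fold GBH weights, normalized to sum to |I_l| within each fold."""
    m = len(pvalues)
    weights = [1.0] * m  # fallback value when a fold's raw weights sum to zero
    for fold in sorted(set(folds)):
        inside = [i for i in range(m) if folds[i] == fold]
        outside = [i for i in range(m) if folds[i] != fold]
        size = Counter(groups[i] for i in outside)
        above = Counter(groups[i] for i in outside if pvalues[i] > tau)

        def pi0(g):
            if size[g] == 0:
                return 1.0  # assumption: no out-of-fold data, treat group as all null
            return min(1.0, (1 + above[g]) / (size[g] * (1 - tau)))

        raw = [(1 - pi0(groups[i])) / pi0(groups[i]) for i in inside]
        total = sum(raw)
        if total > 0:
            for i, r in zip(inside, raw):
                weights[i] = len(inside) * r / total
    return weights


rng = random.Random(1)
m = 60
groups = [i % 3 for i in range(m)]
# Toy data: group 0 tends to have smaller p-values than groups 1 and 2.
pvalues = [rng.random() ** (3 if g == 0 else 1) for g in groups]
folds = assign_folds(m, n_folds=3, seed=0)
weights = ihw_gbh_weights(pvalues, groups, folds)
```

Because the weights in a fold never see that fold's p-values, replacing all p-values in one fold leaves that fold's weights unchanged, which is the property that prevents a null p-value from up-weighting itself.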

2.2 IHW: A family of multiple testing procedures

We now generalize the IHW-GBH procedure beyond categorical covariates, the GBH weighting scheme and the weighted BH procedure (Def. 2): we seek a general way of applying weighted multiple testing methods with data-driven weights $W_{i}$ when covariates $X_{i}$ (not necessarily categorical) are available. Our approach consists of two ingredients: first, we only consider weights that are functions of the covariates $X_{i}$, i.e., $W_{i}=W(X_{i})$. The second ingredient is cross-weighting: we partition our $m$ hypotheses into $K$ disjoint folds² $I_{1},\dotsc,I_{K}$. Then, in determining the weight $W_{i}$ for hypothesis $i\in I_{\ell}$, we set $W_{i}\propto\widehat{W}^{-\ell}(X_{i})$, where the weight function $\widehat{W}^{-\ell}$ is learned from data outside fold $I_{\ell}$ and the weights are normalized, typically such that $\sum_{i\in I_{\ell}}W_{i}=\left\lvert I_{\ell}\right\rvert$. This overall framework is summarized in Algorithm 3.

² Our baseline proposal is to construct the partition by randomly splitting the set $[m]=\{1,\ldots,m\}$ into $K$ equally sized folds (the default in the IHW software package is $K=5$). Alternatively, domain-specific knowledge can be used to derive folds that minimize across-fold dependence, cf. the example in Section 6.

In Sections 2.3 and 3.2 we provide formal guarantees of finite-sample type-I error control for the IHW algorithm, under the condition that the weighted multiple testing procedure is weighted BH with $\tau$ -censoring or weighted $k$ -Bonferroni. We will discuss how to learn weight functions for general (non-categorical) covariates in Section 4.
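The framework of this subsection can be sketched generically: any learner that maps out-of-fold (p-value, covariate) data to a weight function can be plugged in, and the resulting weights are normalized within each fold. Everything named below (`ihw_weights`, the toy learner `learn_step_weights`, the simulated data) is hypothetical and is not the interface of the IHW R package; the learner here sees only out-of-fold covariates, which is one simple way of respecting cross-weighting.

```python
import random


def ihw_weights(pvalues, covariates, folds, learn_weight_function):
    """Cross-weighting: weights in fold l use p-values from the other folds only."""
    m = len(pvalues)
    weights = [1.0] * m  # used when a fold's learned weights sum to zero
    for fold in sorted(set(folds)):
        inside = [i for i in range(m) if folds[i] == fold]
        outside = [i for i in range(m) if folds[i] != fold]
        # Learn the weight function from out-of-fold data only.
        weight_fn = learn_weight_function(
            [pvalues[i] for i in outside], [covariates[i] for i in outside]
        )
        raw = [max(weight_fn(covariates[i]), 0.0) for i in inside]
        total = sum(raw)
        if total > 0:
            for i, r in zip(inside, raw):
                weights[i] = len(inside) * r / total  # fold budget |I_l|
    return weights


def learn_step_weights(p_out, x_out, cut=0.5, small=0.05):
    """Toy learner: share of small p-values on each side of `cut`."""
    def share(below):
        ps = [p for p, x in zip(p_out, x_out) if (x < cut) == below]
        return sum(p <= small for p in ps) / max(len(ps), 1)

    low, high = share(True), share(False)
    return lambda x: low if x < cut else high


def weighted_bonferroni(pvalues, weights, alpha):
    m = len(pvalues)
    return [p <= min(w * alpha / m, 1.0) for p, w in zip(pvalues, weights)]


rng = random.Random(0)
m = 200
covariates = [rng.random() for _ in range(m)]
# Toy data: p-values tend to be smaller when the covariate is below 0.5.
pvalues = [rng.random() ** (4 if x < 0.5 else 1) for x in covariates]
folds = [rng.randrange(5) for _ in range(m)]  # assigned independently of the data
weights = ihw_weights(pvalues, covariates, folds, learn_step_weights)
rejected = weighted_bonferroni(pvalues, weights, alpha=0.1)
```

Weighted Bonferroni is used in the last step only to keep the example short; the $\tau$-censored weighted BH procedure of Definition 2 could be substituted.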

2.3 Finite-sample FDR control with cross-weighting under independence

To derive formal guarantees for Algorithm 3, we set out with a sufficient distributional assumption that contains several independence relationships. In Section 3, we will consider more general dependence structures.

Assumption 1 (Distributional setting under independence).

Let $(P_{i},X_{i})$, $i\in[m]$, be (p-value, covariate) pairs, where $[m]=\left\{1,\dotsc,m\right\}$, and let $\mathscr{H}_{0}\subset[m]$ be the index set of null hypotheses. We assume that:

(a1) The null pairs $((P_{i},X_{i}))_{i\in\mathscr{H}_{0}}$ are jointly independent.

(a2) The null pairs $((P_{i},X_{i}))_{i\in\mathscr{H}_{0}}$ are independent of the alternative pairs $((P_{i},X_{i}))_{i\notin\mathscr{H}_{0}}$.

(b) For $i\in\mathscr{H}_{0}$, it holds that $P_{i}$ is independent of $X_{i}$.

(c) For $i\in\mathscr{H}_{0}$, $P_{i}$ is super-uniform, i.e., $\mathbb{P}[P_{i}\leq t]\leq t$ for all $t\in[0,1]$.

To parse this assumption, let us first consider two important special cases: (i) marginalizing over the $X_{i}$, so that we only have access to p-values, and (ii) deterministic $X_{i}$. In both cases, Assumption 1 reduces to (a1’) $(P_{i})_{i\in\mathscr{H}_{0}}$ are jointly independent, (a2’) they are independent of the alternative p-values $(P_{i})_{i\notin\mathscr{H}_{0}}$, and (c). Of these, (a1’) and (a2’), while admittedly strong, are a typical starting point for proving finite-sample results for multiple testing procedures, even in the absence of covariates: Liang and Nettleton (2012) call this the null independence assumption. In the setting with covariates, these are also assumptions made by Li and Barber (2019, Theorem 1) and Lei and Fithian (2018, Theorem 1). Cai et al. (2019) also assume full independence of hypotheses. The super-uniformity (also called conservativeness) of the null p-values, (c), is also a standard assumption in multiple testing (Blanchard and Roquain, 2008). Li and Barber (2019) make a stronger assumption than (c).

The case of deterministic $X_{i}$ is important since, for example, the genomic distance between SNPs and peaks in our motivating example in Figure 1 is a deterministic covariate. See Supplement S6.1 for additional examples. Nevertheless, we formulate results for the more general case to also handle situations in which the covariate $X_{i}$ is calculated from the same data that are used to calculate the p-value $P_{i}$ . For instance, Cai et al. (2019) consider simultaneous two-sample testing, and construct an ancillary $X_{i}$ that is independent of the $t$ -statistic (and thus also the p-value) under the null hypothesis; we revisit their construction in the simulation study of Section 5.3. Assumption 1(b) is crucial in ensuring that knowledge of $X_{i}$ does not influence the null distribution. Cai et al. (2019) call it a “principle for information extraction”; cf. Bourgon et al. (2010); Boca and Leek (2018) for further elaborations on this assumption and Supplement S6.2 for more examples of random covariates.

Next, we state two specifications on the weighting mechanism used. Unlike Assumption 1, the applicability of which depends on the generally unknown data-generating mechanism, these are entirely under the control of the analyst.

Specification 1 (Honest weighting).

Consider a partition of $[m]$ into $K$ folds $I_{1},\dotsc,I_{K}$ , i.e., $\bigcup_{\ell}I_{\ell}=[m]$ and $\left(I_{\ell}\right)_{\ell}$ are disjoint, and define $I_{\ell}^{c}=[m]\setminus I_{\ell}$ . The partition is assigned independently of $((P_{i},X_{i}))_{i\in[m]}$ . Then, the data-driven weights $(W_{i})_{i\in[m]}$ are honest with respect to the partition $I_{1},\dotsc,I_{K}$ if:

(a) $W_{i}$ is a function of only $(P_{j})_{j\in I_{\ell}^{c}}$ and $(X_{j})_{j\in[m]}$ for all $\ell\in[K]$ and all $i\in I_{\ell}$ .

(b) The weights in fold $I_{\ell}$ average to $1$ , i.e., $\sum_{i\in I_{\ell}}W_{i}=|I_{\ell}|$ for all $\ell\in[K]$ .

(c) $W_{i}\geq 0$ for all $i$ .

We call this specification “honest weighting”, borrowing terminology from the honest tree construction of Wager and Athey (2018), who call a regression tree honest if the set of observations used to determine its structure is disjoint from the set of observations used for prediction in the leaves. Specification 1 encapsulates our idea of cross-weighting. Informally, it says that the weight $W_{i}$ of hypothesis $i$ should not depend on its p-value $P_{i}$ . As already shown in Figure 2, without honesty it is easy to overfit the data. Part (b) of the definition encapsulates a fixed weighting budget (Genovese, Roeder, and Wasserman, 2006). Instead of merely requiring $\sum_{i=1}^{m}W_{i}=m$ , the budget is restricted within each fold, to prevent information leakage across folds through the total magnitude of the weights.
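To make Specification 1 concrete, the following is a minimal sketch of honest weights for a binned covariate. The weighting rule itself (the out-of-fold fraction of p-values below a hypothetical `cutoff`) is an illustrative assumption and not the optimization-based rule developed later in the paper; the function name `honest_weights` and its arguments are likewise made up for this sketch. The toy rule is not $\tau$ -censored.

```python
import numpy as np

def honest_weights(pvalues, bins, folds, cutoff=0.05):
    """Cross-weighting sketch: weights for fold l use p-values outside fold l only.

    `bins` are integer covariate bins; `cutoff` is a hypothetical tuning constant.
    """
    pvalues = np.asarray(pvalues, dtype=float)
    bins = np.asarray(bins)
    folds = np.asarray(folds)
    weights = np.ones_like(pvalues)
    for fold in np.unique(folds):
        inside = folds == fold
        raw = np.zeros(pvalues.size)
        for b in np.unique(bins):
            # Out-of-fold fraction of small p-values in this covariate bin.
            train = (~inside) & (bins == b)
            raw[bins == b] = np.mean(pvalues[train] <= cutoff) if train.any() else 0.0
        raw_in = raw[inside]
        total = raw_in.sum()
        # Normalize so the fold's weights sum to |I_l|; fall back to uniform weights.
        weights[inside] = raw_in * inside.sum() / total if total > 0 else 1.0
    return weights
```

Overwriting the p-values inside a fold leaves the weights of that fold unchanged, which is the informal content of part (a); parts (b) and (c) hold because of the per-fold normalization of non-negative raw scores.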

Honesty suffices to guarantee type-I error control in some cases, for example for the weighted $k$ -Bonferroni procedure (Section 3.2 and Theorem 3). However, for the $\tau$ -censored, weighted BH procedure with data-driven weights, we require one further condition on the weights, which was proposed by Li and Barber (2019) and states that the magnitude of p-values less than or equal to $\tau$ must be concealed from the weighting algorithm.

Specification 2 ( $\tau$ -censored weighting).

The weights $W_{i}$ are called $\tau$ -censored for $\tau\in(0,1]$ if they depend on the p-values $(P_{i})_{i\in[m]}$ only through $(P_{i}\;\mathbf{1}(P_{i}>\tau))_{i\in[m]}$ .
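As a small illustration of Specification 2, the masking $P_{i}\mapsto P_{i}\,\mathbf{1}(P_{i}>\tau)$ can be sketched as follows; the helper name `censor_pvalues` is hypothetical.

```python
import numpy as np

def censor_pvalues(pvalues, tau):
    """Return P_i * 1(P_i > tau): p-values at or below tau are masked to 0."""
    pvalues = np.asarray(pvalues, dtype=float)
    return np.where(pvalues > tau, pvalues, 0.0)
```

Any weight-learning routine that only ever sees the output of this function (together with the covariates) is $\tau$ -censored by construction.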

We are ready to state the first result:

Theorem 1 (IHW-BH controls the $\operatorname{FDR}$ under honesty and $\tau$ -censored weighting).

Let $((P_{i},X_{i}))_{i\in[m]}$ satisfy Assumption 1. Furthermore assume that we construct data-driven weights $W_{i}$ that are honest (Specification 1) and $\tau$ -censored (Specification 2) for some $\tau\in(0,1]$ . Then the $\tau$ -censored, weighted BH procedure (Definition 2) with p-values $P_{i}$ and weights $W_{i}$ controls the $\operatorname{FDR}$ at the nominal level $\alpha$ .

The intuition for this theorem is the following: in the weighted BH algorithm (Definition 2), the rejection threshold of a null p-value $P_{i}$ depends on its weight $W_{i}$ and the total number of rejections $R$ . Assumption 1 and honest weighting (Specification 1) ensure that a null p-value cannot influence its own weight. However, tests can coordinate adversarially by weighting each other in a way that increases $R$ and potentially leads to their own rejection. Supplement S1.3 provides an example of how such adversarial coordination can break $\operatorname{FDR}$ -control guarantees, even though honesty holds. However, under $\tau$ -censoring, the only p-values that can coordinate through weight assignment are the ones $>\tau$ . These p-values are also excluded from being rejected and so $\operatorname{FDR}$ control is restored.
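The interplay between weights, the rejection count $R$ and the censoring level can be seen in a short sketch of the procedure. Definition 2 lies outside this section, so the rule below is an assumption: hypothesis $i$ is rejected when $P_{i}\leq\min(\tau,\alpha kW_{i}/m)$ , where $k$ is the largest count for which at least $k$ hypotheses satisfy this inequality. The function name is hypothetical.

```python
import numpy as np

def censored_weighted_bh(pvalues, weights, alpha, tau=1.0):
    """Boolean rejection mask; assumed rule: P_i <= min(tau, alpha * k * W_i / m)."""
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    m = p.size
    for k in range(m, 0, -1):  # step-up: try the largest rejection count first
        reject = p <= np.minimum(tau, alpha * k * w / m)
        if reject.sum() >= k:
            return reject
    return np.zeros(m, dtype=bool)
```

With unit weights and $\tau=1$ this reduces to the ordinary BH procedure. The quadratic loop is chosen for readability rather than speed.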

As a corollary, we get the following result:

Corollary 1 (IHW-GBH controls the $\operatorname{FDR}$ ).

If $((P_{i},X_{i}))_{i\in[m]}$ satisfies Assumption 1, then the IHW-GBH procedure (without the null proportion adaptivity step) described in Algorithm 2 controls the $\operatorname{FDR}$ at the nominal level $\alpha$ .

Proof.

By construction, the weights $W_{i}$ of IHW-GBH are honest and $\tau$ -censored. ∎

A shortcoming of IHW-BH with weights that satisfy $\sum_{i=1}^{m}W_{i}=m$ is that FDR is controlled at $\pi_{0,W}^{\prime}\alpha\leq\alpha$ , where $\pi^{\prime}_{0,W}\coloneqq(\sum_{i\in\mathscr{H}_{0}}\mathbb{E}[W_{i}])/m$ and IHW-BH can thus be needlessly conservative. Motivated by null-proportion adaptive methods for unweighted BH (Storey, Taylor, and Siegmund, 2004) and weighted BH with deterministic weights (Habiger, 2017; Ramdas, Barber, Wainwright, and Jordan, 2019), we estimate $\pi^{\prime}_{0,W}$ within fold $I_{\ell}$ by

[TABLE]

and use these estimates to inflate the weights $W_{i}$ . We have the following result:

Theorem 2 (IHW-Storey controls the $\operatorname{FDR}$ under honesty and $\tau$ -censored weighting).

Assume that all assumptions of Theorem 1 are satisfied. Next let $\hat{\pi}_{0,W,\ell}^{\prime}$ be defined as in (5) and define null-proportion adaptive weights as $W_{i}^{\text{Storey}}:=W_{i}\,/\,\hat{\pi}_{0,W,\ell}^{\prime}$ for $i\in I_{\ell}$ . Then the $\tau$ -censored, weighted BH procedure (Definition 2) with p-values $P_{i}$ and weights $W_{i}^{\text{Storey}}$ controls the $\operatorname{FDR}$ at the nominal level $\alpha$ .

A direct application of this theorem is that the statement of Corollary 1 also holds for the null-proportion adaptive version of IHW-GBH (cf. Algorithm 2). This provides power gains in situations where the null proportion is substantially smaller than 1 at least in some regions of the covariate space, since then it will be the case that $\sum W_{i}^{\text{Storey}}>\sum W_{i}$ , thus increasing the total weight budget.

2.4 FDR asymptotics with cross-weighting under independence

While the primary focus of this paper is on finite-sample guarantees and performance in simulations, in this section we provide asymptotic results for $m\to\infty$ that serve three purposes: first, they demonstrate how cross-weighting enables a streamlined proof of asymptotic $\operatorname{FDR}$ control under standard assumptions on $(P_{i},X_{i})$ while dispensing with requirements on the class of weight functions. Second, they show that in situations in which there is sufficient signal and the data-driven weight function has approached its asymptotic limit, no power is lost by using cross-weighting. Third, they show that in an asymptotic regime, IHW-BH controls the $\operatorname{FDR}$ without a need for $\tau$ -censoring (Specification 2). On the other hand, our aim here is not to provide the sharpest asymptotics under the weakest conditions, but only to provide these conceptual insights.

We develop the asymptotics using the following Bayesian model (Ferkingstad et al., 2008; Lei and Fithian, 2018; Deb et al., 2021), which we call the conditional two-groups model and which extends the two-groups model of Storey (2003) and Efron, Tibshirani, Storey, and Tusher (2001):

[TABLE]

We also define $F(t\mid X_{i}=x)=\pi_{0}(x)t+(1-\pi_{0}(x))F_{\text{alt}}(t\mid X_{i}=x)$ : the distribution of $P_{i}$ given $X_{i}=x$ . The distribution $F(t\mid X_{i}=x)$ can vary from test to test because of varying null probabilities $\pi_{0}(x)$ and/or alternative distributions $F_{\text{alt}}(\cdot\mid X_{i}=x)$ , depending on the value of its covariate $X_{i}$ .
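For intuition, data from a conditional two-groups model can be simulated as in the sketch below. All concrete choices are assumptions made for illustration only: uniform covariates, the linear form of $\pi_{0}(x)$ , and a Beta(0.25, 1) alternative that, for simplicity, does not vary with $x$ .

```python
import numpy as np

def simulate_two_groups(m, seed=0):
    """Draw (P_i, X_i, H_i) from a toy conditional two-groups model."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=m)                 # covariates (assumed uniform on [0, 1])
    pi0 = 0.95 - 0.4 * x                    # hypothetical null probability pi_0(x)
    is_alt = rng.uniform(size=m) > pi0      # H_i = 1 with probability 1 - pi_0(x)
    p = rng.uniform(size=m)                 # null p-values: uniform, independent of x
    # Assumed alternative distribution F_alt: Beta(0.25, 1), concentrated near 0.
    p[is_alt] = rng.beta(0.25, 1.0, size=is_alt.sum())
    return p, x, is_alt
```

In this toy setup hypotheses with large $X_{i}$ are more often non-null, so a weight function that increases in $x$ would be expected to gain power.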

Since $m$ is a changing parameter in the asymptotics, it is useful to formalize what “learning a weight function” entails and use more involved notation:

Specification 3 (Weighting scheme).

A weighting scheme $\widehat{W}^{(\cdot)}$ is a mechanism that, for any finite subset $I\subset\mathbb{N}_{>0}$ , uses samples $((P_{i},X_{i}))_{i\in I}$ to learn a weight function $\widehat{W}^{(I)}:\mathcal{X}\to\mathbb{R}_{\geq 0}$ . We assume that the learned weight function $\widehat{W}^{(I)}$ does not excessively upweight individual hypotheses, i.e., there exists $\Gamma<\infty$ such that

[TABLE]

Given $m$ independent draws $(P_{i},X_{i})$ from (6) and a weighting scheme (Specification 3), we seek to apply learned weights in conjunction with weighted BH (Definition 2). We consider two possibilities:

1. Naive weighted BH: We use all data $((P_{i},X_{i}))_{i\in[m]}$ to learn $\widehat{W}^{([m])}$ and let $W_{i}\propto\widehat{W}^{([m])}(X_{i})$ for $i=1,\dotsc,m$ , such that the weights average to $1$ (i.e., $\sum_{i=1}^{m}W_{i}=m$ ). Then we apply the weighted BH procedure with p-values $P_{i}$ and weights $W_{i}$ .

2. IHW-BH: We partition $[m]$ into $K$ disjoint folds $I_{1},\dotsc,I_{K}$ , independently of $((P_{i},X_{i}))_{i\in[m]}$ . Then we apply Algorithm 3 in conjunction with weighted BH, i.e., for each fold $\ell$ , we apply the weighting scheme on $[m]\setminus I_{\ell}$ and for $i\in I_{\ell}$ set weight $W_{i}\propto\widehat{W}^{([m]\setminus I_{\ell})}(X_{i})$ such that the weights average to $1$ in that fold (i.e., $\sum_{i\in I_{\ell}}W_{i}=\left\lvert I_{\ell}\right\rvert$ ). Then we apply weighted BH with p-values $P_{i}$ and weights $W_{i}$ . We note that the data-driven weights $W_{i}$ are honest (Specification 1) by construction. However, for the asymptotics, we do not require $\tau$ -censoring (Specification 2), but instead require the mild technical condition (7).

Proposition 1.

Let $(P_{i},X_{i})$ be i.i.d. from the conditional two-groups model (6) satisfying regularity Assumption 3 (in Supplement S2). If the partition satisfies $|I_{\ell}|/m\to\gamma_{\ell}\in(0,1)$ as $m\to\infty$ for all $\ell$ , then (see Supplement S2 for the proof and formal statements):

(a) There exists a weighting scheme satisfying Specification 3, such that the naive weighted BH procedure asymptotically does not control the $\operatorname{FDR}$ .

(b) For any weighting scheme satisfying Specification 3, the IHW-BH procedure asymptotically controls the $\operatorname{FDR}$ .

(c) Consider a weighting scheme that converges in probability to a deterministic limiting weight function $W^{*}:\mathcal{X}\to\mathbb{R}_{\geq 0}$ ,

[TABLE]

Then, the naive weighted BH and IHW-BH procedures have the same power asymptotically.

Proof idea for (a) and (b).

The proof of Storey et al. (2004) for asymptotic $\operatorname{FDR}$ control of BH argues that by the Glivenko-Cantelli theorem, $\sup_{t}\left\lvert\frac{1}{m}\sum_{i=1}^{m}\left[\mathbf{1}(P_{i}\leq t)-\mathbb{P}[P_{i}\leq t]\right]\right\rvert\stackrel{{\scriptstyle\mathbb{P}}}{{\to}}0$ and similarly for the subset of null hypotheses. A consequence is that the BH estimator of the false discovery rate is asymptotically uniformly conservative over all thresholds $\geq\delta>0$ , which in turn implies asymptotic $\operatorname{FDR}$ control. Extending this argument to the weighted case requires uniform convergence: $\sup_{t}\left\lvert\frac{1}{m}\sum_{i=1}^{m}\left[\mathbf{1}(P_{i}\leq tW_{i})-\mathbb{P}[P_{i}\leq tW_{i}]\right]\right\rvert\stackrel{{\scriptstyle\mathbb{P}}}{{\to}}0$ .

For data-driven weights, this can be achieved by learning the weight function from a suitably restricted class $\mathcal{W}$ . Du and Zhang (2014); Ignatiadis et al. (2016); Durand (2019) all use $\mathcal{W}$ such that the functions $\{(p,x)\mapsto\mathbf{1}(p\leq tW(x))\mid t\in(0,1],W(\cdot)\in\mathcal{W}\}$ are $\mathbb{P}$ -Glivenko-Cantelli (van der Vaart, 2000). Similarly, Li and Barber (2019) consider $\mathcal{W}$ with low Rademacher complexity. On the other hand, if convergence is not uniform (e.g., if we are free to choose any weights satisfying Specification 3), then we can find regions of $\mathcal{X}$ -space that are enriched for small p-values merely by chance, upweight them, and violate $\operatorname{FDR}$ control (cf. Figure 2).

Instead, through cross-weighting, the richness of $\mathcal{W}$ is irrelevant: upon conditioning on other folds, $P_{i}/\widehat{W}^{([m]\setminus I_{\ell})}(X_{i})$ in fold $I_{\ell}$ are i.i.d., and thus the one-dimensional Glivenko-Cantelli result applies. ∎

In words, while data-driven weights can lead to overfitting (a), cross-weighting universally alleviates this (b). A further upshot of (b) is that it dispenses with the requirement for $\tau$ -censored weights (Specification 2). Finally, one may object that cross-weighting drops data and should thus be less powerful than a procedure that uses all the data. However, (c) shows that asymptotically one loses no power by using cross-weighting if the weighting procedure is well-behaved, i.e., if the weights converge to a limit.

As a corollary of Proposition 1, we have that:

Corollary 2 (IHW-GBH asymptotics).

Under the assumptions of Proposition 1 with $\mathcal{X}=[G]$ for fixed $G\in\mathbb{N}$ , the GBH and IHW-GBH procedures without null proportion adaptivity, described in Algorithms 1 and 2, have the same power asymptotically.

Proof.

In Supplement S2.4, we verify (7) and the condition from part (c) of Proposition 1. ∎

At this point, we note that Durand (2019), motivated by a preprint version of this work, derived the following related and elegant result: in the setting with $\mathcal{X}$ a finite discrete space, Durand (2019, Theorem 7.1.) constructs a cross-weighted procedure that asymptotically controls the $\operatorname{FDR}$ and simultaneously achieves the power of the optimal weighted procedure.

3 Extension to dependence

3.1 The key assumption: Independence across folds, dependence within

Assumption 1 made the strong assumption of joint independence of all null p-values and was sufficient for the results presented in Section 2. Real data commonly deviate from this assumption. The consequences of such deviations on the applicability of results derived using independence assumptions are typically difficult to reason about. It is therefore desirable to construct guarantees that can be derived from weaker assumptions that are closer to realistic patterns of dependence.

Assumption 2 (Distributional setting with dependence).

Let $(P_{i},X_{i})$ , $i\in[m]$ , be (p-value, covariate) pairs, let $I_{1},\dotsc,I_{K}$ be folds of a partition of $[m]$ that is defined based on information independent of $((P_{i},X_{i}))_{i\in[m]}$ , and let $\mathscr{H}_{0}\subset[m]$ be the index set of null hypotheses. We assume that:

(a) The (p-value, covariate) pairs are independent across folds $I_{1},\dotsc,I_{K}$ , but may be dependent within each fold. Formally, $((P_{i},X_{i}))_{i\in I_{\ell}},\ell\in[K]$ are jointly independent.

(b) For $i\in\mathscr{H}_{0}$ , it holds that $P_{i}$ is independent of $(X_{j})_{j\in[m]}$ .

(c) For $i\in\mathscr{H}_{0}$ , $P_{i}$ is super-uniform, i.e., $\mathbb{P}[P_{i}\leq t]\leq t$ for all $t\in[0,1]$ .

Let us compare Assumption 2 to Assumption 1. Parts 2(b, c) are mild. Part 2(c) is identical to 1(c) and standard in multiple testing. Part 2(b) is analogous to 1(b), albeit stronger, since we are conditioning on the full vector of $X_{i}$ . Nevertheless, 2(b) is implied by 1(a,b). In the important case where the $X_{i}$ are deterministic, 1(b) trivially holds. But it also allows for situations where, for instance, the $X_{i}$ are random spatial locations. In this case, we may expect p-values with similar $X_{i}$ to be correlated. Assumption 2(b) then means that knowing the locations $X_{i}$ of all hypotheses provides no information about a single null p-value $P_{i}$ .

The critical assumption is 2(a). Without covariates, the assumption implies that $I_{1},\dotsc,I_{K}$ is a partition of p-values into independent blocks. This is not an assumption typically encountered in the multiple testing literature, although it has appeared e.g., in Heesen and Janssen (2015); Guo and Sarkar (2019). It is fundamental to the cross-weighting approach, the core idea of which is to avoid any dependence between each individual null p-value $P_{i}$ and its data-driven weight $W_{i}$ . Cross-weighting ensures that $W_{i}$ is determined based on $X_{i}$ and p-values from the other folds, but not $P_{i}$ . This would no longer be true with dependence across folds. This observation is analogous to a similar phenomenon in cross-validation. In Chapter 7.1 of the Elements of Statistical Learning, Hastie, Tibshirani, and Friedman (2009) caution practitioners to split data into independent folds when evaluating a supervised learning method by cross-validation (CV): if the folds are not independent, the CV estimates of prediction error are not reliable.

From the application perspective, the assumption is practical: domain experts often have sufficient understanding of their data to find suitable partitions of the hypotheses into independent blocks. In the example from Figure 1, further detailed in Section 6, it is plausible to assume that the data for hypotheses located on different chromosomes are independent, or at least that any potential dependences are negligible. As another example, for covariates $X_{i}$ that correspond to spatial or temporal positions, hypotheses that are sufficiently far away from each other will be independent if the dependences are mediated by spatial or temporal proximity.
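One way to turn such domain knowledge into a partition is sketched below: whole blocks (for instance chromosomes) are dealt to folds at random, so that no block is split across folds and the assignment never looks at the p-values. The function name and the round-robin assignment are illustrative assumptions, not a prescription from the paper.

```python
import numpy as np

def folds_from_blocks(block_labels, n_folds, seed=0):
    """Assign whole blocks (e.g. chromosomes) to folds, never splitting a block."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(block_labels)
    blocks = np.unique(labels)
    # Shuffle the blocks, then deal them round-robin; p-values are never consulted.
    shuffled = rng.permutation(blocks)
    fold_of_block = {b: i % n_folds for i, b in enumerate(shuffled)}
    return np.array([fold_of_block[b] for b in labels])
```

For example, `folds_from_blocks(chromosome, n_folds=5)` with a hypothetical per-hypothesis array `chromosome` keeps all hypotheses of a chromosome in the same fold.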

We note that all other existing methods for multiple testing with covariates that provide $\operatorname{FDR}$ control assume either full independence (Lei and Fithian, 2018; Cai et al., 2019), weak dependence (Li and Barber, 2019) or the ability to consistently estimate the joint distribution of all hypotheses (Sun and Cai, 2009). Thus, Assumption 2 is a practical starting point towards dealing with common patterns of dependence encountered in real data.

Next, we describe two multiple testing methods with data-driven weights that have provable type-I error guarantees under dependence.

3.2 $k$ -FWER control with cross-weighting under dependence

$k$ -FWER control is achieved by applying cross-weighting in conjunction with the weighted $k$ -Bonferroni procedure of Definition 1. We are not aware of existing procedures with data-driven weights and finite-sample $k$ -FWER control. Existing proposals provide asymptotic guarantees (Wang, 2018).

The proof is direct and without technical complications. We provide it here in the main text, since it shows the key idea behind cross-weighting: each null p-value $P_{i}$ is independent of its weight $W_{i}$ , and this protects against overfitting.

Theorem 3.

Let $((P_{i},X_{i}))_{i\in[m]}$ satisfy Assumption 2 (or Assumption 1) with respect to the partition $I_{1},\dotsc,I_{K}$ . Furthermore assume that we construct data-driven weights $W_{i}$ that are honest w.r.t. $I_{1},\dotsc,I_{K}$ (Specification 1). Then the weighted $k$ -Bonferroni procedure (Definition 1) with p-values $P_{i}$ and weights $W_{i}$ controls the $k\text{-FWER}$ at the nominal level $\alpha$ .

Proof.

We first show that $P_{i}$ is independent of $W_{i}$ ( $P_{i}\perp W_{i}$ ) for any $i\in\mathscr{H}_{0}$ . Without loss of generality, $i\in\mathscr{H}_{0}\cap I_{\ell}$ . By honesty, $W_{i}$ is a function only of the p-values in the other folds, $(P_{j})_{j\in I_{\ell}^{c}}$ and all covariates $\mathbf{X}=(X_{j})_{j\in[m]}$ . It thus suffices to argue that $P_{i}$ is independent of $((P_{j})_{j\in I_{\ell}^{c}},\mathbf{X})$ . This follows from Assumption 2 (resp. Assumption 1). We next bound the $k\text{-FWER}$ .

[TABLE]

Note that in $(*)$ , we used the fact that for $i\in\mathscr{H}_{0}$ it holds that $P_{i}$ is super-uniform and $P_{i}$ is independent of $W_{i}$ . In the last step we used that honesty ensures that $\sum_{i}W_{i}=m$ . ∎
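The displayed bound is not reproduced in this version of the text. A chain of inequalities consistent with the two remarks in the proof can be sketched as follows, under two assumptions: the weighted $k$ -Bonferroni procedure rejects $H_{i}$ when $P_{i}\leq k\alpha W_{i}/m$ , and $V$ (notation introduced here) denotes the number of false rejections. The first step is Markov's inequality, $(*)$ conditions on $W_{i}$ , and the last inequality uses $W_{i}\geq 0$ .

```latex
\begin{aligned}
k\text{-FWER} = \mathbb{P}[V \geq k]
  &\leq \frac{\mathbb{E}[V]}{k}
   = \frac{1}{k}\sum_{i\in\mathscr{H}_{0}}
     \mathbb{P}\!\left[P_{i} \leq \frac{k\alpha W_{i}}{m}\right] \\
  &\stackrel{(*)}{\leq} \frac{1}{k}\sum_{i\in\mathscr{H}_{0}}
     \mathbb{E}\!\left[\frac{k\alpha W_{i}}{m}\right]
   \leq \frac{\alpha}{m}\,\mathbb{E}\!\left[\sum_{i=1}^{m} W_{i}\right]
   = \alpha .
\end{aligned}
```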

3.3 FDR control with cross-weighting under dependence

We recall the basic procedure for controlling $\operatorname{FDR}$ with (deterministic) weights under arbitrary dependence:

Definition 3 (Weighted Benjamini-Yekutieli (wBY) (Benjamini and Yekutieli, 2001; Blanchard and Roquain, 2008)).

Consider p-values $P_{1},\dotsc,P_{m}$ with arbitrary dependence such that the null p-values are super-uniform. Furthermore consider deterministic weights $w_{i}\geq 0$ such that $\sum_{i=1}^{m}w_{i}=m$ . Then the $\operatorname{FDR}$ is controlled at level $\alpha\in(0,1)$ by applying the weighted Benjamini-Yekutieli procedure at level $\alpha$ , i.e., the weighted Benjamini-Hochberg procedure (Definition 2) with $\tau=1$ at level $\alpha/\sum_{k=1}^{m}\frac{1}{k}$ .
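A minimal sketch of this procedure is given below. It assumes that, for $\tau=1$ , the weighted BH procedure is equivalent to the usual step-up rule applied to the weighted p-values $P_{i}/W_{i}$ ; the function name is hypothetical.

```python
import numpy as np

def weighted_by(pvalues, weights, alpha):
    """Weighted BH step-up at level alpha / (1 + 1/2 + ... + 1/m)."""
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    m = p.size
    level = alpha / np.sum(1.0 / np.arange(1, m + 1))
    # Weighted p-values P_i / W_i; a zero weight means the hypothesis is never rejected.
    q = np.full(m, np.inf)
    q[w > 0] = p[w > 0] / w[w > 0]
    order = np.argsort(q)
    passed = np.nonzero(q[order] <= level * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        reject[order[: passed[-1] + 1]] = True  # reject the k* smallest weighted p-values
    return reject
```

The harmonic-sum correction grows like $\log m$ , which is the price paid for validity under arbitrary dependence.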

We now show that applying the weighted BY procedure with cross-weighting controls the $\operatorname{FDR}$ under Assumption 2.

Theorem 4 (IHW-BY controls the $\operatorname{FDR}$ under honesty and independent folds).

Let $((P_{i},X_{i}))_{i\in[m]}$ satisfy Assumption 2 with respect to the partition $I_{1},\dotsc,I_{K}$ . Furthermore assume that we construct data-driven weights $W_{i}$ that are honest w.r.t. $I_{1},\dotsc,I_{K}$ (Specification 1). Then the weighted BY procedure (Definition 3) with p-values $P_{i}$ and weights $W_{i}$ controls the $\operatorname{FDR}$ at the nominal level $\alpha$ .

To demonstrate that honesty is essential for the result of Theorem 4, we next describe two plausible candidate methods for FDR control with covariates that do not control $\operatorname{FDR}$ :

Example 1 (BY with arbitrary data-driven weights does not control $\operatorname{FDR}$ under Assumption 2).

Theorem 4 may appear as a consequence of Theorem 4.2. of Blanchard and Roquain (2008), who extended the results of Benjamini and Yekutieli (2001) and proved that the weighted BY procedure (Definition 3) controls the $\operatorname{FDR}$ for any choice of weights and any p-value distribution. However, their result holds only for deterministic weights and not for data-driven weights, as we now demonstrate.

Proof.

We generate $((P_{i},X_{i}))_{i\in[m]}$ satisfying Assumption 2 and under the global null as follows: fix $m=2m^{\prime}$ for $m^{\prime}\in\mathbb{N}$ . We consider deterministic covariates $X_{i}=i$ and the partition $I_{1}=\left\{1,\dotsc,m^{\prime}\right\},I_{2}=\left\{m^{\prime}+1,\dotsc,m\right\}$ . We first draw a permutation $\sigma$ from the uniform measure on the permutation group of $\left\{1,\dotsc,m^{\prime}\right\}$ . Next we independently draw $U_{i}\sim U[(i-1)/m^{\prime},i/m^{\prime}]$ for $i=1,\dotsc,m^{\prime}$ and let $P_{i}=U_{\sigma(i)}$ . Finally we draw independent $P_{m^{\prime}+1},\dotsc,P_{m}\sim U[0,1]$ . Weights are chosen as follows: let $i^{*}\in\operatorname{argmin}_{i}\left\{P_{i}\right\}$ and then let $W_{i}=W(X_{i})=m\mathbf{1}\left(X_{i}=i^{*}\right)$ . Then the $\operatorname{FDR}$ of weighted BY at $\alpha$ is equal to $1$ as long as $m/\sum_{k=1}^{m}\frac{1}{k}>2/\alpha$ , as we now show:

Since the smallest p-value in $I_{1}$ is uniformly distributed on $[0,1/m^{\prime}]$ , it follows that with probability $1$ , $P_{i^{*}}\leq 1/m^{\prime}$ and hence $P_{i^{*}}/W_{i^{*}}\leq 2/m^{2}<\alpha/(m\sum_{k=1}^{m}\frac{1}{k})$ . Thus $H_{i^{*}}$ gets rejected and so $\operatorname{FDP}=1$ almost surely. ∎
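The construction can be checked numerically with the sketch below. It takes the argmin over fold $I_{1}$ (one reading of the construction) and uses the fact that only $H_{i^{*}}$ has non-zero weight, so the weighted BY step-up reduces to a single comparison with $k=1$ . The function name and the parameter values in the usage note are illustrative assumptions.

```python
import numpy as np

def example1_fdp(m_half, alpha, seed):
    """One draw of the Example 1 construction; returns the false discovery proportion."""
    rng = np.random.default_rng(seed)
    m = 2 * m_half
    # Fold 1: one p-value per interval [(i-1)/m', i/m'], in random order (dependent block).
    u = (np.arange(m_half) + rng.uniform(size=m_half)) / m_half
    p = np.concatenate((rng.permutation(u), rng.uniform(size=m_half)))  # fold 2: i.i.d.
    # Dishonest weights: the whole budget m goes to the smallest p-value in fold 1.
    i_star = np.argmin(p[:m_half])
    w = np.zeros(m)
    w[i_star] = m
    # Weighted BY: only i_star can be rejected, so the step-up is the k = 1 check.
    level = alpha / np.sum(1.0 / np.arange(1, m + 1))
    rejected = p[i_star] / w[i_star] <= level * 1 / m
    return 1.0 if rejected else 0.0  # global null: every rejection is false
```

With `m_half=100` and `alpha=0.1` the condition $m/\sum_{k=1}^{m}\frac{1}{k}>2/\alpha$ holds, and every draw returns a false discovery proportion of 1.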

In contrast, $\operatorname{FDR}$ control would be guaranteed, had we used weights derived through cross-weighting. BY with $\tau$ -censored weights (Specification 2) also does not control $\operatorname{FDR}$ , cf. Supplement S1.6.

Example 2 (AdaPT with BY correction does not control $\operatorname{FDR}$ under Assumption 2).

Lei and Fithian (2018) prove $\operatorname{FDR}$ control for AdaPT under full independence (cf. Assumption 1). Here we demonstrate that even with the Benjamini-Yekutieli correction, i.e., at level $\alpha/\sum_{k=1}^{m}\frac{1}{k}$ and $\tau$ -censoring (Specification 2), AdaPT does not control $\operatorname{FDR}$ under Assumption 2.

Proof.

We generate $((P_{i},X_{i}))_{i\in[m]}$ satisfying Assumption 2 and under the global null as follows: we fix $m=2m^{\prime},m^{\prime}\in\mathbb{N}$ and consider the partition $I_{1}=\left\{1,\dotsc,m^{\prime}\right\},I_{2}=\left\{m^{\prime}+1,\dotsc,m\right\}$ . We take constant covariates $X_{i}=1$ for all $i$ and draw $P_{1},P_{m^{\prime}+1}\overset{\text{iid}}{\sim}U[0,1]$ . Finally we set $P_{2},\dotsc,P_{m^{\prime}}=P_{1}$ and $P_{m^{\prime}+2},\dotsc,P_{m}=P_{m^{\prime}+1}$ . We then run the AdaPT algorithm at level $\alpha/\sum_{k=1}^{m}\frac{1}{k}$ with the initialization specified in Lei and Fithian (2018). Then $\operatorname{FDR}\geq 0.2925$ as long as $m/\sum_{k=1}^{m}\frac{1}{k}>2/\alpha$ , as we now show:

As specified in Section 4.4.1 of Lei and Fithian (2018), the AdaPT algorithm is initialized at threshold $0.45$ . Let $A$ denote the event $\left\{P_{1}\leq 0.45,P_{m^{\prime}+1}<0.55\right\}$ . On the event $A$ , on the first step of the algorithm, AdaPT estimates the FDP (cf. (14)) as $\left(1+\sum_{i}\mathbf{1}\left(P_{i}\geq 1-0.45\right)\right)/\sum_{i}\mathbf{1}\left(P_{i}\leq 0.45\right)$ , which is equal to $1/m^{\prime}$ if $P_{m^{\prime}+1}>0.45$ and equal to $1/m$ otherwise. In both cases, the estimated FDP is less than or equal to $1/m^{\prime}$ and thus less than $\alpha/\sum_{k=1}^{m}\frac{1}{k}$ under our assumption on $m,\alpha$ . Thus AdaPT immediately terminates, rejecting all p-values in $I_{1}$ , and so $\operatorname{FDP}=1$ . Similarly $\operatorname{FDP}=1$ on the event $A^{\prime}=\left\{P_{1}<0.55,P_{m^{\prime}+1}\leq 0.45\right\}$ and $\operatorname{FDR}\geq\mathbb{P}[A\cup A^{\prime}]=0.2925$ . Finally, note that the above procedure is $\tau$ -censored with $\tau=0.45$ . ∎
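The value $0.2925$ follows from inclusion-exclusion, using that $P_{1}$ and $P_{m^{\prime}+1}$ are independent and uniform and that $A\cap A^{\prime}=\left\{P_{1}\leq 0.45,P_{m^{\prime}+1}\leq 0.45\right\}$ :

```latex
\begin{aligned}
\mathbb{P}[A \cup A^{\prime}]
  &= \mathbb{P}[A] + \mathbb{P}[A^{\prime}] - \mathbb{P}[A \cap A^{\prime}] \\
  &= 0.45 \cdot 0.55 + 0.55 \cdot 0.45 - 0.45^{2} \\
  &= 0.2475 + 0.2475 - 0.2025 = 0.2925 .
\end{aligned}
```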

4 Learning powerful weighting rules

Sections 2 and 3 focused on sufficient conditions for type-I error control, but did not address power. These conditions leave considerable flexibility in the choice of the class of possible weight functions, and in the method of selecting (or “learning”) these functions, given the data. This flexibility gives the analyst the opportunity to use domain-specific as well as statistical knowledge to make choices that have desirable type-II error properties. Nevertheless, it is useful to provide a default algorithm that works well across a range of settings. To this end, here we describe two schemes for learning weight functions, one for weighted $k$ -Bonferroni and one for weighted BH. Both rely on positing the approximate applicability of model (6), estimating quantities appearing therein and solving a convex program to find a weight function that optimizes the expected number of discoveries.

4.1 Learning weights for IHW $k$ -Bonferroni

The weighted $k$ -Bonferroni procedure with weight function $W(\cdot)$ rejects hypotheses that satisfy $P_{i}\leq(k\alpha/m)\,W(X_{i})$ . Under Model (6), a weight function maximizing the expected number of discoveries is one that maximizes $\sum_{i}\mathbb{P}[P_{i}\leq(k\alpha/m)\,W(X_{i})\mid X_{i}]=\sum_{i}F((k\alpha/m)\,W(X_{i})\mid X_{i})$ . To derive honest weights (Specification 1) that approximately maximize this objective, we learn $\widehat{W}^{-\ell}$ for each fold $\ell$ separately as follows: first we estimate $F(t\mid x)$ from Model (6) by $\widehat{F}^{-\ell}(t\mid x)$ using only p-values and covariates outside of fold $\ell$ . Next, identifying $\widehat{W}^{-\ell}(\cdot)$ with the function’s values evaluated at the $X_{i}$ , i.e., $W_{i}=\widehat{W}^{-\ell}(X_{i}),\;i\in I_{\ell}$ , we solve the $\left\lvert I_{\ell}\right\rvert$ -dimensional problem with optimization variables $\mathbf{w}=(w_{i})_{i\in I_{\ell}}$ :

[TABLE]

This setting allows for conditional distributions $\widehat{F}^{-\ell}(t\mid X_{i})$ that are different for tests with different covariates $X_{i}$ . We consider estimators $\widehat{F}^{-\ell}(t\mid x)$ that are concave in $t$ for all $x$ . This has the advantage of turning (8) into a convex optimization program, which is often tractable. Concavity of the distribution of p-values is a reasonable assumption and often provides a good fit to multiple testing datasets (Strimmer, 2008b; Genovese, Roeder, and Wasserman, 2006). However, the procedure works even when the concavity assumption does not hold: given any (potentially non-concave) pilot estimator of the conditional distribution function $t\mapsto F(t\mid x)$ , we can project it onto the set of concave distribution functions and solve the optimization problem with the projected distribution functions. We interpret the resulting procedure as a convex relaxation of (8) that makes computation tractable.

With this setup, we are ready to state a concrete weighting scheme, which proceeds in three steps: first, discretize the $X_{i}$ into a finite number of bins defined, e.g., by quantile slicing or as the leaves of a tree. Second, estimate $\widehat{F}^{-\ell}(t\mid\text{bin})$ by the Grenander estimator (Grenander, 1956), i.e., the least concave majorant of the empirical cumulative distribution function of the p-values $P_{i}$ with $i\in I_{\ell}^{c}$ and $X_{i}\in\text{bin}$ . Third, solve (8) for each $\ell$ by linear programming. The reason that (8) may be expressed as a linear program is that the Grenander estimator is always concave in $t$ and piecewise linear. We provide the details of the estimation and optimization procedures in Supplement S4.1; the computational complexity scales as $O(\log(m)\cdot m)$ .
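The second step can be sketched as follows: the least concave majorant of the empirical distribution function is the upper convex hull of its jump points together with the anchors $(0,0)$ and $(1,1)$ . This is a plain quadratic-time-free hull scan written for illustration, not the implementation described in Supplement S4.1; the function name is hypothetical and distinct p-values in $(0,1)$ are assumed.

```python
import numpy as np

def least_concave_majorant(pvalues):
    """Knots (x, y) of the least concave majorant of the ECDF on [0, 1]."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    # ECDF jump points, plus the anchors (0, 0) and (1, 1).
    xs = np.concatenate(([0.0], p, [1.0]))
    ys = np.concatenate(([0.0], np.arange(1, n + 1) / n, [1.0]))
    hull = []  # indices of the upper-hull vertices, left to right
    for i in range(xs.size):
        # Drop the last vertex while it lies on or below the chord to the new point.
        while len(hull) >= 2:
            x0, y0 = xs[hull[-2]], ys[hull[-2]]
            x1, y1 = xs[hull[-1]], ys[hull[-1]]
            cross = (x1 - x0) * (ys[i] - y0) - (y1 - y0) * (xs[i] - x0)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return xs[hull], ys[hull]
```

The estimate at any $t$ is then `np.interp(t, knots_x, knots_y)`. Because the result is concave and piecewise linear, the objective built from it can be written with linear constraints, which is what makes the linear programming formulation possible.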

An alternative ansatz is to specify $\pi_{0}(x)$ and $F_{\text{alt}}(\cdot\mid X_{i}=x)$ in the conditional two-groups model (6) parametrically. For instance, we may consider for $X_{i}\in\mathbb{R}^{p}$

[TABLE]

Such a Beta-Uniform mixture model has been considered in the setting without covariates, e.g., by Allison et al. (2002); Klaus and Strimmer (2011) and with covariates by Lei and Fithian (2018). In Supplement S4.2 we explain how to learn the parameters of the model using the expectation-maximization algorithm and how to optimize (8).

4.2 Learning weights for IHW Benjamini-Hochberg

Our starting point for deriving powerful weight functions for the weighted BH procedure (Definition 2) is again the conditional two-groups model (6). We seek a threshold function $s:\mathcal{X}\to[0,1]$ , such that the multiple testing procedure that rejects hypotheses with $P_{i}\leq s(X_{i})$ satisfies the following two properties: first, the marginal $\operatorname{FDR}$ , defined as $\text{mFDR}(s):=\mathbb{P}[H_{i}=0\mid P_{i}\leq s(X_{i})]$ , is bounded by $\alpha$ , i.e., $\text{mFDR}(s)\leq\alpha$ , and second, the expected number of discoveries $\sum F(s(X_{i})\;|\;X_{i})$ is large. (Such a Bayesian, Neyman-Pearson type procedure is motivated by the asymptotic equivalence between the frequentist $\operatorname{FDR}$ and the mFDR; see Genovese and Wasserman, 2004; Sun and Cai, 2007; Cai and Sun, 2009; Cai et al., 2019.) Similarly to our Bonferroni construction, we learn the threshold function $\widehat{s}^{-\ell}$ for each fold $\ell$ separately. To this end, we estimate $\widehat{F}^{-\ell}(t\mid x)$ and $\widehat{\pi}_{0}^{-\ell}(x)$ out of fold. Noting that $\text{mFDR}(s)\leq\alpha$ is implied by $\mathbb{E}\left[\pi_{0}(X_{i})s(X_{i})\right]\leq\alpha\mathbb{E}\left[F(s(X_{i})\mid X_{i})\right]$ , we propose solving:

[TABLE]

As our goal is to apply the weighted BH procedure, we convert these thresholds $t_{i}$ into weights $W_{i}$ through normalization: for $i\in I_{\ell}$ , set $W_{i}=\left\lvert I_{\ell}\right\rvert\cdot t_{i}/(\sum_{j\in I_{\ell}}t_{j})$ , unless the denominator is $0$ , in which case $W_{i}=1$ . A few remarks are in order: similarly to optimization problem (8), (10) is also a convex program if $\widehat{F}^{-\ell}(t\mid x)$ is concave in $t$ for all $x$ , and may be expressed as a linear program if the Grenander estimator is used. We thus again suggest discretizing $X_{i}$ and estimating distributions with the Grenander estimator. If the weights will be applied in conjunction with the weighted BH algorithm, we suggest simply setting $\widehat{\pi_{0}}^{-\ell}\equiv 1$ . This optimization and estimation scheme was proposed by Ignatiadis et al. (2016). Alternatively, $\widehat{\pi_{0}}^{-\ell}(x)$ may be estimated by applying Storey’s null proportion estimator (Storey et al., 2004) to all hypotheses outside fold $I_{\ell}$ that fall into the same bin as $x$ . Details of the estimation and optimization procedures are provided in Supplement S4.1.
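As a small illustration of the normalization step and of the alternative null proportion estimate, consider the following sketch. The function names are hypothetical. The Storey-type estimate is written in the common form $(1+\#\{P_{i}>\tau\})/(n(1-\tau))$ , which is an assumption here, since this section does not spell out the formula; in the scheme described above it would be applied to the p-values outside fold $I_{\ell}$ that fall into the same bin.

```python
def thresholds_to_weights(thresholds):
    """Normalize the thresholds of one fold so that the weights average to 1."""
    n = len(thresholds)
    total = sum(thresholds)
    if total == 0:
        # Degenerate case from the text: fall back to uniform weights.
        return [1.0] * n
    return [n * t / total for t in thresholds]


def storey_null_proportion(pvalues, tau=0.5):
    """Storey-type estimate of the null proportion (assumed '+1' form, uncapped)."""
    n_above = sum(1 for p in pvalues if p > tau)
    return (1 + n_above) / (len(pvalues) * (1 - tau))


# Hypothetical thresholds for a fold with four hypotheses.
weights = thresholds_to_weights([0.01, 0.03, 0.0, 0.0])
```

Hypotheses with threshold zero receive weight zero and can never be rejected, while the remaining weight budget is redistributed to the other hypotheses in the fold.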

The weights $W_{i}$ constructed above are honest (Specification 1). Yet, in view of Theorems 1 and 2, it might appear unsatisfying that $W_{i}$ do not satisfy the $\tau$ -censored weights condition (Specification 2). In our experience, the proposed procedure with the Grenander estimator does not overfit and controls the $\operatorname{FDR}$ . This is corroborated by extensive simulations below and by the asymptotic guarantees of Proposition 1.

Our alternative proposal, which satisfies $\tau$ -censoring (Specification 2), is to fit the Beta-Uniform mixture model (9). The EM algorithm may be modified to accommodate censored knowledge of $P_{i}\leq\tau$ ; cf. Markitsis and Lai (2010) in the setting without covariates. Furthermore, under model (9), the solution to problem (10) lies on a contour of equal conditional local fdr (cf. Theorem 2 in Lei and Fithian (2018)), and this fact facilitates the optimization. We describe the steps in more detail in Supplement S4.2.

Finally, we use the same framework to derive weights for the weighted Benjamini-Yekutieli procedure (Definition 3): we proceed as for weighted BH but solve (10) with $\alpha$ replaced by $\alpha/\sum_{k=1}^{m}\frac{1}{k}$ . In this case, honesty suffices for $\operatorname{FDR}$ control (Theorem 4).
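To make the role of the weights and of the Benjamini-Yekutieli correction concrete, here is a minimal sketch. Definitions 2 and 3 are not reproduced in this section, so the code assumes the standard weighted step-up rule, namely BH applied to $P_{i}/W_{i}$ with weights that average to one; the function names are hypothetical.

```python
def by_adjusted_level(alpha, m):
    """Nominal level divided by the harmonic sum 1 + 1/2 + ... + 1/m."""
    return alpha / sum(1.0 / k for k in range(1, m + 1))


def weighted_bh(pvalues, weights, alpha):
    """Weighted step-up procedure; returns one rejection flag per hypothesis."""
    m = len(pvalues)
    # A zero weight means the hypothesis can never be rejected.
    q = [p / w if w > 0 else float("inf") for p, w in zip(pvalues, weights)]
    order = sorted(range(m), key=lambda i: q[i])
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / m:
            k_star = rank  # largest rank whose weighted p-value passes
    if k_star == 0:
        return [False] * m
    threshold = alpha * k_star / m
    return [q_i <= threshold for q_i in q]


# Hypothetical inputs: the third hypothesis has a large weight.
pvals = [0.001, 0.02, 0.04, 0.9]
weights = [0.5, 0.5, 2.5, 0.5]
rejected_bh = weighted_bh(pvals, weights, alpha=0.05)
rejected_by = weighted_bh(pvals, weights, by_adjusted_level(0.05, len(pvals)))
```

Running the same step-up rule at the smaller level is all that changes between the two variants in this sketch; how the weights themselves are learned at level $\alpha/\sum_{k=1}^{m}\frac{1}{k}$ is described in the text above.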

5 Numerical experiments

Our goal in this section is to corroborate through simulations of three important settings—grouped multiple testing, multiple testing with continuous covariates and simultaneous two-sample tests—the following claims: first, some methods with asymptotic FDR control guarantees do not control FDR in finite samples. Second, IHW is a flexible framework for multiple testing, its main advantage over other methods being finite sample error control (due to cross-weighting), while remaining competitive in terms of power. Throughout this section, we define power as

[TABLE]

The expectation, just as the $\operatorname{FDR}$ , is evaluated through averaging over Monte Carlo replicates.

5.1 Grouped multiple testing

We first consider the multiple testing problem with groups, i.e., with categorical covariates $X_{i}\in[G]$ . In each simulation we generate $(P_{i},H_{i},X_{i}),\;i=1,\dotsc,20000$ , independently, as follows

[TABLE]

In words, there are 40 latent groups defined by $\tilde{X}_{i}$ , each with 500 hypotheses. A quarter of the groups have non-nulls, three quarters do not. The alternative signal strength $\mu(\cdot)$ and null proportion $\pi(\cdot)$ vary linearly across non-null groups. Parameters are chosen so that the overall proportion of nulls is $0.9$ . We then coarsen $\tilde{X}_{i}$ to $X_{i}=\lceil\tilde{X}_{i}/40\cdot G\rceil$ , with $G$ varying across simulations; $X_{i}$ is non-latent, i.e., visible to the algorithm. For example, for $G=2$ , $X_{i}$ takes on only two levels (2 groups), while for $G=40$ , $X_{i}=\tilde{X}_{i}$ takes on all 40 levels. We also use the above configuration of covariates and simulate under the global null by drawing all p-values from the uniform distribution.
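The coarsening rule can be illustrated as follows. This is only a sketch of the covariate construction (the p-value model itself is not reproduced here), and the function name `coarsen_groups` is hypothetical. Integer arithmetic is used for the ceiling so that floating-point rounding cannot push a value across a group boundary.

```python
def coarsen_groups(latent_groups, n_groups, n_latent=40):
    """Map latent group labels 1..n_latent to ceil(label / n_latent * n_groups)."""
    # (a + b - 1) // b is the integer ceiling of a / b for positive integers.
    return [(x * n_groups + n_latent - 1) // n_latent for x in latent_groups]


# 40 latent groups with 500 hypotheses each, i.e., 20000 hypotheses in total.
latent = [g for g in range(1, 41) for _ in range(500)]
two_groups = coarsen_groups(latent, n_groups=2)
all_groups = coarsen_groups(latent, n_groups=40)
```

With `n_groups=2` the first 20 latent groups are merged into level 1 and the remaining 20 into level 2, whereas `n_groups=40` leaves the labels unchanged.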

We compare the following seven methods:

1. The Benjamini-Hochberg (BH) method (Benjamini and Hochberg, 1995), which ignores the covariates $X_{i}$ .

2. The stratified BH procedure (SBH) (Sun et al., 2006; Efron, 2008), wherein the BH procedure is applied $G$ times separately to p-values corresponding to different levels of $X_{i}$ .

3. The Clfdr (conditional local fdr) procedure of Cai and Sun (2009), which applies an optimal decision rule that rejects hypotheses with a low value of the group-wise local fdr (cf. Algorithm 4 in Supplement S3). We apply a data-driven version of the oracle rule by estimating local fdrs within each group with the fdrtool CRAN Package (Strimmer, 2008a), which estimates marginal densities with the Grenander estimator.

4. The Group Benjamini-Hochberg (GBH) procedure of Hu et al. (2010) with null-proportion adaptivity, as described in Algorithm 1 ( $\tau=0.5$ ).

5. The IHW-GBH procedure with null-proportion adaptivity, as described in Algorithm 2 ( $\tau=0.5$ ) with hypotheses randomly split into 5 folds.

6. The IHW-Storey-Grenander procedure: the IHW-Storey method (Theorem 2) with hypotheses randomly split into 5 folds and data-driven weights based on the Grenander estimator described in Section 4.2 and Supplement S4.1.

7. The Structure Adaptive Benjamini-Hochberg algorithm (SABHA) by Li and Barber (2019): SABHA first estimates $\widehat{\pi}_{0}(\cdot)$ for each group by solving a joint convex optimization problem. Then, the $\tau$ -censored, weighted BH procedure is applied with weights $W_{i}=1/\widehat{\pi}_{0}(X_{i})$ . We set the tuning parameters of group-wise SABHA to $\tau=0.5,\;\varepsilon=0.1$ following Section 7.1 of Li and Barber (2019).

All of the above methods provably control $\operatorname{FDR}$ asymptotically, as $m\to\infty$ , the number of groups remains fixed and there is signal in the data, but only BH and IHW-GBH have provable finite-sample $\operatorname{FDR}$ control at $\alpha$ and SABHA at $\alpha(1+10\sqrt{G/m})$ (Li and Barber, 2019, Lemma 2).

Results are shown in Figure 3. Under the global null (Fig. 3A), SBH strongly overfits, since under the global null the $\operatorname{FDR}$ is equivalent to the FWER, so it would need to pay a Bonferroni correction to apply BH separately to each group. Clfdr has $\operatorname{FDR}$ much below nominal for a small number of groups (the oracle local fdr procedure would not reject anything under the global null), but as the number of groups increases, it no longer controls $\operatorname{FDR}$ . We further discuss this below. All other methods control $\operatorname{FDR}$ in this setting. For GBH, however, recall Fig. 2 for a situation where it does display a pronounced loss of $\operatorname{FDR}$ control.
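The size of the SBH inflation under the global null can be worked out directly. The sketch below assumes independent, uniform null p-values, in which case BH applied to a single group at level $\alpha$ makes at least one rejection with probability exactly $\alpha$ (Simes' identity), and the groups are independent of each other. The level $\alpha=0.1$ is a hypothetical choice for illustration; the nominal level of the simulations is not stated in this section.

```python
def sbh_any_rejection_probability(alpha, n_groups):
    """P(at least one rejection) when BH at level alpha runs separately per group."""
    # Each group independently yields a false rejection with probability alpha.
    return 1.0 - (1.0 - alpha) ** n_groups


# Hypothetical nominal level; G ranges over some of the group counts in the text.
for n_groups in (1, 2, 10, 40):
    print(n_groups, sbh_any_rejection_probability(0.1, n_groups))
```

Under the global null every rejection is false, so this probability equals both the FWER and the FDR of SBH, and it approaches one as the number of groups grows. Running BH at level $\alpha/G$ per group would restore control, which is the Bonferroni correction mentioned above.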

For the simulations with signal (Fig. 3B, C) we make the following observations: as $G$ increases, the covariates become more informative, hence in principle power can be increased. Indeed this is precisely what we observe (Fig. 3C) for the grouped methods that do not directly estimate the distribution in each group (all methods except Clfdr and IHW-Storey-Grenander). The power of BH remains constant. After BH, the least powerful procedure appears to be SABHA; the suboptimality of its weighting scheme has been previously pointed out (Lei and Fithian, 2018). We also observe that IHW-GBH matches the power of GBH and has the added advantage of provable finite-sample $\operatorname{FDR}$ control. Regarding the methods that estimate the distribution, when $G$ is small relative to $m$ , then the Grenander estimator can precisely estimate the distribution in each bin. This translates into the Clfdr procedure and IHW-Storey-Grenander outperforming the other methods at small $G$ ; indeed Clfdr is provably asymptotically the most powerful procedure in this setting. However, as $G$ increases and the amount of data in each group decreases, the distributions are not estimated as accurately. The consequence for Clfdr is loss of $\operatorname{FDR}$ control, while IHW-Storey-Grenander retains $\operatorname{FDR}$ control due to cross-weighting. In conclusion, in this set of simulations, IHW is the most powerful method of those that control $\operatorname{FDR}$ .

5.2 Multiple testing with continuous covariates

In this section we explore a setting with a two-dimensional, continuous covariate $X_{i}$ . We seek to compare IHW, AdaPT and local fdr based methods with an emphasis on understanding behavior under model-misspecification (to be made precise momentarily). We simulate independent $(X_{i},H_{i},P_{i}),i=1,\dotsc,10000$ from the conditional two-groups model (6) with the following choices for $\mathbb{P}^{X},\pi_{0}(x)$ and $F_{\text{alt}}(\cdot\mid X_{i}=x)$ :

[TABLE]

$\bar{\beta}\in[1,3]$ is a parameter that varies across simulation settings. The two-dimensional covariates $X_{i}$ modulate both the null proportion $\pi_{0}(X_{i})$ and the signal in the alternative density. We compare six methods.

1. The Benjamini-Hochberg (BH) (Benjamini and Hochberg, 1995) method ignoring $X_{i}$ .

2. The oracle Clfdr procedure (Clfdr-oracle) that rejects hypotheses with a small conditional local $\operatorname{fdr}$ , $\operatorname{fdr}(P_{i}|X_{i}):=\mathbb{P}[H_{i}=0|X_{i},P_{i}]$ with a threshold chosen through Algorithm 4 in Supplement S3. This procedure achieves an optimal trade-off between the false nondiscovery rate and the false discovery rate, cf. Sun and Cai (2007); Cai and Sun (2009). Clfdr-oracle, however, would not be available to an analyst, as it assumes oracle knowledge of the components (13) in model (6).

3. The IHW-BH-Grenander procedure, similarly to the previous section, but without null-proportion adaptivity (i.e., with IHW-BH instead of IHW-Storey). The covariates $X_{i}\in[0,1]^{2}$ are binned into $5\times 5$ equal volume bins.

Furthermore, we compare three methods that fit Model (9) as a misspecified working model for the true model (13) using the EM algorithm (details in Supplement S4.2).

4. Clfdr-EM: this is the same as Clfdr-oracle, but instead of true quantities we use the ones estimated by maximum likelihood on the misspecified model (9). We employ the EM algorithm since the status $H_{i}\in\left\{0,1\right\}$ is unknown.

5. IHW-Storey-BetaMix: this is the IHW-Storey method with hypotheses split randomly into 5 folds and weights derived from optimization problem (10) based on the (out-of-fold) estimated working model (9). Here the EM algorithm deals with both unknown $H_{i}$ and unknown value of censored p-values $P_{i}\leq\tau$ with $\tau=0.1$ .

6. AdaPT, as implemented in the adaptMT CRAN package, wherein in each iteration the working model (9) is fitted. The EM algorithm deals with unknown $H_{i}$ and for a subset of hypotheses (“masked hypotheses”) the algorithm only has access to $\min\left\{P_{i},1-P_{i}\right\}$ instead of $P_{i}$ .

The results are shown in Fig. 4. As expected from theory, Clfdr-oracle controls the $\operatorname{FDR}$ and is most powerful. Clfdr-EM is also powerful; however, because of misspecification in model (9), it does not control the $\operatorname{FDR}$ . All other algorithms control the $\operatorname{FDR}$ . Among these, AdaPT is most powerful, closely followed by IHW-Storey-BetaMix and then by IHW-BH-Grenander; all of these procedures improve substantially upon BH.

Breaking AdaPT:

Fig. 4 demonstrates that AdaPT is very powerful for multiple testing in model (13). However, under two conditions (more on which below), AdaPT’s power (but not its $\operatorname{FDR}$ control guarantees) can be diminished, even under independence. To explain these two conditions, we first provide a summary of how AdaPT works. In iteration $j$ of AdaPT, a candidate rejection function $s_{j}:\mathcal{X}\to[0,1]$ is maintained and hypotheses that satisfy $P_{i}\leq s_{j}(X_{i})$ are in the provisional rejection set. The false discovery proportion at step $j$ is estimated by the Barber and Candès (2015) estimator (cf. Arias-Castro and Chen (2017)):

[TABLE]

If $\widehat{\operatorname{FDP}}_{j}\leq\alpha$ , the algorithm terminates and returns the current rejection set. Otherwise the rejection region $s_{j}$ is further shrunk to $s_{j+1}$ with $s_{j+1}(x)\leq s_{j}(x)$ for all $x$ . The iteration continues until either the stopping criterion is satisfied or the empty set is returned.

The first complication of (14) is that AdaPT must reject at least $1/\alpha$ hypotheses or none at all. For example, for $\alpha=0.05$ , if there are 19 very small p-values, AdaPT may not be able to reject them, even if BH could. Hence AdaPT has low power in situations with very sparse signals, where the best one could hope for is to detect a handful of hypotheses. This is apparent in Figure 4, in the lowest signal situation $(\bar{\beta}=1.0)$ . There, AdaPT has $\operatorname{FDR}$ substantially below the nominal $\alpha$ and furthermore has lower power than IHW-Storey-BetaMix.
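The arithmetic behind the $1/\alpha$ requirement can be checked with a short sketch. Since the display of (14) is not reproduced in this text, the code assumes the usual form of the estimator used by AdaPT, $(1+\#\{i:P_{i}\geq 1-s_{j}(X_{i})\})/\max\{1,\#\{i:P_{i}\leq s_{j}(X_{i})\}\}$ ; the function names are hypothetical.

```python
def fdp_estimate(n_mirror, n_rejected):
    """Assumed estimator: (1 + mirror count) / max(1, rejection count)."""
    return (1 + n_mirror) / max(1, n_rejected)


def min_rejections(alpha):
    """Smallest number of rejections for which the estimate can be <= alpha."""
    n_rejected = 1
    # Best case: no p-value falls into the mirrored region, so the estimate is 1/R.
    while fdp_estimate(0, n_rejected) > alpha:
        n_rejected += 1
    return n_rejected


# With 19 small p-values and nothing in the mirrored region, the estimate is 1/19.
estimate_19 = fdp_estimate(0, 19)
estimate_20 = fdp_estimate(0, 20)
```

Even in the most favorable case, the numerator is at least one, so 19 rejections give an estimate of $1/19>0.05$ and the stopping criterion at $\alpha=0.05$ is first met with 20 rejections.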

The second complication is that AdaPT can be conservative when the null p-value distribution is strictly super-uniform instead of uniform, because the numerator in (14) will overestimate the false discoveries. In applications, a strictly super-uniform distribution is typically caused by discrete p-values or when the researcher is testing for a one-sided alternative using a test calibrated to effect size zero, but many nulls have an effect in the opposite direction. To explore such enrichment of large p-values, we repeat the previous simulation with $P_{i}\mid(H_{i}=0)\;\sim\;(1-\kappa)\,U[0,1]+\kappa\,\text{Beta}(1,0.5)$ , varying $\kappa$ $\in[0,0.1]$ and fixed $\bar{\beta}=2$ . Our previous simulations correspond to $\kappa=0$ , which yields the uniform null distribution. Fig. 5A shows the null density as $\kappa$ varies, and panels B,C show the results of the simulation. We see that as $\kappa$ increases, the $\operatorname{FDR}$ of AdaPT quickly drops below the nominal $\alpha$ and as a consequence, power deteriorates.
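The null mixture used in this experiment is easy to examine numerically. The sketch below uses only the Python standard library; the function names are hypothetical. It relies on the closed form of the $\text{Beta}(1,0.5)$ distribution function, $1-(1-t)^{1/2}$ , to write down the null distribution function and to confirm that it lies below the uniform one.

```python
import random


def null_cdf(t, kappa):
    """CDF of (1 - kappa) * U[0, 1] + kappa * Beta(1, 0.5) at t in [0, 1]."""
    return (1 - kappa) * t + kappa * (1 - (1 - t) ** 0.5)


def sample_null_pvalues(n, kappa, seed=0):
    """Draw n null p-values from the mixture with a fixed seed."""
    rng = random.Random(seed)
    return [
        rng.betavariate(1.0, 0.5) if rng.random() < kappa else rng.random()
        for _ in range(n)
    ]


# Fraction of null p-values above 0.5; it equals 0.5 for a uniform null.
sample = sample_null_pvalues(50000, kappa=0.1)
fraction_large = sum(p > 0.5 for p in sample) / len(sample)
```

For $\kappa>0$ the distribution function satisfies $F_{0}(t)\leq t$ with strict inequality on $(0,1)$ , so large null p-values are enriched, and this excess inflates a count of p-values near one such as the numerator in (14).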

5.3 Simultaneous two-sample testing

In this section we provide an example of a covariate $X_{i}$ that is random and arises from statistical (rather than domain-specific) considerations. We study simultaneous two-sample testing for equality of means following Cai et al. (2019). For the $i$ -th hypothesis we observe

[TABLE]

(everything jointly independent). We are interested in testing $H_{i}:\mu_{Y,i}=\mu_{V,i}$ , $i=1,\dotsc,m$ and assume the variances $\sigma_{i}^{2}$ are known. (Footnote 7: The results extend to unequal sample sizes and to unknown variance. We refer the reader to Supplement S6.2.2 and Bourgon et al. (2010); Liu (2014); Cai et al. (2019).) The optimal test statistic (in single hypothesis testing (Lehmann and Romano, 2005)) for this situation is the two-sample $z$ -statistic $Z_{i}:=\sqrt{n/2}\left(\overline{Y_{i}}-\overline{V_{i}}\right)/\sigma_{i}$ , where $\overline{Y_{i}}$ and $\overline{V_{i}}$ are the sample means in each group. The p-values can be calculated as $P_{i}=2\left(1-\Phi(\left\lvert Z_{i}\right\rvert)\right)$ , where $\Phi$ is the Standard Normal CDF. A basic multiple testing approach consists of applying BH to the p-values $P_{i}$ .

In addition, denote by $\hat{\mu}_{i}\coloneqq\left(\overline{Y_{i}}+\overline{V_{i}}\right)/2$ the pooled average and let $X_{i}\coloneqq\sqrt{2n}\hat{\mu}_{i}/\sigma_{i}$ . A direct covariance calculation reveals that $\text{Cov}(X_{i},Z_{i})=0$ , and so $X_{i}$ and $Z_{i}$ are independent (note the joint normality). Hence we may apply the IHW framework with p-values $P_{i}$ and covariates $X_{i}$ .
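The construction of $Z_{i}$ , $P_{i}$ and $X_{i}$ can be sketched with the Python standard library. The sampling model is an assumption inferred from the text (two groups of $n$ Gaussian observations each with common known variance), so the sample means are drawn directly as $N(\mu,\sigma^{2}/n)$ ; the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def two_sample_statistics(ybar, vbar, sigma, n):
    """Return (Z, X, P) from the two sample means of one hypothesis."""
    z = math.sqrt(n / 2) * (ybar - vbar) / sigma
    x = math.sqrt(2 * n) * ((ybar + vbar) / 2) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, x, p


def simulate_null_correlation(m, n, sigma=1.0, seed=0):
    """Empirical correlation of X and Z over m null hypotheses with zero means."""
    rng = random.Random(seed)
    sd_mean = sigma / math.sqrt(n)
    zs, xs = [], []
    for _ in range(m):
        z, x, _ = two_sample_statistics(rng.gauss(0, sd_mean), rng.gauss(0, sd_mean), sigma, n)
        zs.append(z)
        xs.append(x)
    z_mean, x_mean = sum(zs) / m, sum(xs) / m
    cov = sum((z - z_mean) * (x - x_mean) for z, x in zip(zs, xs))
    var_z = sum((z - z_mean) ** 2 for z in zs)
    var_x = sum((x - x_mean) ** 2 for x in xs)
    return cov / math.sqrt(var_z * var_x)


correlation = simulate_null_correlation(m=10000, n=50, seed=1)
```

The empirical correlation should be close to zero, in line with the covariance calculation: the difference and the sum of two independent sample means with equal variance are uncorrelated.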

In single hypothesis testing, there is nothing to be gained from $X_{i}$ and its usefulness only emerges in the multiple testing setup. $X_{i}$ is a test statistic for the null hypothesis $\mu_{Y,i}=\mu_{V,i}=0$ . If we believe a-priori that for many of the hypotheses $i$ with $\mu_{Y,i}=\mu_{V,i}$ , a sparsity condition holds, so that in fact $\mu_{Y,i}=\mu_{V,i}=0$ , then large absolute values of this statistic are more likely to correspond to alternatives. Note that we did not actually re-specify our null hypothesis from $\mu_{Y,i}=\mu_{V,i}$ to $\mu_{Y,i}=\mu_{V,i}=0$ . We just assumed properties of the null hypotheses to motivate a choice of covariate, and are still testing for $\mu_{Y,i}=\mu_{V,i}$ .

In the simulation, which is similar to simulations in Cai et al. (2019), we generate data from model (15) with $m=10000$ , $n=50$ , $\sigma_{i}=1$ for all $i$ . Furthermore, we vary $m_{1}$ , the number of alternatives and let

[TABLE]

That is, only the first $m_{1}$ hypotheses are alternatives. The next $m-m_{1}$ hypotheses are nulls with the last $m-2m_{1}$ also being nulls with respect to the screening null $\mu_{Y,i}=\mu_{V,i}=0$ . We compare five methods.

1. The Benjamini-Hochberg (BH) procedure applied to $P_{i}$ and ignoring $X_{i}$ .

2. The CARS procedure (covariate-assisted ranking and screening) (Cai et al., 2019): CARS is a multiple testing procedure designed specifically for simultaneous two-sample tests based on $Z_{i}$ and $X_{i}$ . At a high level, CARS learns a function $(z,x)\mapsto$ $\hat{s}_{\text{CARS}}(z,x)$ and a threshold $\hat{t}_{\text{CARS}}$ and rejects all hypotheses such that $\hat{s}_{\text{CARS}}(Z_{i},X_{i})$ $\leq\hat{t}_{\text{CARS}}$ . Asymptotically, CARS controls the $\operatorname{FDR}$ and learns the optimal decision boundary. We use the default settings of the CARS function (option="regular") in the CARS R package.

3. CARS-sparse: a modification of CARS, also proposed by Cai et al. (2019), that is more conservative and empirically alleviates loss of $\operatorname{FDR}$ control in situations with sparse signals (option="sparse" in the CARS package).

4. IHW-Storey-CARS: we use IHW-Storey (Theorem 2) in conjunction with an honest (but not $\tau$ -censored) weighting heuristic based on CARS. We partition hypotheses randomly into 5 folds $I_{1},\dotsc,I_{5}$ . To choose weights for $I_{\ell}$ we proceed as follows: first, we run CARS on the remaining 4 folds and get $\hat{s}^{-\ell}_{\text{CARS}}(\cdot,\cdot)$ and $\hat{t}_{\text{CARS}}^{-\ell}$ . Then, for $i\in I_{\ell}$ , we let $t_{i}$ be the smallest threshold at which $H_{i}$ would get rejected,

[TABLE]

Then we let $\tilde{W}_{i}=2\left(1-\Phi(t_{i})\right)$ , $W_{i}=\left\lvert I_{\ell}\right\rvert\tilde{W}_{i}/\sum_{j\in I_{\ell}}\tilde{W}_{j}$ and finally apply the IHW-Storey procedure from Theorem 2.

5. IHW-Storey-Grenander, as in the grouped multiple testing simulations of Section 5.1; we discretize the covariate $X_{i}$ into 10 groups with 1000 observations each.

The results are shown in Fig. 6. With sparse signal (small $m_{1}$ ), CARS fails to control the $\operatorname{FDR}$ . This observation had also been made by Cai et al. (2019), who therefore proposed a modification, CARS-sparse, which indeed controls $\operatorname{FDR}$ in our simulation, as do all other methods. On the other hand, IHW-Storey-CARS is easy to implement—using existing software for CARS—and turns out to have more power in the simulations than CARS-sparse. IHW-Storey-Grenander also has more power than CARS-sparse.

6 Application example: biological high-throughput data

Grubert et al. (2015) assayed cell lines derived from 75 human individuals for the status of their single nucleotide polymorphisms (SNPs, i.e., differences that exist between the genome sequences of individuals) and a biochemical modification of DNA-associated molecules called H3K27ac. We tested all within-chromosome associations by marginal regression of the quantitative readout from the ChIP-seq assay for H3K27ac on the polymorphisms, which are encoded as categorical variables with levels aa, ab, bb, using the software Matrix eQTL (Shabalin, 2012). Here we restrict ourselves to associations in Chromosomes 1 and 2, for which Grubert et al. reported the status of $N_{1}=645452$ and $N_{2}=699343$ SNPs and the H3K27ac levels at $K_{1}=12193$ and $K_{2}=11232$ genomic positions (“peaks”) on these chromosomes. This results in a total of approximately 16 billion hypotheses ( $m=N_{1}\times K_{1}+N_{2}\times K_{2}\approx 1.6\cdot 10^{10}$ ). (Footnote 8: We note that computing and storing 16 billion p-values puts notable demands on computing infrastructure. Therefore, a common choice made by implementations such as Matrix eQTL (Shabalin, 2012) to reduce storage requirements is to only report p-values below some threshold (e.g., in this case, below $10^{-4}$ ). Benjamini-Hochberg/Yekutieli and IHW-BH/BY can deal with this seamlessly by operating as if the right-censored p-values were equal to $1$ . In contrast, AdaPT depends on the large p-values to estimate the $\operatorname{FDR}$ , cf. (14).) Figure 1 shows the marginal histogram of the p-values and illustrates how these p-values are related to the genomic distance between SNP and H3K27ac peak. This covariate is motivated from biological domain knowledge: associations across shorter distances are a-priori more plausible and empirically more frequent.

We compare two different approaches of dealing with the multiplicity, while controlling the FDR:

1. The Benjamini-Yekutieli (BY) procedure on the $m$ p-values (at level $\alpha=0.01$ ): such a conservative procedure is justified, since p-values for the same H3K27ac peak and different, but genetically linked SNPs will be strongly dependent.

2. The IHW-BY-Grenander method (at level $\alpha=0.01$ ) using as covariate the genomic distance between SNP and H3K27ac peak and weights based on the Grenander estimator after binning based on genomic distance; cf. Section 4.2 and Supplement S4.1 for a description of the algorithm and Supplement S5 for application-specific details. To satisfy Assumption 2 and hence have guaranteed $\operatorname{FDR}$ control by Theorem 4, we partition p-values into two folds corresponding to the different chromosomes. The data for these are, to sufficient approximation, independent.

The results are shown in Figure 7. IHW more than doubles the discoveries compared to the unweighted procedure while maintaining all formal guarantees of FDR control. Panel A shows the learned weight functions for the two folds. Upon applying the weighted BY procedure, the weights translate into thresholds for rejection: hypothesis $i$ is rejected if $P_{i}\leq W_{i}\;\hat{t}_{\text{IHW}}^{*}$ for some common choice of $\hat{t}^{*}_{\text{IHW}}$ and hypothesis-dependent $W_{i}$ (Panel D). In contrast, the BY procedure uses the same rejection threshold $\hat{t}_{\text{BY}}^{*}$ for all hypotheses (Panel C). As a consequence, the BY procedure had to be relatively stringent throughout, while IHW could be permissive at smaller and stringent only at higher distances.

There is another interpretation explaining why IHW increases power: it attempts to set thresholds in a way that balances the conditional local false discovery rate ( $\operatorname{fdr}$ ), at least among the non-zero thresholds. This is shown in Panel F. Indeed, under certain assumptions, the optimal decision boundary is one of constant local $\operatorname{fdr}$ , cf. Lei and Fithian (2018, Theorem 2). On the other hand, since BY thresholds only depend on the p-values, the local fdr varies widely and increases as a function of genomic distance, as seen in Panel E.

Finally, we note that the estimation method for the local fdr in Panels E and F is the same that was used to derive the weights. The local fdr estimates appear to be noisy; even inaccurate estimates of the local fdr can lead to powerful weights (increase in number of discoveries). Furthermore, the frequentist guarantees of type-I error control of IHW are independent of and unaffected by (in)accuracies of the local fdr estimate.

7 Further relations to previous work

Throughout this manuscript we have emphasized the relationship of the present research to previous work. In particular, in our numerical study in Section 5 we compared IHW to previously developed methods for grouped multiple testing, multiple testing with continuous covariates and simultaneous two-sample testing. In this section we provide some further connections of IHW to previous work.

7.1 Ignatiadis, Klaus, Zaugg, and Huber (2016)

The idea of cross-weighting for $\operatorname{FDR}$ control was introduced as one of three empirically promising heuristics by Ignatiadis, Klaus, Zaugg, and Huber (2016); the other two heuristics being convex relaxations and regularization of the weights towards unity and/or low total variation. The contribution of this paper relative to Ignatiadis et al. (2016) is to clarify essential versus circumstantial concepts (e.g., Ignatiadis et al. (2016) only considered one possibility for weighting hypotheses through the Grenander estimator) and to establish formal, finite-sample FDR control for IHW-BH. We also show how the fundamental idea of cross-weighting applies beyond independence and introduce cross-weighted variants of the $k$ -Bonferroni and BY procedures for $k$ -FWER and FDR control under dependence.

7.2 Sample splitting

One of the initial attempts at data-driven weights (Rubin et al., 2006) used another form of data-splitting: consider the setting where we start with an $m\times n$ data-matrix from which we get our p-values $P_{i}$ by calculating the test statistic in a row-wise fashion, say by applying a $t$ -test for each row. Then one can calculate $m$ “prior” p-values $P_{i}^{\prime\prime}$ based on $n_{1}<n$ columns and derive prior weights $W_{i}$ based on $P_{i}^{\prime\prime}$ . The remaining $n-n_{1}$ columns are used to compute p-values $P_{i}^{\prime}$ . Finally, a weighted multiple testing procedure is applied with p-values $P_{i}^{\prime}$ and weights $W_{i}$ . However, the authors then show that in this case it is more powerful to simply use an unweighted procedure with p-values $P_{i}$ calculated based on the whole dataset, rather than a weighted procedure with sample-splitting. Habiger and Peña (2014) pursue a similar approach. For IHW, we instead split horizontally (on hypotheses) rather than vertically (on samples), and the p-values $P_{i}$ are unaltered.
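The distinction between the two kinds of splitting can be made concrete with a sketch of the hypothesis-level split. The code below assigns the $m$ rows (hypotheses) to balanced random folds and lists, for a given fold, the rows whose p-values may be used to learn its weights; all $n$ columns are retained for every p-value. The function names are hypothetical, and balanced folds are one convenient choice rather than a requirement stated in the text.

```python
import random


def random_folds(m, n_folds, seed=0):
    """Assign each of m hypotheses to one of n_folds balanced random folds."""
    rng = random.Random(seed)
    folds = [i % n_folds for i in range(m)]
    rng.shuffle(folds)
    return folds


def out_of_fold_indices(folds, fold):
    """Hypotheses whose p-values may inform the weights of the given fold."""
    return [i for i, f in enumerate(folds) if f != fold]


folds = random_folds(m=10000, n_folds=5)
training_rows = out_of_fold_indices(folds, fold=0)
```

Weights for the hypotheses in fold 0 would then be learned from the p-values of `training_rows` only, together with the covariates, while the p-values in fold 0 itself stay untouched until the weighted procedure is applied.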

7.3 The weighted False Discovery Rate

In this work, we have studied heterogeneous multiple testing with the aim of increasing power, while controlling the $k$ -FWER or the $\operatorname{FDR}$ . However, in light of non-exchangeability, the cost of a false discovery to the researcher may not be uniform, but vary across hypotheses; e.g., it may be equal to $a_{i}\geq 0$ for hypothesis $H_{i}$ . Then it is of scientific interest to control the weighted $\operatorname{FDR}$ of Benjamini and Hochberg (1997) defined as

[TABLE]

Similarly, the utility (benefit) $b_{i}$ of a true discovery may vary across hypotheses. Then, instead of maximizing the expected number of discoveries (cf. Section 4), it may be more pertinent to maximize the expected total benefit. Basu et al. (2018) study optimal oracle procedures that achieve this optimization goal subject to control of $\text{wFDR}(\textbf{a})$ , as well as data-driven procedures that achieve the same goal asymptotically. In future work it would be of interest to study whether cross-weighting may be applied to derive flexible and powerful procedures with finite-sample control of $\text{wFDR}(\textbf{a})$ . We expect this to be tractable (for example, by leveraging the results of Ramdas et al. (2019)) and useful if the utility $b_{i}$ is a function of the covariates, i.e., $b_{i}=b(X_{i})$ .

8 Discussion

Despite the ubiquitous uptake by the natural sciences of the concepts of multiple testing (and in particular the FDR), and despite ever growing volumes of data and possible hypothesis tests, surprisingly little attention has been paid to systematic approaches to account for hypothesis heterogeneity in order to increase detection power. While this may be justifiable in situations where power is large anyway, in many cases the costs of the underlying experiments or studies are substantial and increase with sample size, and power decides between success and failure. In such cases, an approach that increases power compared to a baseline analysis, at no cost and by purely computational means, should be of interest.

Our approach is an instance of the value of large scale data (Efron, 2010): due to dataset size, modeling and inference opportunities open up that were previously irrelevant or impossible. In addition to the p-values $P_{i}$ , our approach uses two further inputs: the covariates $X_{i}$ and the fold assignment. These are different concepts and their construction is unrelated to each other. The $X_{i}$ are informative about power and/or prior probability of the tests, but independent of $P_{i}$ under the null hypothesis. Meanwhile, the folds are constructed as a device for the cross-weighting scheme, in order to achieve type-I error control: we want independence of folds so that the weights do not lead to overfitting. Their choice is unrelated to power. Random folds are an easy default, but to get independent folds, it is then necessary to require global independence (Assumption 1). When global independence cannot be assumed, the dependences are in many application scenarios, loosely speaking, “local” (under some suitable choice of metric on the set of hypotheses). This can be used to construct folds that are independent, at least to sufficient approximation. Making this loose description precise requires specification of individual application scenarios and the associated domain knowledge, as in the example of Section 6.

If, for a dataset at hand, independent folds cannot be achieved by any available fold-splitting scheme, it is possibly better not to try to address the dependences at the level of the multiple testing procedure, but upstream: strong, dataset-wide dependences often signal the need for a fundamental rethink of the analysis approach.

Sometimes, dataset-wide dependences are caused by so-called batch effects. They are undesirable, uninteresting with respect to the scientific question, and can be reduced or avoided by good experimental design (Leek et al., 2010). Once they are a matter of fact, it is sometimes possible to remove them by mapping the data to a new set of properly “normalized” and “batch-corrected” variables (Leek and Storey, 2008; Stegle et al., 2010; Wang et al., 2017).

If avoiding dependence by modifying the analysis upstream of the multiple testing treatment is not possible, the analyst should also consider whether multiple marginal hypothesis tests are indeed more appropriate than, say, dimension reduction, or a multivariate model with $\operatorname{FDR}$ guarantees (Candès et al., 2018; Sesia et al., 2019; Ren and Candès, 2020).

Code availability and reproducibility

The study is made fully third-party reproducible, and we provide its code on GitHub at https://github.com/Huber-group-EMBL/covariate-powered-cross-weighted-multiple-testing. The Bioconductor package IHW (http://bioconductor.org/packages/IHW) provides a user-friendly implementation of IHW-BH/Storey based on the Grenander estimator.

Acknowledgments

We thank Judith Zaugg for making available data for the example in Section 6, and Edgar Dobriban, William Fithian, Susan Holmes, Lihua Lei, Michael Love, Gesthimani Roumpani, Stelios Serghiou, Michael Sklar, Youngtak Sohn, Oliver Stegle, Mark van de Wiel and Britta Velten for helpful discussions and critical comments on the manuscript. We thank Stefan Wager, an anonymous associate editor and two anonymous reviewers for feedback that motivated us to substantially improve the manuscript. Michael Sklar proposed the counterexample from Supplement S1.3. W.H. acknowledges support from the German Federal Ministry of Education and Research, Grant MOFA, under grant contract No. 031L0171A. N.I. acknowledges support from a Ric Weiland Graduate Fellowship.

Supplement S1: Finite-sample results for FDR control of IHW

Throughout Supplementary Section S1, the weights $W_{i}$ are considered random. Occasionally we explicitly condition on the weights, in which case we verify how conditioning on (subsets of) the weights influences the relevant conditional distributions.

S1.1 A preliminary lemma

The key property of IHW that enables finite-sample type-I error control is the following: cross-weighting makes the p-values and their weights independent of each other. This was already demonstrated at the beginning of the proof of Theorem 3 in Section 3.2. Here we formalize this result through the following lemma:

Lemma 1.

Let $(W_{i})_{i\in[m]}$ be honest weights (Specification 1) w.r.t. the partition $I_{1},\dotsc,I_{K}$ of $[m]$ .

If $((P_{i},X_{i}))_{i\in[m]}$ satisfy Assumption 2, then:

(a)

For all $\ell\in[K]$ and all $i\in\mathscr{H}_{0}\cap I_{\ell}$ , $P_{i}$ is independent of $(W_{k})_{k\in I_{\ell}}$ . In particular $P_{i}$ is independent of $W_{i}$ ( $P_{i}\perp W_{i}$ ) for all $i\in\mathscr{H}_{0}$ .

The conclusion may be strengthened if instead $((P_{i},X_{i}))_{i\in[m]}$ satisfy Assumption 1:

(a’)

For all $\ell\in[K]$ , $(P_{i})_{i\in\mathscr{H}_{0}\cap I_{\ell}}$ is independent of $(W_{i})_{i\in\mathscr{H}_{0}\cap I_{\ell}}$ .

(b’)

For all $\ell\in[K]$ , $(P_{i})_{i\in\mathscr{H}_{0}\cap I_{\ell}}$ are jointly independent and super-uniform conditionally on $(W_{i})_{i\in\mathscr{H}_{0}\cap I_{\ell}}$ .

Proof.

We prove (a); the other statements follow similarly. Fix $\ell\in[K]$ and let $i\in\mathscr{H}_{0}\cap I_{\ell}$ . By definition of honesty (Specification 1), $(W_{k})_{k\in I_{\ell}}$ is a function only of $(P_{j})_{j\in I_{\ell}^{c}}$ and $\mathbf{X}=(X_{j})_{j\in[m]}$ . It thus suffices to argue that $P_{i}$ is independent of $((P_{j})_{j\in I_{\ell}^{c}},\mathbf{X})$ . Writing the latter as $((P_{j})_{j\in I_{\ell}^{c}},(X_{j})_{j\in I_{\ell}^{c}},(X_{j})_{j\in I_{\ell}})$ , we conclude as a consequence of parts (a) and (b) of Assumption 2. ∎
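To make the honesty property used in Lemma 1 concrete, the following Python sketch computes the weights of each fold from the p-values of the other folds only, together with the covariates. The learner `learn_weight_fn` is a hypothetical placeholder (a crude two-bin heuristic on covariates assumed to lie in $[0,1)$), not the Grenander-based learner of the IHW package; the within-fold normalization so that the weights sum to $m$ is likewise an assumption of this sketch.

```python
import random

def learn_weight_fn(p_out, x_out, n_bins=2, cutoff=0.1):
    """Hypothetical learner: score each covariate bin by its share of small p-values."""
    small = [0] * n_bins
    count = [0] * n_bins
    for pj, xj in zip(p_out, x_out):
        b = min(int(xj * n_bins), n_bins - 1)
        count[b] += 1
        small[b] += pj <= cutoff
    score = [(1 + s) / (1 + c) for s, c in zip(small, count)]
    return lambda xi: score[min(int(xi * n_bins), n_bins - 1)]

def cross_weights(p, x, folds, learner):
    """Weights for fold l use only p-values outside fold l (plus covariates)."""
    m = len(p)
    w = [0.0] * m
    for fold in folds:
        inside = set(fold)
        out = [j for j in range(m) if j not in inside]
        weight_fn = learner([p[j] for j in out], [x[j] for j in out])
        raw = [weight_fn(x[i]) for i in fold]
        total = sum(raw)  # strictly positive for the learner above
        for i, r in zip(fold, raw):
            w[i] = r * len(fold) / total  # weights in the fold sum to its size
    return w

rng = random.Random(0)
m = 12
x = [rng.random() for _ in range(m)]
p = [rng.random() for _ in range(m)]
folds = [list(range(0, 6)), list(range(6, 12))]
w = cross_weights(p, x, folds, learn_weight_fn)
```

Perturbing p-values inside a fold leaves that fold's weights untouched, which is the functional dependence that the proof of Lemma 1 exploits.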

S1.2 The IHW-BH procedure under independence: Proof of Theorem 1

Proof.

Let $\mathbf{W}$ be the weights and $\hat{k}$ the number of discoveries after applying IHW-BH at level $\alpha$ and with censoring level $\tau$ . Also write $\mathbf{X}=(X_{1},\dotsc,X_{m})$ , $\mathbf{P}=(P_{1},\dotsc,P_{m})$ and $\mathbf{1}(\mathbf{P}\leq\tau)=(\mathbf{1}(P_{1}\leq\tau),\dotsc,\mathbf{1}(P_{m}\leq\tau))$ . Here $\mathbf{1}(P_{i}\leq\tau)$ is the indicator function that is $1$ when $P_{i}\leq\tau$ and $0$ otherwise.

We first give the high-level idea of the proof. To bound the $\operatorname{FDR}$ we seek to bound expectations of $\mathbf{1}(H_{i}\text{ rejected})/(\hat{k}\lor 1)$ , i.e., of $\mathbf{1}(P_{i}\leq\alpha W_{i}\hat{k}/m,\;P_{i}\leq\tau)/(\hat{k}\lor 1)$ where $i$ is null. (We use the notation $a\lor b=\max\left\{a,b\right\}$ and $a\land b=\min\left\{a,b\right\}$ .) If $W_{i},\hat{k}$ were independent of $P_{i}$ , then we could directly upper bound this expectation by $\mathbb{E}[(\alpha W_{i}\hat{k}/m)/(\hat{k}\lor 1)]\leq\mathbb{E}[\alpha W_{i}/m]$ from which $\operatorname{FDR}$ control would follow by summing over all $i$ . Honesty (Specification 1) makes $P_{i}$ and its weight $W_{i}$ (for a single null $i$ ) independent, in the way of Lemma 1. However, $P_{i}$ directly influences $\hat{k}$ . This is true also for unweighted BH and weighted BH with deterministic weights, yet here $P_{i}$ also indirectly influences $\hat{k}$ through the weights $W_{j},\;j\neq i$ . Nevertheless, we will argue that the conclusion may still be salvaged: $\tau$ -censoring (Specification 2) ensures that on the event $\left\{P_{i}\leq\tau\right\}$ the exact value of $P_{i}$ cannot influence weights $W_{j},\;j\neq i$ . Furthermore, it suffices to only consider the event $\left\{P_{i}\leq\tau\right\}$ (in turn for each null $i$ ), since $i$ will never get rejected when $P_{i}>\tau$ (by Definition 2).

We make the above intuition rigorous using a leave-one-out argument as in the proof idea of Li and Barber [2019]. Let us first pay attention to a single index $i\in[m]$ . We denote by $k_{i}$ the number of discoveries of IHW-BH if $\mathbf{P}$ gets replaced by $\mathbf{P}_{i\mapsto 0}=(P_{1},\dotsc,P_{i-1},0,P_{i+1},\dotsc,P_{m})$ . Note that because the weights are $\tau$ -censored (Specification 2), the tuple of weights $\mathbf{W}$ remains unchanged by replacing $P_{i}$ by $0$ on the event $\{P_{i}\leq\tau\}$ . Furthermore, by definition of the $\tau$ -censored weighted BH procedure (Definition 2), the rejection of $H_{i}$ (by IHW-BH applied to $\mathbf{P}$ ) implies that $P_{i}\leq\left(\frac{\alpha W_{i}\hat{k}}{m}\right)\land\tau$ . In particular, the event $\{P_{i}\leq\tau\}$ holds. Furthermore, when $H_{i}$ is rejected, for any $k\geq\hat{k}$ , counting the entries of $\mathbf{P}$ , respectively $\mathbf{P}_{i\mapsto 0}$ , that are not greater than the corresponding entries of $\left(\frac{\alpha\mathbf{W}k}{m}\right)\land\tau$ must yield the same number. We conclude that:

[TABLE]

Therefore,

[TABLE]

Note at this point that we can assume without loss of generality that $\mathbb{P}[P_{i}\leq\tau]>0$ for all $i\in\mathscr{H}_{0}$ . Otherwise, just set $\mathscr{H}_{0}^{\prime}=\{i\in\mathscr{H}_{0}\mid\mathbb{P}[P_{i}\leq\tau]>0\}$ and all the steps below will go through essentially unchanged with $\mathscr{H}_{0}^{\prime}$ replacing $\mathscr{H}_{0}$ . For $i\in\mathscr{H}_{0}$ and conditioning on the event $\{P_{i}\leq\tau\}$ and on the random vectors $\mathbf{W},\mathbf{X},\mathbf{P}_{i\mapsto 0},\mathbf{1}(\mathbf{P}\leq\tau)$ , we get

[TABLE]

This follows because for $i\in\mathscr{H}_{0}$ it holds that $P_{i}$ is super-uniform, $\mathbb{P}[P_{i}\leq\tau]>0$ and $P_{i}$ is independent of $(\mathbf{P}_{i\mapsto 0},\mathbf{X})$ and also because $k_{i}$ , $\mathbf{W}$ , $\mathbf{1}(\mathbf{P}\leq\tau)$ are functions of $(\mathbf{P}_{i\mapsto 0},\mathbf{X})$ on the event $\{P_{i}\leq\tau\}$ . It then follows that

[TABLE]

Moreover, by marginalization over $\mathbf{P}_{i\mapsto 0}$ and $\mathbf{X}$ (and noting again that $\mathbf{1}(H_{i}\text{ rejected})=0$ when $\mathbf{1}(P_{i}\leq\tau)=0$ ),

[TABLE]

In total, we thus get

[TABLE]

At this point we diverge from the proof of Li and Barber [2019] and take advantage of honesty (Specification 1) through Lemma 1.

[TABLE]

Going from the second to the third line, we used that for $i\in\mathscr{H}_{0}$ , $P_{i}$ is independent of $W_{i}$ , which holds from Lemma 1(a’). In the last step, we used Part (b) of the Honesty specification.

∎
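The leave-one-out step of the proof above can be checked numerically. The Python sketch below implements the $\tau$ -censored weighted BH rule as it is described in the proof (count the p-values not exceeding $(\alpha W_{i}k/m)\land\tau$ and take the largest admissible $k$ ) and verifies that, for every rejected index $i$ , replacing $P_{i}$ by $0$ leaves the number of discoveries unchanged. The weights are held fixed, which is what $\tau$ -censoring guarantees on the event $\{P_{i}\leq\tau\}$ ; the function names and the simulated inputs are illustrative assumptions, not part of the paper.

```python
import random

def censored_weighted_bh(p, w, alpha, tau):
    """tau-censored weighted BH with fixed weights: returns (k_hat, rejected indices)."""
    m = len(p)

    def n_below(k):
        # Count p-values not exceeding min(alpha * W_i * k / m, tau).
        return sum(pi <= min(alpha * wi * k / m, tau) for pi, wi in zip(p, w))

    k_hat = max((k for k in range(1, m + 1) if n_below(k) >= k), default=0)
    rejected = [i for i in range(m)
                if k_hat > 0 and p[i] <= min(alpha * w[i] * k_hat / m, tau)]
    return k_hat, rejected

def leave_one_out_consistent(p, w, alpha, tau):
    """Check k_i == k_hat for every rejected i, where k_i uses P_i replaced by 0."""
    k_hat, rejected = censored_weighted_bh(p, w, alpha, tau)
    for i in rejected:
        p_loo = list(p)
        p_loo[i] = 0.0
        k_i, _ = censored_weighted_bh(p_loo, w, alpha, tau)
        if k_i != k_hat:
            return False
    return True

rng = random.Random(0)
m = 20
# Five very small p-values followed by uniform ones (illustrative data only).
p = [0.01 * rng.random() for _ in range(5)] + [rng.random() for _ in range(m - 5)]
raw = [rng.uniform(0.5, 1.5) for _ in range(m)]
w = [m * r / sum(raw) for r in raw]  # fixed nonnegative weights summing to m
k_hat, rejected = censored_weighted_bh(p, w, alpha=0.2, tau=0.5)
```

The check only covers the deterministic counting identity; the probabilistic part of the argument (super-uniformity and independence) is not simulated here.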

S1.3 Counterexample to demonstrate that honesty of weights does not suffice for FDR control (due to M. Sklar)

In this section, we provide a counterexample showing that the result of Theorem 1 no longer holds if we drop the assumption of $\tau$ -censored weighting. This is in contrast, e.g., to the conclusion of Theorem 3 for $k$ -Bonferroni, wherein honesty of the weights suffices (along with distributional assumptions on $((P_{i},X_{i}))_{i\in[m]}$ ).

Our agenda is as follows: for $m=4$ , we construct $((P_{i},X_{i}))_{i\in[m]}$ under the global null such that Assumption 1 holds. Then we construct honest weights $W_{i}$ (Specification 1) and finally we apply the weighted BH procedure at level $\alpha\in(0,1)$ (Definition 2 with $\tau=1$ ) with p-values $P_{i}$ and weights $W_{i}$ . We will show this procedure does not control the $\operatorname{FDR}$ at the nominal level.

We observe four independent and uniform (null) p-values $P_{1},P_{2},P_{3},P_{4}$ . Our covariates take values $X_{i}=i$ . We partition the hypotheses into the folds $\left\{1,2\right\}$ and $\left\{3,4\right\}$ . The (adversarial) honest weighting scheme is as follows: If $\frac{\alpha}{2}\leq P_{1}\leq\alpha$ , assign $W_{3}=2,W_{4}=0$ . Otherwise assign $W_{3}=0,W_{4}=2$ . Similarly if $\frac{\alpha}{2}\leq P_{3}\leq\alpha$ , then assign $W_{1}=2,W_{2}=0$ and otherwise $W_{1}=0,W_{2}=2$ . These weights are honest; note that $W_{i}\geq 0$ for all $i$ , $\sum_{i=1}^{4}W_{i}=4$ .

To study the $\operatorname{FDR}$ of this procedure we partition the sample space according to the four possibilities for the weight assignment. Also note that due to the weighting scheme in the end we will be applying unweighted Benjamini-Hochberg to two hypotheses at level $\alpha$ . For notational convenience we will write $\text{BH}(P_{i},P_{j})$ for the event that BH applied to $P_{i},P_{j}$ at level $\alpha$ rejects at least one of these two p-values.

Case 1: Here we have $W_{2}=W_{4}=2$ and $W_{1}=W_{3}=0$ . Thus we are just doing unweighted Benjamini-Hochberg on the p-values $P_{2}$ and $P_{4}$ . Noting that occurrence of this case depends only on $P_{1},P_{3}$ , we get by independence:

[TABLE]

Case 2: Now consider $W_{1}=W_{3}=2$ and $W_{2}=W_{4}=0$ . In this case, we know that both $\frac{\alpha}{2}\leq P_{1}\leq\alpha$ and $\frac{\alpha}{2}\leq P_{3}\leq\alpha$ . These in turn imply that $\text{BH}(P_{1},P_{3})$ also holds (in fact BH rejects both hypotheses). Thus:

[TABLE]

Case 3: Now let $W_{1}=W_{4}=2$ and $W_{2}=W_{3}=0$ . Then:

[TABLE]

The latter is true since if $P_{1}\not\in[\frac{\alpha}{2},\alpha]$ , the only way BH will reject is if $P_{1}<\frac{\alpha}{2}$ or $P_{4}\leq\frac{\alpha}{2}$ . Hence the event on the RHS can be written as the disjoint union of $\{P_{1}<\alpha/2\}$ and $\{P_{4}\leq\alpha/2,P_{1}>\alpha\}$ .

Case 4: By symmetry with Case 3, this contributes the same probability.

Summing up all 4 cases, we see that

[TABLE]

Hence $\operatorname{FDR}$ is not controlled at the nominal level $\alpha$ .
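The counterexample can also be examined by simulation. The Python sketch below draws four independent uniform p-values, applies the adversarial honest weighting scheme, and uses the reduction noted above (BH at level $\alpha$ on the two hypotheses that carry weight $2$ ). Under the global null every rejection is false, so the $\operatorname{FDR}$ equals the probability of at least one rejection. The closed form $\alpha+\alpha^{2}(1-\alpha)/4$ in the code is our own summation of the four cases described in the text and should be read as a reconstruction, not as a quotation of the paper's display.

```python
import random

def bh_two_rejects(pa, pb, alpha):
    """True if BH at level alpha on two p-values rejects at least one of them."""
    lo, hi = min(pa, pb), max(pa, pb)
    return lo <= alpha / 2 or hi <= alpha

def any_rejection(p1, p2, p3, p4, alpha):
    """Apply the adversarial honest weights and report whether anything is rejected."""
    def in_band(q):
        return alpha / 2 <= q <= alpha
    # Weights in fold {3, 4} depend only on P1; weights in fold {1, 2} only on P3.
    from_fold_34 = p3 if in_band(p1) else p4  # hypothesis with weight 2 in {3, 4}
    from_fold_12 = p1 if in_band(p3) else p2  # hypothesis with weight 2 in {1, 2}
    return bh_two_rejects(from_fold_12, from_fold_34, alpha)

def estimate_fdr(alpha, n_rep=200_000, seed=1):
    """Monte Carlo estimate of P(at least one rejection) under the global null."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        hits += any_rejection(rng.random(), rng.random(),
                              rng.random(), rng.random(), alpha)
    return hits / n_rep

def reconstructed_fdr(alpha):
    """Sum of the four cases as worked out by us from the text (an assumption)."""
    return alpha + alpha ** 2 * (1 - alpha) / 4
```

For instance, `estimate_fdr(0.5)` can be compared against `reconstructed_fdr(0.5)` and against the nominal level `0.5`.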

S1.4 The IHW-Storey procedure under independence: Proof of Theorem 2

Proof.

Take $i\in I_{\ell}\cap\mathscr{H}_{0}$ and define the leave-one-out null proportion estimator (compare to Equation (5)):

[TABLE]

Now note that on the event $\{P_{i}\leq\tau\}$ (since $\tau^{\prime}\geq\tau$ ) we have that:

[TABLE]

Next, define

[TABLE]

Equation (16) implies that running the $\tau$ -censored, weighted BH procedure (Definition 2) with p-values $P_{i}$ and weights $W_{i}/\hat{\pi}_{0,I_{\ell}}$ (i.e., the procedure whose $\operatorname{FDR}$ control we seek to prove) yields identical rejections if we replace the weights by $\widetilde{W}_{i}$ . Hence we turn to study the procedure with weights $\widetilde{W}_{i}$ . Proceeding as in the leave-one-out argument of the proof of Theorem 1 we get

[TABLE]

In fact, since $\hat{\pi}_{0,I_{\ell}}^{-i}$ does not depend on $P_{i}$ (it depends on $\mathbf{P}_{i\mapsto 0}$ ), all arguments of the proof of Theorem 1 go through unchanged with $\widetilde{W}_{i}$ replacing $W_{i}$ . The only step we need to pay attention to is the last line: it no longer holds that

[TABLE]

Indeed we are hoping that this sum is greater than $m$ so that we can gain power by the null-proportion adaptivity. Instead, it suffices to argue that

[TABLE]

And hence it also suffices to prove that for each fold $\ell$ the following holds

[TABLE]

To prove this, we first recall from Lemma 1(a’) that

[TABLE]

For notational convenience we write $\mathbf{W}_{\mathscr{H}_{0}\cap I_{\ell}}$ for $(W_{i})_{i\in\mathscr{H}_{0}\cap I_{\ell}}$ . Then:

[TABLE]

In the penultimate line we used the Inverse Binomial Lemma (Lemma 3 in Ramdas et al. [2019]), noting that conditionally on $\mathbf{W}_{\mathscr{H}_{0}\cap I_{\ell}}$ , the weights in fold $\ell$ may be treated as deterministic and by Lemma 1(b’) the p-values $(P_{i})_{i\in\mathscr{H}_{0}\cap I_{\ell}}$ are jointly independent and super-uniform. We conclude our proof by iterated expectation and summing over $i\in\mathscr{H}_{0}\cap I_{\ell}$ . ∎

S1.5 The IHW-BY procedure under dependence: Proof of Theorem 4

Proof.

We will equivalently prove that applying the weighted Benjamini-Hochberg procedure (without censoring, i.e., $\tau=1$ ) at level $\alpha$ controls the $\operatorname{FDR}$ at level $\alpha\sum_{k=1}^{m}\frac{1}{k}$ .

For a probability measure $\nu$ on $\mathbb{R}^{+}$ , we define the reshaping function $\tilde{\beta}:\mathbb{R}^{+}\to\mathbb{R}^{+}$ [Blanchard and Roquain, 2008, Ramdas et al., 2019]:

[TABLE]

Furthermore, let $\hat{k}$ be the number of rejections of the IHW-BH procedure applied at level $\alpha$ . Then for arbitrary $c>0$ , $i\in\mathscr{H}_{0}$ and on the event $\{W_{i}>0\}$ :

[TABLE]

The inequality follows from Lemma 3.2. (iii) in Blanchard and Roquain [2008] (also Lemma 1(c) in Ramdas et al. [2019]), which we reproduce in a slightly modified form here for the reader’s convenience:

Lemma 2.

Let $U$ be a super-uniform random variable and $S>0$ another random variable. Then for all fixed $t>0$ :

[TABLE]

We recover (17) by applying Lemma 2 conditionally on $W_{i}$ with $U=P_{i}$ , $S=\hat{k}\lor 1$ and $t=\frac{c\alpha W_{i}}{m}$ . To do so, note that we may treat $\frac{c\alpha W_{i}}{m}$ as a constant conditionally on $W_{i}$ and that $P_{i}\mid W_{i}$ is super-uniform, since $P_{i}\perp W_{i}$ by Lemma 1(a) and $P_{i}$ is unconditionally super-uniform.

Inequality (17) also holds true almost surely on the event $\{W_{i}=0\}$ , as the distribution of $P_{i}\mid W_{i}$ cannot have a point mass at $0$ , since this would contradict super-uniformity. Thus we also get unconditionally that

[TABLE]

Now, consider the special case in which we use the measure $\nu(x)=\frac{1}{\sum_{k=1}^{m}\frac{1}{k}}\sum_{k=1}^{m}\frac{1}{k}\delta_{k}(x)$ , where $\delta_{k}$ is the point mass at $k$ . Then the reshaping function takes the form $\tilde{\beta}(r)=\frac{r}{\sum_{k=1}^{m}\frac{1}{k}}$ for $r\in\mathbb{N}_{\geq 0}$ . Applying the above result with this $\tilde{\beta}$ and $c=\sum_{k=1}^{m}\frac{1}{k}$ we get

[TABLE]

We conclude by using that $\sum_{i=1}^{m}W_{i}=m\text{ almost surely}$ as follows

[TABLE]

Note that the above proof extends to applying the weighted BH procedure with arbitrary reshaping function $\tilde{\beta}$ as in Blanchard and Roquain [2008], Ramdas et al. [2019].

∎
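The special case used in the proof above can be verified with exact rational arithmetic. The Python sketch below assumes the standard form of the reshaping function from Blanchard and Roquain [2008], $\tilde{\beta}(r)=\int_{0}^{r}x\,d\nu(x)$ (the paper's own display is not reproduced in this text, so this form is an assumption), and checks that the measure $\nu$ with mass proportional to $1/k$ at $k=1,\dotsc,m$ gives $\tilde{\beta}(r)=r/\sum_{k=1}^{m}\frac{1}{k}$ at integer $r$ .

```python
from fractions import Fraction

def harmonic(m):
    """Exact harmonic number H_m = sum_{k=1}^m 1/k."""
    return sum(Fraction(1, k) for k in range(1, m + 1))

def reshaping(r, m):
    """Assumed form beta(r) = int_0^r x dnu(x), nu = (1/H_m) sum_k (1/k) delta_k."""
    h = harmonic(m)
    # Each atom k <= r contributes k * (1/k) / H_m = 1 / H_m.
    return sum(k * Fraction(1, k) / h for k in range(1, m + 1) if k <= r)

def by_level(alpha, m):
    """Level at which weighted BH is run so that the bound alpha * H_m becomes alpha."""
    return alpha / harmonic(m)
```

With $c=\sum_{k=1}^{m}\frac{1}{k}$ the product $c\cdot\tilde{\beta}(r)$ equals $r$ at integer $r$ , which is why the reshaped thresholds coincide with the BH thresholds in this special case.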

S1.6 Counterexample to demonstrate that BY with $\tau$ -censored data-driven weights does not control $\operatorname{FDR}$

For our counterexample, we consider the following $\tau$ -censored way of assigning data-driven weights: assign weight $W_{i}=0$ to all hypotheses with p-value greater than $\tau$ and distribute the remaining weight equally across all hypotheses with p-value $\leq\tau$ in any given fold. This weighting procedure satisfies $\tau$ -censoring (Specification 2) as it only uses whether a p-value is below or above $\tau$ ; however, it does not satisfy honesty (Specification 1). Finally, we apply the weighted Benjamini-Yekutieli procedure with p-values $P_{i}$ and weights $W_{i}$ .
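The weighting scheme just described is simple enough to write down directly. The Python sketch below is one possible rendering; the function name and the convention that a fold with no p-value below $\tau$ receives all-zero weights are assumptions of this sketch.

```python
def censored_equal_weights(p, folds, tau):
    """Weight 0 if p > tau; otherwise an equal share of the fold's total weight."""
    w = [0.0] * len(p)
    for fold in folds:
        small = [i for i in fold if p[i] <= tau]
        for i in small:
            # The fold's total weight len(fold) is split among p-values <= tau.
            w[i] = len(fold) / len(small)
    return w

one_fold = [[0, 1, 2, 3]]
w_a = censored_equal_weights([0.01, 0.9, 0.3, 0.6], one_fold, tau=0.5)
# Moving P_0 while staying below tau does not change any weight (tau-censoring).
w_b = censored_equal_weights([0.40, 0.9, 0.3, 0.6], one_fold, tau=0.5)
# Moving P_0 above tau changes W_0 itself, so the weights are not honest.
w_c = censored_equal_weights([0.70, 0.9, 0.3, 0.6], one_fold, tau=0.5)
```

The contrast between `w_b` and `w_c` shows the two properties side by side: the weights depend on the p-values only through the indicators $\mathbf{1}(P_{i}\leq\tau)$ , but the weight of a hypothesis depends on its own p-value.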

Proof.

The result for this counterexample depends on $m,\tau,\alpha$ . We make the following simplifying assumptions on these: first, to avoid issues with rounding, we assume that $\tau\in\mathbb{Q}$ and $m$ is such that $m\cdot\tau\in\mathbb{N}$ . Furthermore, we assume that $\alpha\leq\tau$ . ( $\operatorname{FDR}$ control is also violated when $\alpha>\tau$ : just replace $\tau$ by $\max\left\{\alpha,\tau\right\}$ in the following arguments.)

Below, we will construct a joint distribution on $((P_{i},X_{i}))_{i\in[m]}$ such that Assumption 2 holds with one fold, i.e., $K=1$ and $I_{1}=[m]$ . We discuss the case of two independent folds at the end of the proof.

First, we draw $X_{i}\overset{\text{iid}}{\sim}U[0,1]$ , independent of the p-values $(P_{i})_{i\in[m]}$ . The joint distribution of the p-values is constructed (details below) so that exactly $m\tau$ p-values are $\leq\tau$ . This means that the $m\tau$ hypotheses with p-value $\leq\tau$ are assigned weights $m/(m\tau)=1/\tau$ . Letting $\alpha_{BY}=\alpha\big/\sum_{j=1}^{m}\frac{1}{j}$ , weighted BY will reject at least $k$ hypotheses if:

[TABLE]

Next we define $q_{k}=\frac{\alpha_{BY}\cdot k}{m\tau},k\geq 1$ , $q_{0}=0$ . Then weighted BY will make at least $k$ rejections (for $k\leq m\tau$ ) if:

[TABLE]

It remains to provide a distribution on p-values $(P_{1},\dotsc,P_{m})$ such that Assumption 2 is satisfied and such that (18) with $k\geq 1$ occurs frequently enough so that $\operatorname{FDR}$ control is violated. To this end, we generate the $m$ p-values hierarchically. Our construction is a modification of an unpublished proof of the worst-case behavior of BH under dependence by Emmanuel Candès and Rina Foygel Barber. This proof has appeared in the STATS300C lecture notes of Emmanuel Candès, available at https://statweb.stanford.edu/~candes/teaching/stats300c/. The p-values are generated as follows:

1. We draw a set of indices $\mathcal{T}\subset\{1,\dotsc,m\}$ of cardinality $m\tau$ uniformly at random from $\{1,\dotsc,m\}$ .

2. For $i\notin\mathcal{T}$ , we draw $P_{i}\sim U[\tau,1]$ .

3. For $i\in\mathcal{T}$ , we instead proceed as follows:

(a) We draw $\tilde{\kappa}\in\{0,\dotsc,m\tau\}$ from the following distribution:

[TABLE]

(b) We draw a set of indices $\mathcal{S}\subset\mathcal{T}$ of cardinality $\tilde{\kappa}$ uniformly at random from $\mathcal{T}$ .

(c) For $i\in\mathcal{S}$ , we draw $P_{i}\sim U[q_{\tilde{\kappa}-1},q_{\tilde{\kappa}}]$ .

(d) For $i\in\mathcal{T}\setminus\mathcal{S}$ , we draw $P_{i}\sim U[\alpha_{BY},\tau]$ .

Let us note that when $\tilde{\kappa}\geq 1$ , then $\left\lvert\mathcal{S}\right\rvert=\tilde{\kappa}$ and so there will be $\tilde{\kappa}$ p-values in the interval $[q_{\tilde{\kappa}-1},q_{\tilde{\kappa}}]$ . By (18) these p-values will be rejected, leading to an $\operatorname{FDP}$ equal to $1$ . The only situation in which we will make no rejections is on the event that $\tilde{\kappa}=0$ and so $\operatorname{FDP}\geq\mathbf{1}(\tilde{\kappa}\geq 1)$ . Thus:

[TABLE]

Note that for large enough $m$ this approaches $\alpha/\tau$ and so indeed, for $\tau<1$ , $\operatorname{FDR}$ is not controlled.

There remains one step to conclude the proof: we need to check that the p-values generated above indeed are all (marginally) uniform. Fix an arbitrary $i\in\{1,\dotsc,m\}$ . Note that conditionally on $\tilde{\kappa},\mathcal{S},\mathcal{T}$ , the distribution of the p-value $P_{i}$ is as follows:

[TABLE]

Let us compute the probabilities of the events above:

[TABLE]

This means that:

[TABLE]

This is precisely the uniform distribution, i.e., $P_{i}\sim U[0,1]$ .

Let us finally conclude by discussing how to extend this construction to the case of two independent folds. Let $m=2m^{\prime}$ for $m^{\prime}\in\mathbb{N}$ and assume that $m^{\prime}\tau\in\mathbb{N}$ . Let us take the two folds to be $I_{1}=[m^{\prime}]$ and $I_{2}=[m]\setminus[m^{\prime}]$ . We may apply the construction above independently to each fold. Now let $A_{\ell}$ , $\ell\in\left\{1,2\right\}$ be the event that BY rejects at least one hypothesis in fold $\ell$ , even after setting the p-values in the other fold to $1$ . Then repeating the arguments leading up to (19), we find that $\mathbb{P}[A_{\ell}]\geq\alpha^{\prime}/2$ , where $\alpha^{\prime}:=\alpha/\tau\cdot\log(m^{\prime}\tau+1)/(\log(2m^{\prime})+1)$ . Since $\operatorname{FDR}\geq\mathbb{P}[A_{1}\cup A_{2}]$ and the events $A_{1}$ and $A_{2}$ are independent, we find that $\operatorname{FDR}\geq\alpha^{\prime}/(\alpha^{\prime}+1)$ . This is strictly larger than $\alpha$ , for example when $\tau<1$ , $m^{\prime}$ is large and $\alpha$ is small.

∎

Supplement S2: Proofs for IHW-BH asymptotics

For our asymptotics, we make the following regularity assumption:

Assumption 3 (Regularity of conditional two-groups model).

The conditional two-groups model (6) satisfies:

(a)

$F_{\text{alt}}(t\mid X_{i}=x)$ is $L(x)$ -Lipschitz continuous in $t$ for all $x\in\mathcal{X}$ , i.e.,

[TABLE]

$L(\cdot)$ satisfies $\int{L^{2}(x)d\mathbb{P}^{X}(x)}<\infty$ and furthermore $F_{\text{alt}}(0\mid X_{i}=x)=0$ for all $x$ .

(b)

$F_{\text{alt}}(t\mid X_{i}=x)$ is strictly concave in $t$ for all $x$ .

(c)

There exists $t^{\prime}\in(0,1]$ such that $\frac{t^{\prime}}{F(t^{\prime}\mid X_{i}=x)}\leq\alpha^{\prime}$ for an $\alpha^{\prime}<\alpha$ and for all $x\in\mathcal{X}$ .

Part (a) is a technical assumption restricting the smoothness of $F_{\text{alt}}(\cdot\mid X_{i}=x)$ ; it allows for the smoothness to vary as $x\in\mathcal{X}$ varies. Part (b) is a common assumption in multiple testing; see also the discussion and references in Section 4. The assumption (in the setting without covariates) appears for example in Lemma 1 and Theorem 2 of Genovese et al. [2006]. Part (c) is also an assumption made for $\operatorname{FDR}$ asymptotics without covariates (e.g., it appears in Theorem 4 of Storey, Taylor, and Siegmund [2004]). It is, however, less innocuous than Parts (a,b); for example it excludes the global null and the case $\pi_{0}(x)=1$ .

Some remarks on notation:

In this section we use a different typeface for the weight function, i.e., we write $\mathscr{W}:\mathcal{X}\to\mathbb{R}_{\geq 0}$ and $\hat{\mathscr{W}}^{(I)}$ for the weight function learned based on data $((P_{i},X_{i}))_{i\in I}$ . This ensures that the notation is unambiguous and does not conflict with the notation used in Supplement S1 for finite-sample results. We also use the notation $a_{m}=o(1)$ for a deterministic sequence $a_{m}$ satisfying $a_{m}\to 0,\text{ as }m\to\infty$ and $Z_{m}=o_{\mathbb{P}}(1)$ for a sequence of random variables $Z_{m}$ that converges to $0$ in probability as $m\to\infty$ .

S2.1 Proof of Proposition 1(a)

Proof.

We first make a few assumptions on the data-generating mechanism (while making sure that Assumption 3 still holds). First, we assume that $\mathcal{X},\mathbb{P}^{X}$ are such that $X_{1},\dotsc,X_{m}$ are all distinct with probability $1$ ; this is true for example when $\mathbb{P}^{X}$ is absolutely continuous w.r.t. the Lebesgue measure on $\mathbb{R}^{p}$ . Next, we assume that for $\pi_{1}(x)=1-\pi_{0}(x)$ it holds that $\mathbb{E}\left[\pi_{1}(X_{i})\right]<\delta$ for some $\delta>0$ ; i.e., there are not too many alternative hypotheses. Finally, we assume that we run weighted BH at level $\alpha\in(0,1/2)$ .

Our application of naive weighted BH is as follows: We let $k_{m}=\lfloor\alpha m/2\rfloor$ and $\mathcal{J}_{m}$ the index set of the $k_{m}$ hypotheses with smallest p-values. We assign weight $m/k_{m}$ to these, and all other hypotheses receive weight $0$ . This is equivalent to applying BH directly to the $k_{m}$ smallest p-values (while ignoring their selection).

Formally, in terms of Specification 7, the weighting function takes the form:

[TABLE]

This satisfies the conditions of Specification 7: $\int\hat{\mathscr{W}}^{([m])}(x)d\mathbb{P}^{X}(x)=1$ almost surely for all $m$ and second, $\sup_{x\in\mathcal{X}}\hat{\mathscr{W}}^{([m])}(x)=m/k_{m}=m/\lfloor\alpha m/2\rfloor\leq 4/\alpha$ as soon as $\alpha m\geq 2$ , which is stronger than the requirement on the growth of $\int\hat{\mathscr{W}}^{([m])}(x)^{2}d\mathbb{P}^{X}(x)$ in (7) (this is a formal verification; condition (7) pertains to the out-of-sample behavior of the weighting function with respect to a fresh draw $X_{i}\sim\mathbb{P}^{X}$ ).

Writing $P_{(1)}\leq P_{(2)}\leq\dotsc\leq P_{(m)}$ for the order statistics of $P_{1},\dotsc,P_{m}$ , consider the events $A_{m}=\left\{P_{(k_{m})}\leq\alpha\right\}$ and $B_{m}=\left\{\sum_{i=1}^{m}H_{i}\leq 1.1\cdot\delta\cdot m\right\}$ .

On the event $A_{m}$ , weighted BH will reject all hypotheses in $\mathcal{J}_{m}$ , since by definition of $A_{m}$ it holds that $P_{(k_{m})}\leq\alpha=(k_{m}\cdot\alpha/m)\cdot(m/k_{m})=(k_{m}\cdot\alpha/m)\cdot W_{(k_{m})}$ . On the other hand, on the event $B_{m}$ , there will be at least $k_{m}-1.1\cdot\delta\cdot m$ false rejections (i.e., all rejections minus an upper bound on the number of alternative hypotheses). Thus on $A_{m}\cap B_{m}$ and for large enough $m$ (we slightly enlarge $2.2=2\cdot 1.1$ to $2.3$ to account for rounding in the definition of $k_{m}$ ):

[TABLE]

We will next argue that $\mathbb{P}\left[A_{m}\right],\mathbb{P}\left[B_{m}\right]\to 1$ as $m\to\infty$ and thus:

[TABLE]

The latter will in general be $>\alpha$ for small enough $\delta$ , so that naive weighted BH does not control $\operatorname{FDR}$ .

Let us prove the claims for $A_{m}$ and $B_{m}$ . For $B_{m}$ , the result follows by noting that $\sum_{i=1}^{m}H_{i}\sim\text{Binomial}(m,\mathbb{E}{\pi_{1}(X_{i})})$ , as well as an application of Chernoff’s bound. For $A_{m}$ , we note that by Assumption 3(b), it follows that $P_{(k_{m})}$ is stochastically smaller than $\tilde{P}_{(k_{m})}$ , defined as the $k_{m}$ -th smallest order statistic of a sample of $m$ i.i.d. uniform random variables $\tilde{P}_{1},\dotsc,\tilde{P}_{m}$ . Note that $\tilde{P}_{(k_{m})}$ is distributed as $\text{Beta}(k_{m},m+1-k_{m})$ which has expectation $k_{m}/(m+1)\leq\frac{\alpha}{2}$ . Hence:

[TABLE]

The last convergence follows from concentration of a Beta random variable (say, by an application of Chebyshev’s inequality).

∎
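The mechanism behind this proof can be illustrated with a small simulation. The Python sketch below uses the equivalence stated in the proof (naive weighted BH with weight $m/k_{m}$ on the $k_{m}$ smallest p-values amounts to plain BH at level $\alpha$ on those p-values) and counts the rejections. For simplicity it draws all p-values from the uniform distribution, i.e., the global null; this is an assumption of the sketch that Assumption 3(c) excludes, whereas the proof works with a small fraction $\delta$ of alternatives.

```python
import random

def bh_num_rejections(p_sorted, alpha):
    """Number of BH rejections at level alpha for p-values in ascending order."""
    n = len(p_sorted)
    k_hat = 0
    for k, pk in enumerate(p_sorted, start=1):
        if pk <= alpha * k / n:
            k_hat = k  # keep the largest k whose order statistic passes
    return k_hat

def naive_weighted_bh_rejections(p, alpha):
    """Select the k_m smallest p-values, then run BH on them ignoring the selection."""
    m = len(p)
    k_m = int(alpha * m / 2)      # floor(alpha * m / 2)
    selected = sorted(p)[:k_m]    # these carry weight m / k_m, all others weight 0
    return k_m, bh_num_rejections(selected, alpha)

def simulate(m=2000, alpha=0.25, seed=0):
    """Uniform (null) p-values: every rejection made here is a false rejection."""
    rng = random.Random(seed)
    p = [rng.random() for _ in range(m)]
    return naive_weighted_bh_rejections(p, alpha)
```

Whenever the $k_{m}$ -th smallest p-value is at most $\alpha$ (the event $A_{m}$ ), all $k_{m}$ selected hypotheses are rejected, so with only null hypotheses the false discovery proportion is $1$ .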

S2.2 Proof of Proposition 1(b)

Proof.

We first give a sketch of the proof:

1. Analysis for a single fold and a deterministic weighting function: This serves as a warm-up. The analysis is very similar to asymptotics e.g., in Storey et al. [2004], adapted to the setting with covariates and a weighting function.

2. Analysis for a single fold with data-driven weighting function learned out-of-fold: Here we refine the analysis from Step 1 to account for the data-driven nature of the weighting function. The fundamental nature of the arguments however is the same as in Step 1.

3. Aggregating results across folds: We give an equivalent formulation of the IHW-BH rejection rule in terms of empirical processes. Then, by combining results shown in Step 2, we demonstrate FDR control.

Single fold, deterministic weighting function:

We first study a single fold, say $I=I_{\ell}$ (that grows with $m$ ), and a deterministic weighting function with the following properties:

[TABLE]

We introduce notation for processes indexed by a threshold $t\in[0,1]$ , the weighting function $\mathscr{W}$ and the set $I\subset[m]$ indexing the hypotheses in the single fold under study.

[TABLE]

The goal will be to relate the empirical processes to their population counterparts through uniform (in $t$ ) laws of large numbers. We require one more definition to account for normalization of weights so that $\sum_{i\in I}W_{i}=\left\lvert I\right\rvert$

[TABLE]

Next pick a deterministic sequence $0<\varepsilon_{m}=o(1)\text{ as }m\to\infty$ such that $\mathbb{P}\left[\left\lvert\hat{c}_{I,\mathscr{W}}-1\right\rvert>\varepsilon_{m}\right]=o(1)$ ; such a sequence exists by the law of large numbers. Then for a $o_{\mathbb{P}}(1)$ term that is uniform in $t\in[0,1]$ , it holds that:

[TABLE]

$(i)$ follows by monotonicity of $R(t,\mathscr{W};I)$ in $t$ . $(ii)$ follows from the Glivenko-Cantelli theorem applied to the i.i.d. $P_{i}/\mathscr{W}(X_{i})$ (we set this ratio to $\infty$ if $\mathscr{W}(X_{i})=0$ ). $(iii)$ follows from Assumption 3(a), as follows: first note that $F(t\mid X_{i}=x)$ must be $\max\left\{1,L(x)\right\}$ -Lipschitz in $t$ as it is a convex combination of an $L(x)$ -Lipschitz function and a $1$ -Lipschitz function (the identity). Next

[TABLE]

Applying the same argument in the reverse direction we also get for the same (uniform in $t$ ) $o_{\mathbb{P}}(1)$ term:

[TABLE]

Combining the two results, noting that $R(t,\hat{c}_{\mathscr{W}}\cdot\mathscr{W};I)=R(\hat{c}_{\mathscr{W}}\cdot t,\mathscr{W};I)$ and by choice of $\varepsilon_{m}$ we conclude that:

[TABLE]

We may analogously prove that:

[TABLE]

Also note that:

[TABLE]

It also deterministically holds that $F_{0}(t,\mathscr{W})\leq F_{0}^{\text{BH}}(t,\mathscr{W})$ for all $t,\mathscr{W}$ and so

[TABLE]

Single fold, data-driven weighting function:

Above we worked with a deterministic weighting function $\mathscr{W}$ . However, for IHW we use the weighting function $\hat{\mathscr{W}}^{([m]\setminus I)}$ learned out-of-fold. It turns out that the conclusions hold verbatim, i.e.,

[TABLE]

To adapt the proof for deterministic $\mathscr{W}$ to a proof for data-driven $\hat{\mathscr{W}}^{([m]\setminus I)}$ (where $\hat{\mathscr{W}}^{([m]\setminus I)}$ depends on data outside of fold $I=I_{\ell}$ , cf. Specification 7) we make the following observations:

1. We conduct the analysis conditionally on data in the other folds $\mathcal{D}_{[m]\setminus I}=((P_{i},X_{i},H_{i}))_{i\in[m]\setminus I}$ . For example, to show the first of the convergence results for the data-driven weighting function displayed above, it suffices to show (see arguments below) that for a sequence $\eta_{m}\to 0$ :

[TABLE]

Such a conditional convergence statement also implies unconditional convergence (cf. Lemma 6.1 in Chernozhukov et al. [2017]), i.e.,

[TABLE]

The first of the convergence results for the data-driven weighting function then follows.

2. It can be assumed without loss of generality that $\int\hat{\mathscr{W}}^{([m]\setminus I)}(x)d\mathbb{P}^{X}(x)=1$ for all $m$ ; otherwise we may redefine the weight function as $\hat{\mathscr{W}}^{([m]\setminus I)}/\int\hat{\mathscr{W}}^{([m]\setminus I)}(x)d\mathbb{P}^{X}(x)$ . This is only a formal modification; the IHW-BH procedure applied remains the same, as the weights will subsequently be rescaled to sum to $\left\lvert I\right\rvert$ in fold $I$ (this is captured here by the multiplication with $\hat{c}_{I,\hat{\mathscr{W}}^{([m]\setminus I)}}$ ).

3. To establish (28), the argument used for a deterministic weighting function (e.g., in (22)) applies as long as we pay attention to controlling the two probabilistically negligible terms. In particular, we need to check that for (deterministic sequences) $\eta_{m}^{\prime},\eta_{m}^{\prime\prime}=o(1)$ that

[TABLE]

and

[TABLE]

In the deterministic case, the corresponding results were a consequence of the law of large numbers, respectively the Glivenko-Cantelli theorem. In the conditional case we may establish these results directly. For the first one we note that by Chebyshev’s inequality (conditionally on $\mathcal{D}_{[m]\setminus I}$ ) it holds almost surely for any $\delta>0$ that

[TABLE]

The conclusion follows. For the second result, we may replace the Glivenko-Cantelli theorem by an application of the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality conditionally on $\mathcal{D}_{[m]\setminus I}$ .

Aggregating results across folds:

Let us introduce some additional notation.

[TABLE]

This implies (the denominator of $\widehat{\operatorname{FDP}}^{\text{IHW}}(t)$ is right-continuous and piecewise constant in $t$ with jumps, while the numerator is continuous) that

[TABLE]

These quantities allow us to express IHW-BH from an empirical process viewpoint (cf. Storey et al. [2004]):

[TABLE]

Henceforth we make the additional assumption that $\alpha^{\prime}$ in Assumption 3(c) further satisfies $\alpha^{\prime}<\alpha/2$ ; this simplifies the step below but the proof goes through also for $\alpha^{\prime}<\alpha$ . With this simplification, we next argue that, for $t^{\prime\prime}=t^{\prime}/(4\Gamma)$ with $t^{\prime}$ defined in Assumption 3(c)

[TABLE]

Note that Assumption 3 implies that $F(t|X_{i}=x)\geq t$ for all $t\in(0,1)$ . Fixing a weighting function $\mathscr{W}$ as in (20)

[TABLE]

In the penultimate step we used the Paley–Zygmund inequality. This lower bound holds uniformly over weighting functions satisfying (20). In the current setting, this implies that for a constant $c>0$

[TABLE]

In conjunction with the uniform convergence results for the data-driven weighting function shown above, this yields (we need the preceding claim to make sure the denominators in the expression below do not vanish)

[TABLE]

Fixing again a $\mathscr{W}$ as in (20), we find using Cauchy-Schwarz and Markov’s inequality, that

[TABLE]

Next, we find that:

[TABLE]

Step $(i)$ follows from (35), step $(ii)$ follows by the definition of $\alpha^{\prime}$ in Assumption 3(c) and $(iii)$ from concavity of $F(\cdot\mid x)$ and the fact that $\mathscr{W}(x)/(4\Gamma)\leq 1$ on the stated event. Next, rearranging (36)

[TABLE]

Since this holds for an arbitrary weighting function (20), we also get that

[TABLE]

And so along with (34) we see that (33) holds:

[TABLE]

We are almost ready to prove $\operatorname{FDR}$ control. By (32), we see that the rejections of IHW-BH are precisely equal to:

[TABLE]

The false rejections are equal to:

[TABLE]

So:

[TABLE]

In the last step we used (31). We carry along the constraint $\mathbf{1}(\hat{t}^{\text{IHW}}\geq t^{\prime\prime})$ to emphasize that the denominator divided by $m$ will remain bounded from below with probability converging to $1$ . We conclude with the dominated convergence theorem ( $\text{FDR}^{\text{IHW}}=\mathbb{E}\left[\operatorname{FDP}^{\text{IHW}}\right]$ and $\operatorname{FDP}^{\text{IHW}}\in[0,1]$ ) that

[TABLE]

∎

S2.3 Proof of Proposition 1(c)

Proof.

Let us introduce the asymptotic threshold of both procedures:

[TABLE]

Assumption 3 ensures the existence of a unique $t^{*}(\mathscr{W}^{*})\in(t^{\prime\prime},1)$ for which equality is in fact attained, i.e., $F_{0}^{\text{BH}}(t^{*}(\mathscr{W}^{*}),\mathscr{W}^{*})/F(t^{*}(\mathscr{W}^{*}),\mathscr{W}^{*})=\alpha$ .

We use (11) as our definition for power; the results for other notions such as $1-\text{FNR}$ where FNR is the false nondiscovery rate are analogous. Our claim is that the power of both naive weighted BH and IHW-BH asymptotically is equal to

[TABLE]

We start by analyzing naive weighted BH. First, we may use the continuity and Glivenko–Cantelli arguments leading to (23) and (25), along with the assumption on uniform convergence (in probability) of $\hat{\mathscr{W}}^{([m])}$ , to show that

[TABLE]

The empirical process interpretation of naive weighted BH (analogous to (32) for IHW-BH) is as follows. Define

[TABLE]

Then the rejection rule of the naive weighted BH procedure takes the following form:

[TABLE]

Our next step is to show that $\hat{t}^{\text{Naive}}=t^{*}(\mathscr{W}^{*})+o_{\mathbb{P}}(1)$ . Fix any $\delta\in(0,t^{*}(\mathscr{W}^{*}))$ , then using Assumption 3 and (LABEL:eq:r_glivenko_cantelli_naive) we deduce that:

[TABLE]

Then, another application of (LABEL:eq:r_glivenko_cantelli_naive) and continuity properties of $F(\cdot\mid X_{i}=x)$ demonstrates that:

[TABLE]

By the law of large numbers: $\sum_{i=1}^{m}H_{i}/m=\int(1-\pi_{0}(x))d\mathbb{P}^{X}(x)+o_{\mathbb{P}}(1)$ . By definition,

[TABLE]

and so by dominated convergence (note the term within the expectation above is in $[0,1]$ ):

[TABLE]

The same argument also applies for IHW-BH, leveraging results proved already in part (b) of the proposition. In particular it follows as for naive weighted BH that also $\hat{t}^{\text{IHW}}=t^{*}(\mathscr{W}^{*})+o_{\mathbb{P}}(1)$ by using (LABEL:eq:r_glivenko_cantelli_loo) and Lipschitz continuity of $F(\cdot\mid x)$ . We also note in passing that under the assumptions of part (c), we could have omitted the conditional analysis required in part (b) to prove (LABEL:eq:r_glivenko_cantelli_loo). Instead, a more direct argument (along the lines of (22)) could be given by noting that the i.i.d. structure and convergence of the weighting mechanism imply that

[TABLE]

Cross-weighting derives its flexibility from guarantees established in part (b), that hold even if the above convergence property of the learned weight function does not hold. ∎

S2.4 Proof of Corollary 2

Let $\nu(x)=\mathbb{P}[X_{i}=x]$ for $x\in[G]$ . We assume without loss of generality that $\nu(x)>0$ for all $x\in[G]$ ; otherwise it suffices to restrict the covariate space to $[G]\setminus\{x\}$ .

In the setting with a categorical covariate, (7) is automatically satisfied for any weighting function $\mathscr{W}(x)\geq 0$ . To see this, first note that $\int\mathscr{W}(x)d\mathbb{P}^{X}(x)\geq\max_{x\in[G]}\left\{\mathscr{W}(x)\nu(x)\right\}$ . It also holds that

[TABLE]

Thus, (7) holds with $\Gamma=G\;\big{/}\min_{x\in[G]}\left\{\nu(x)\right\}$ . It remains to check part (c) of Proposition 1. Recall from Algorithms 1, 2, that the weighting rules take the form, for $x\in[G]$ :

[TABLE]

We need to exhibit the weighting function towards which the aforementioned weighting function converges under Assumption 3. To this end, let us note that:

[TABLE]

Next, define:

[TABLE]

Note that for $\tau\in(0,1)$ , by assumptions, $\pi_{0}(x)<1$ and $F_{\text{alt}}(\tau\,\big{|}\,x)>\tau$ and so $\pi^{*}_{0}(x)<1$ . To avoid dealing with the (unlikely in multiple testing applications) situation that $\pi^{*}_{0}(x)=0$ we further assume that either $\pi_{0}(x)>0$ (i.e., there are at least some null hypotheses) or $F_{\text{alt}}(\tau\,\big{|}\,x)<1$ . Thus henceforth we assume that $\pi^{*}_{0}(x)\in(0,1)$ .

The asymptotic weight function is $W^{*}(x)$ defined as

[TABLE]

Notice that indeed $\int W^{*}(x)d\mathbb{P}^{X}(x)=\sum_{g=1}^{G}\nu(g)W^{*}(g)=1$ . By application of the law of large numbers and the continuous mapping theorem, we may deduce that:

[TABLE]

We may conclude by noting that $\mathcal{X}=[G]$ is finite, and so $\widehat{W}(x)=W^{*}(x)+o_{\mathbb{P}}(1)$ for all $x\in[G]$ implies that

[TABLE]

Supplement S3: Multiple testing with local false discovery rates

Consider the conditional two-groups model (6) and assume that $F(t\mid x)$ has Lebesgue density $f(t\mid x)$ for all $x$ . Then define the conditional local fdr:

[TABLE]

We make two observations: First, for any threshold function $s:\mathcal{X}\to[0,1]$ , one may show that

[TABLE]

Equation (44) implies that we can estimate the mFDR of a procedure with decision threshold $s$ (i.e., of the procedure that rejects hypotheses that satisfy $P_{i}\leq s(X_{i})$ ) by

[TABLE]

Second, optimality considerations for multiple testing under model (6) dictate that hypotheses should be ranked by $\operatorname{fdr}(P_{i}|X_{i})$ [Sun and Cai, 2007, Cai and Sun, 2009]. Putting these two ideas together, we arrive at the oracle multiple testing procedure in Algorithm 4.

Such a procedure indeed controls the $\operatorname{FDR}$ [Cai and Sun, 2009], if the conditional two-groups model (6) is true and the oracle has access to the true model. Data-driven approximations to this procedure can be developed by plugging in estimates of the conditional densities $f(t\mid x)$ and $\pi_{0}(\cdot)$ [Cai and Sun, 2009]. Such a procedure can be shown to be asymptotically consistent, although no finite-sample results are available.
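Algorithm 4 is not reproduced in this text, so the following is only a minimal sketch of the rejection step commonly associated with Sun and Cai [2007]: sort hypotheses by their conditional local fdr and reject the largest prefix whose running average stays at or below $\alpha$ . The function name `lfdr_step_up` is hypothetical, and the inputs are assumed to be oracle (or plug-in) values of $\operatorname{fdr}(P_{i}|X_{i})$ computed elsewhere.

```python
def lfdr_step_up(lfdr_values, alpha):
    """Reject the hypotheses with the smallest local fdr values.

    Assumption: lfdr_values[i] holds fdr(P_i | X_i), computed elsewhere.
    Returns the sorted indices of the rejected hypotheses.
    """
    # Rank hypotheses from most to least promising (smallest local fdr first).
    order = sorted(range(len(lfdr_values)), key=lambda i: lfdr_values[i])
    running_sum = 0.0
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        running_sum += lfdr_values[i]
        # The running mean estimates the FDR among the first `rank` rejections.
        if running_sum / rank <= alpha:
            n_reject = rank
    return sorted(order[:n_reject])
```

Because the values are sorted in increasing order, the running mean is nondecreasing, so the loop simply finds the last rank at which the estimated FDR is still within budget.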

Supplement S4: Estimation and optimization of the conditional two-groups model

S4.1 The nonparametric Grenander estimator

Our application of the Grenander estimator [Grenander, 1956] to estimating the conditional two-groups model begins by binning the covariate $X_{i}$ ; for example through quantile-slicing or as the leaves of a tree. Henceforth we will assume $X_{i}$ is discrete and $X_{i}\in[G]$ .

S4.1.1 Estimation

To estimate $\widehat{F}^{-\ell}(\cdot\mid g)$ , $g\in[G]$ we first form the ECDF (empirical cumulative distribution function) of the p-values $P_{i}$ with $i\notin I_{\ell}$ and $X_{i}=g$ . Then we compute the least concave majorant of the ECDF. The latter operation can be computed quickly through weighted isotonic regression, as implemented for example in the gcmlcm function of the R package fdrtool [Strimmer, 2008a]. Furthermore, the computational complexity of fitting the Grenander estimator in one group is $O(m_{g}\cdot\log(m_{g}))$ , where $m_{g}=\#\{i:X_{i}=g\}$ , and so it is of order $O(m\cdot\log(m))$ for all the groups.
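To make the two steps concrete, here is a minimal Python sketch for a single group. It is not the fdrtool implementation: instead of weighted isotonic regression it uses a monotone upper-hull scan over the ECDF knots, which yields the same least concave majorant. The function names are hypothetical, and the p-values are assumed to lie in $[0,1]$ .

```python
def ecdf_knots(pvalues):
    """Knots (p_(i), i/n) of the ECDF, with (0, 0) and (1, 1) added as endpoints."""
    ps = sorted(pvalues)
    n = len(ps)
    knots = [(0.0, 0.0)]
    for i, p in enumerate(ps, start=1):
        if knots[-1][0] == p:
            knots[-1] = (p, i / n)  # tied p-values: keep the highest ECDF value
        else:
            knots.append((p, i / n))
    if knots[-1][0] < 1.0:
        knots.append((1.0, 1.0))
    return knots


def least_concave_majorant(knots):
    """Upper convex hull of knots that are sorted by strictly increasing x."""
    hull = []
    for x, y in knots:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop hull[-1] if it lies on or below the chord from hull[-2] to (x, y).
            if (y2 - y1) * (x - x1) <= (y - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull


def evaluate(hull, t):
    """Evaluate the piecewise-linear majorant at t by linear interpolation."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= t <= x2:
            return y1 + (y2 - y1) * (t - x1) / (x2 - x1)
    raise ValueError("t must lie between the first and last knot")


def slopes(hull):
    """Slopes of the linear pieces; these are nonincreasing for a concave function."""
    return [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(hull, hull[1:])]
```

The sort dominates the cost, so this sketch has the same $O(m_{g}\cdot\log(m_{g}))$ complexity per group as stated above; the hull scan itself is linear.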

The estimated $\widehat{F}^{-\ell}(\cdot\mid g)$ is a piecewise-linear, concave function. In particular (omitting the out-of-fold specification $(-\ell)$ from subsequent notation when it improves readability), for a finite index set $\mathcal{J}_{g}$ and real numbers $a_{j}^{g},b_{j}^{g}$ for $j\in\mathcal{J}_{g}$ it holds that

[TABLE]

For applications to $\operatorname{FDR}$ control we also need to estimate $\widehat{\pi}_{0}^{-\ell}(g)$ . We set it to $\widehat{\pi}_{0}^{-\ell}(g)=1$ in all our experiments. An alternative would be to apply a $\pi_{0}$ estimator developed in the setting without groups (such as the estimator of Storey et al. [2004]) to the p-values $P_{i}$ with $i\notin I_{\ell}$ and $X_{i}=g$ . This would yield $\widehat{\pi}_{0}^{-\ell}(g)$ .
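As an illustration of the alternative mentioned above, the following sketch applies a Storey-type estimator to the out-of-fold p-values of one group. It assumes the form $(\#\{P_{i}>\lambda\}+1)/(m_{g}(1-\lambda))$ , which is our reading of the estimator in Storey et al. [2004]; the tuning value $\lambda=0.5$ , the truncation at $1$ , and the function name are assumptions made for this example only.

```python
def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the null proportion from a list of p-values.

    Assumptions: lam = 0.5 is an illustrative default, and the estimate is
    truncated at 1 so that it remains a valid proportion.
    """
    m = len(pvalues)
    if m == 0:
        raise ValueError("need at least one p-value")
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    # p-values above lam are attributed (mostly) to null hypotheses.
    n_above = sum(1 for p in pvalues if p > lam)
    return min(1.0, (n_above + 1) / (m * (1.0 - lam)))
```

In the notation of this section, one would call the function once per group $g$ with the p-values satisfying $i\notin I_{\ell}$ and $X_{i}=g$ .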

S4.1.2 Optimization through linear programming

Using the Grenander estimator simplifies subsequent optimization in two ways: first, the optimization variable is $G$ -dimensional –instead of $\left\lvert I_{\ell}\right\rvert$ -dimensional– as all $i\in I_{\ell}$ with $X_{i}=g$ receive the same weight. Second, (46) enables us to cast the underlying convex optimization problems as linear programs by introducing additional variables $\bar{F}_{g}\in[0,1]$ ( $g=1,\dotsc,G$ ).

Optimization (8) for $k$ -Bonferroni:

Let $k_{\alpha}=\alpha k/m$ . We then solve the following linear program (LP) with optimization variables $(w_{g},\bar{F}_{g}),\;g=1,\dotsc,G$ :

[TABLE]

In our implementation, we solve this LP problem with the open-source SYMPHONY/Clp solver of the COIN-OR project [Lougee-Heimer, 2003] with the interface provided by the R/Bioconductor package lpsymphony [Kim, 2020]. Hypothesis $i$ in fold $I_{\ell}$ with covariate $X_{i}=g$ is then assigned weight $w_{g}$ , where ( $w_{1},\dotsc,w_{G}$ ) is the optimal weight vector of the above optimization problem.

Optimization (10) for BH:

The optimization here is similar with the difference that we optimize directly over the thresholds $t_{g},\;g=1,\dotsc,G$ and we also enforce the plug-in FDR constraint. Concretely, we solve the linear program with optimization variables $(t_{g},\bar{F}_{g}),\;g=1,\dotsc,G$ :

[TABLE]

Letting $(t_{1},\dotsc,t_{G})$ denote the optimal threshold vector, we then let

[TABLE]

unless all $t_{g}=0$ , in which case we set all $w_{g}=1$ . Here $w_{g}$ is the weight assigned to hypotheses $i\in I_{\ell}$ with $X_{i}=g$ .

Convex constraints on the weights:

For both linear programs (47) and (48) it is possible to incorporate additional linear constraints (so that the problems remain linear programs) that enforce weight functions of lower complexity. A concrete example is to enforce low total variation of the weight vector $(w_{1},\dotsc,w_{G})$ i.e., to enforce $\sum_{g=2}^{G}\left\lvert w_{g}-w_{g-1}\right\rvert\leq\lambda$ , for $\lambda\geq 0$ . This may be directly incorporated into (47). We may also add this constraint to problem (48) in terms of $t_{1},\dotsc,t_{G}$ as follows

[TABLE]

Throughout this work we set $\lambda=\infty$ (i.e., we do not add the above total variation constraints), with the exception of the application described in Supplement S5.

S4.1.3 Direct optimization

Here we describe an alternative optimization scheme that does not require the use of a linear programming solver and has guarantees on its computational complexity. For our numerical examples, however, the linear programming approach is fast enough.

We describe our algorithm for solving the $k$ -Bonferroni objective (8); the steps for the BH objective (48) are similar.

Let (46) be the fitted Grenander estimator in group $g$ . Let the non-zero slopes in group $g$ be sorted as $b_{1}^{g}>\dotsc>b_{\left\lvert\mathcal{J}_{g}\right\rvert}^{g}=0$ and let $s_{j}^{g},j\in\mathcal{J}_{g}$ be the points at which the slope changes, i.e., the slope is equal to $b_{j}^{g}$ in the interval $(s_{j-1}^{g},s_{j}^{g})$ . At the boundaries we define $s_{0}^{g}=0$ and $s_{\left\lvert\mathcal{J}_{g}\right\rvert}^{g}=1$ . Further, consider the set:

[TABLE]

Algorithm 5 provides a computational routine for optimizing the objective (8) with computational complexity upper bounded by $O(m\cdot\log(m))$ .

The following proof verifies the correctness of the algorithm above and the worst-case computational complexity.

Proof.

We first need to check that the algorithm terminates, i.e., that there exists a $\lambda^{*}$ so that $1\in[\operatorname{WeightBudget}_{\ell}(\lambda),\operatorname{WeightBudget}_{u}(\lambda)]$ . To this end, note that if we choose $\lambda=\max\mathcal{B}$ , then all $\ell_{g}(\lambda)=0$ and so $\operatorname{WeightBudget}_{\ell}(\lambda)=0$ . On the other hand, letting $\lambda=0$ , then we can pick all $u_{g}(\lambda)=1$ , i.e., $\operatorname{WeightBudget}_{u}(\lambda)=1/k_{\alpha}>1$ . It remains to observe that for adjacent $\lambda_{j},\lambda_{j+1}$ in $\mathcal{B}$ , it holds that

[TABLE]

and also that for all $\lambda$ , $\operatorname{WeightBudget}_{\ell}(\lambda)\leq\operatorname{WeightBudget}_{u}(\lambda)$ . As the algorithm terminates, we may now check computational complexity. First, note that $\left\lvert\mathcal{B}\right\rvert=O(m)$ since the Grenander estimator can only jump at support points of the per-group empirical distribution function. Thus the initial sorting step of $\mathcal{B}$ requires at most $O(m\log(m))$ operations. The ‘while’ loop of the algorithm proceeds by bisection of $\mathcal{B}$ , hence comprises at most $O(\log(m))$ iterations, and the cost of each iteration step is at most $O(m)$ . Computation after the while loop is negligible ( $O(G)$ operations). Thus, the total complexity of this algorithm is at most $O(m\log(m))$ .

Second, we need to check the Karush–Kuhn–Tucker (KKT) [Rockafellar, 1970] conditions for convex programming to verify the optimality of the weights returned by Algorithm 5. Let $\bm{\nu}=(\nu_{1},\dotsc,\nu_{G})$ be the dual variables corresponding to the non-negativity constraints and $\lambda$ the dual variable corresponding to the weight-budget constraint. The Lagrangian takes the form

[TABLE]

We seek to specify dual and primal optimal variables. We set the dual $\lambda^{*}$ and primal $w_{g}^{*}$ as described in the last steps of Algorithm 5. For the dual variables corresponding to the non-negativity constraints, we set $\nu^{*}_{g}=0$ if $w_{g}^{*}>0$ and $\nu_{g}^{*}=\tilde{m}_{g}\cdot k_{\alpha}\left(\lambda^{*}-b_{1}^{g}\right)$ if $w_{g}^{*}=0$ . Complementary slackness thus holds by construction. Furthermore, note that when $w_{g}^{*}=0$ , then Algorithm 5 ensures that $\lambda^{*}\geq b_{1}^{g}$ and so $\nu_{g}^{*}\geq 0$ . Hence for all $g$ , $\nu_{g}^{*}\geq 0$ and so dual feasibility holds. Primal feasibility, i.e., $w_{g}^{*}\geq 0$ and $\sum\tilde{m}_{g}w_{g}^{*}=\tilde{m}$ , also holds by construction.

It remains to check that stationarity holds. Let us take the superdifferential of the Lagrangian along the $g$ -th coordinate, where we keep $g\in[G]$ fixed.

[TABLE]

We next distinguish two cases according to the value of $w_{g}^{*}$ .

Case 1, $w_{g}^{*}=0$ : In this case, $b_{1}^{g}\in\partial\widehat{F}^{-\ell}\left(w_{g}^{*}\cdot k_{\alpha}\,\big{|}\,g\right)$ and $\nu_{g}^{*}$ is defined precisely so that $0\in\partial_{g}L(\mathbf{w}^{*},\lambda^{*},\mathbf{\nu}^{*})$ .

Case 2, $w_{g}^{*}>0$ : First let us quickly study $\partial\widehat{F}^{-\ell}(t\,\big{|}\,g)$ for $t\in(0,1)$ . If $t=s_{j}^{g}$ for $j\in\mathcal{J}_{g}$ , then $\partial\widehat{F}^{-\ell}(t\,\big{|}\,g)=[b_{j+1}^{g},b_{j}^{g}]$ and if $t\in(s_{j-1}^{g},s_{j}^{g})$ , then $\partial\widehat{F}^{-\ell}(t\,\big{|}\,g)=\left\{b_{j}^{g}\right\}$ . In both cases, it holds that $\lambda^{*}\in\partial\widehat{F}^{-\ell}(t\,\big{|}\,g)$ and so, since $\nu_{g}^{*}=0$ , it again follows that $0\in\partial_{g}L(\mathbf{w}^{*},\lambda^{*},\mathbf{\nu}^{*})$ . ∎

S4.2 Beta-uniform mixture GLM

In this section we consider the conditional two-groups model (6) with parametrization (LABEL:eq:betamix_simulation_model), where we assume throughout that $\beta(x)<1$ and $0<\pi_{0}(x)<1$ hold strictly. We first explain how to estimate the parameters of the conditional two-groups model given access to $X_{i}$ and censored p-values ( $P_{i}\mathbf{1}(P_{i}>\tau)$ ) and then we explain the optimization procedure for deriving optimal weights.

As a preliminary step, we introduce explicit notation for the CDF and pdf of the $\text{Beta}(\beta,1)$ distribution

[TABLE]

S4.2.1 Estimation

In this section we let $Y_{i}=-\log(P_{i})$ . Our goal is estimation based on the censored data outside fold $\ell$ , i.e., $D_{-\ell}(\tau)=((X_{i},P_{i}\mathbf{1}(P_{i}>\tau)))_{i\in I_{\ell}^{c}}$ . (The case $\tau=0$ corresponds to no censoring. Without cross-weighting we use the data corresponding to all indices $i=1,\dotsc,m$ .) We will proceed by maximum likelihood estimation and optimize the (non-convex) objective through the EM algorithm. The full-data (i.e., if we could observe $((X_{i},P_{i},H_{i}))_{i\in I_{\ell}^{c}}$ ) log-likelihood decouples into the sum of the log-likelihoods of two generalized linear models (GLMs): a binomial GLM and a Gamma GLM (cf. (16) in Lei and Fithian [2018])

[TABLE]

During the $r$ -th iteration of EM, we keep track of the imputed data (E-step; more below) $\hat{Y}^{(r)}$ , $\hat{H}^{(r)}$ . Furthermore in principle we should keep track of the parameters $\hat{a}_{0}^{(r)},\hat{a}^{(r)},\hat{b}_{0}^{(r)},\hat{b}^{(r)}$ . Instead, we keep track of $\hat{\pi_{0}}^{(r)}(x)=\operatorname*{expit}(-\hat{a}_{0}^{(r)}-\hat{a}^{(r)\top}x)$ and $\hat{\beta}^{(r)}(x)=\hat{b}_{0}^{(r)}+\hat{b}^{(r)\top}x$ both evaluated at $X_{i},i\in I_{\ell}^{c}$ .

We now describe the details of the EM algorithm.

E-step:

For the $r$ -th E-step, we need to compute:

[TABLE]

This boils down to computing

[TABLE]

and plugging these into (51) in lieu of $H_{i},Y_{i}$ .

– $\hat{H}_{i}^{(r)}$ update:

[TABLE]

– $\hat{Y}_{i}^{(r)}$ update:

[TABLE]

M-step:

As already alluded to, the M-step consists of fitting two GLMs, a (quasi)binomial GLM ( $\hat{H}_{i}^{(r)}\in[0,1]$ takes on fractional values) and a weighted Gamma GLM. In R pseudocode, the M-step is as follows:

–

[TABLE]

–

[TABLE]

In this step we also seek to ensure $\beta(x)<1,0<\pi_{0}(x)<1$ (so that strict concavity of the estimated p-value distribution holds conditionally on all $x$ ). To this end we introduce parameters $\pi_{0,\text{min}},\pi_{0,\text{max}}$ and $\beta_{\text{max}}$ and clamp the $\beta(x),\pi_{0}(x)$ estimates ( $\operatorname{clamp}(x;a,b)=\max\left\{\min\left\{x,b\right\},a\right\}$ ) to the above ranges. In our implementation we use $\pi_{0,\text{min}}=0.1,\pi_{0,\text{max}}=0.99$ and $\beta_{\text{max}}=0.9$ (we have not needed to lower bound $\beta$ in our experiments).
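The clamping step can be sketched as follows. The bounds are the values quoted above; the function and constant names are hypothetical, and the sketch assumes the fitted $\hat{\pi}_{0}(X_{i})$ and $\hat{\beta}(X_{i})$ are available as plain lists.

```python
PI0_MIN, PI0_MAX = 0.1, 0.99  # bounds for the pi_0(x) estimates quoted in the text
BETA_MAX = 0.9                # upper bound for the beta(x) estimates quoted in the text


def clamp(x, a, b):
    """clamp(x; a, b) = max(min(x, b), a)."""
    return max(min(x, b), a)


def clamp_estimates(pi0_hat, beta_hat):
    """Clamp fitted pi_0(X_i) and beta(X_i) values after an M-step.

    Only an upper bound is applied to beta, matching the remark that no
    lower bound was needed in the experiments.
    """
    pi0 = [clamp(p, PI0_MIN, PI0_MAX) for p in pi0_hat]
    beta = [min(b, BETA_MAX) for b in beta_hat]
    return pi0, beta
```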

Initialization:

– $\hat{Y}^{(0)}$ : We initialize $\hat{Y}_{i}$ by $Y_{i}$ if $P_{i}$ is not censored and by $-\log(\tau/2)$ otherwise.

[TABLE]

– $\hat{\pi}_{0}^{(0)}$ : The $\hat{\pi}_{0}(X_{i})$ are initialized through the procedure of Boca and Leek [2018]. First, let $\tau^{\text{BL}}\geq\tau$ ; in our simulations we use $\tau^{\text{BL}}=0.5$ . Then we fit a logistic regression of $\mathbf{1}\left(P_{i}\geq\tau^{\text{BL}}\right)$ onto $X_{i}$ , let $\hat{\mathbb{P}}[P_{i}\geq\tau^{\text{BL}}\mid X_{i}]$ be the fitted probabilities, and finally we set

[TABLE]

– $\hat{H}^{(0)}$ : We first compute the adjusted p-values $\text{adj}P_{i}$ of the BH procedure applied to $P_{i}\lor\tau$ (i.e., in R pseudocode: p.adjust(pmax(Ps, tau), method="BH")) and then we set:

[TABLE]
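For readers working outside R, the adjusted p-values used in the initialization of $\hat{H}^{(0)}$ can be sketched in plain Python. The function below is intended to mirror the BH branch of R's p.adjust (a reverse cumulative minimum of $m\cdot p_{(j)}/j$ , capped at $1$ ); the function names are hypothetical, and the mapping from adjusted p-values to $\hat{H}^{(0)}$ is not reproduced here.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest so a running minimum enforces monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # rank of p-value i in increasing order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def censored_bh_adjust(pvalues, tau):
    """BH adjustment applied to max(P_i, tau), i.e., to the censored p-values."""
    return bh_adjust([max(p, tau) for p in pvalues])
```

Censoring at $\tau$ means that the adjustment never uses information about p-values below $\tau$ beyond the fact that they are at most $\tau$ .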

Output:

Let $r^{*}$ be the final iteration of the EM algorithm. We keep $\hat{\beta}^{-\ell}(\cdot)=\hat{\beta}^{(r^{*})}(\cdot)$ and $\hat{\pi}_{0}^{-\ell}(\cdot)=\hat{\pi}_{0}^{(r^{*})}(\cdot)$ . These fully specify the estimated conditional distribution

[TABLE]

When learning weights for IHW-BH as in (10), we set the estimated conditional distribution as above, while keeping $\hat{\pi}_{0}^{-\ell}(\cdot)=1$ instead of using the output from the EM algorithm. An alternative, following Markitsis and Lai [2010], would be to set

[TABLE]

S4.2.2 Optimization

The estimated conditional distributions and densities take the form:

[TABLE]

Optimization (8) for $k$ -Bonferroni:

We seek to maximize $\sum_{i\in I_{\ell}}\widehat{F}^{-\ell}(k_{\alpha}\cdot w_{i}\mid X_{i})$ subject to $w_{i}\geq 0$ , $\sum_{i\in I_{\ell}}w_{i}=\left\lvert I_{\ell}\right\rvert$ , where $k_{\alpha}=\alpha k/m$ . This is a convex optimization problem and furthermore strong duality holds, e.g., by Slater’s condition (also note that the program is feasible; take $w_{i}=1$ ).

Assume momentarily that the optimizer satisfies $w_{i}>0$ for all $i$ . Let $\lambda$ be the Lagrange multiplier corresponding to the constraint $\sum_{i\in I_{\ell}}w_{i}=\left\lvert I_{\ell}\right\rvert$ . Then, differentiating the Lagrangian with respect to $w_{i}$ , we see that it must hold that:

[TABLE]

So:

[TABLE]

Since $\hat{\beta}^{-\ell}(X_{i})<1$ and $\hat{\pi}^{-\ell}_{0}(X_{i})<1$ by our estimation procedure, we may solve the equation above analytically for $w_{i}>0$ . We call this solution $w_{i}(\lambda)$ . Then we use bisection over $\lambda$ to find $\lambda^{*}$ such that the equality constraint is satisfied, i.e., $\sum_{i\in I_{\ell}}w_{i}(\lambda^{*})=\left\lvert I_{\ell}\right\rvert$ . Then the optimizing weights are $w_{i}=w_{i}(\lambda^{*})$ .

We may derive the computational complexity of the optimization step as follows: We can minimize the Lagrangian analytically in $O(m)$ operations. To find the optimal dual variable $\lambda^{*}$ we need to use bisection. Thus, we need roughly $O(m\cdot\log(1/\delta))$ operations, where $\delta$ is a parameter controlling tolerance (accuracy).
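The scheme above can be sketched as follows, assuming the estimated conditional density is the beta-uniform mixture $\hat{\pi}_{0}(x)+(1-\hat{\pi}_{0}(x))\hat{\beta}(x)t^{\hat{\beta}(x)-1}$ , using that the $\text{Beta}(\beta,1)$ density is $\beta t^{\beta-1}$ . Stationarity then reads $f(k_{\alpha}w_{i}\mid X_{i})=\mu$ with $\mu=\lambda/k_{\alpha}$ , which has a closed-form solution in $w_{i}$ . The function names and the bracketing strategy are illustrative assumptions, and the sketch does not enforce $k_{\alpha}w_{i}\leq 1$ .

```python
def weight_at(mu, pi0, beta, k_alpha):
    """Solve pi0 + (1 - pi0) * beta * (k_alpha * w)**(beta - 1) = mu for w > 0.

    Requires mu > pi0, 0 < pi0 < 1 and 0 < beta < 1.
    """
    ratio = (mu - pi0) / ((1.0 - pi0) * beta)
    return ratio ** (1.0 / (beta - 1.0)) / k_alpha


def bonferroni_weights(pi0s, betas, k_alpha, iterations=200):
    """Bisection over the rescaled multiplier mu so that the weights sum to n."""
    n = len(pi0s)
    floor = max(pi0s)  # mu must exceed every pi0 for all weights to be positive

    def total(mu):
        return sum(weight_at(mu, p, b, k_alpha) for p, b in zip(pi0s, betas))

    # The total weight is decreasing in mu: find a bracket [lo, hi] around the root.
    gap_hi = 1.0
    while total(floor + gap_hi) > n:
        gap_hi *= 2.0
    gap_lo = gap_hi
    while total(floor + gap_lo) < n:
        gap_lo /= 2.0
    lo, hi = floor + gap_lo, floor + gap_hi
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if total(mid) > n:
            lo = mid  # too much weight handed out: increase mu
        else:
            hi = mid
    mu_star = 0.5 * (lo + hi)
    return [weight_at(mu_star, p, b, k_alpha) for p, b in zip(pi0s, betas)]
```

Each evaluation of the total weight costs $O(m)$ , and the number of bisection steps plays the role of $\log(1/\delta)$ in the complexity statement above.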

Optimization (10) for BH:

Here we seek to maximize $\sum_{i\in I_{\ell}}\widehat{F}^{-\ell}(t_{i}\mid X_{i})$ over $t_{i}\geq 0$ subject to $\sum_{i\in I_{\ell}}\hat{\pi}^{-\ell}_{0}(X_{i})t_{i}\leq\alpha\sum_{i\in I_{\ell}}\widehat{F}^{-\ell}\left(t_{i}\mid X_{i}\right)$ . We may directly verify the conditions of Theorem 2 in Lei and Fithian [2018] (which ensures strong duality) and conclude that there exists $\lambda\in(0,1)$ such that at the optimal solution:

[TABLE]

Here $\operatorname{fdr}^{-\ell}(t_{i}\mid X_{i})$ is defined as in (43) with population quantities replaced by estimated ones. Rearranging, this implies that:

[TABLE]

As already described for $k$ -Bonferroni, for each fixed $\lambda$ we may solve the above expression analytically for $t_{i}$ , say by $t_{i}(\lambda)>0$ . Then it only remains to use bisection to find $\lambda^{*}$ such that

[TABLE]

Finally, hypothesis $i\in I_{\ell}$ is assigned weight $W_{i}=\left\lvert I_{\ell}\right\rvert\cdot t_{i}(\lambda^{*})\big{/}\sum_{j\in I_{\ell}}t_{j}(\lambda^{*})$ .

We note that here, just as for $k$ -Bonferroni, the computational complexity scales as $O(m\cdot\log(1/\delta))$ operations, where $\delta$ is a parameter controlling tolerance (accuracy).
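The final normalization $W_{i}=\left\lvert I_{\ell}\right\rvert\cdot t_{i}(\lambda^{*})/\sum_{j\in I_{\ell}}t_{j}(\lambda^{*})$ is a one-liner; a small sketch follows. The all-zero fallback to unit weights is borrowed from the Grenander-based linear program in Supplement S4.1.2 and is an assumption in this setting, as is the function name.

```python
def thresholds_to_weights(thresholds):
    """Rescale per-hypothesis thresholds t_i into weights that average to one."""
    n = len(thresholds)
    total = sum(thresholds)
    if total == 0:
        # Assumption: fall back to unit weights when every threshold is zero.
        return [1.0] * n
    return [n * t / total for t in thresholds]
```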

Supplement S5: More details on the data application of Section 6

For the hQTL example, we used the dataset described in Grubert et al. [2015] and looked for associations between SNPs and the histone modification mark (H3K27ac) on human Chromosomes 1 and 2. p-values for association were calculated using Matrix eQTL [Shabalin, 2012].

As a covariate we used the linear genomic distance between the SNP and the ChIP-seq signal, which we discretized using non-uniform binning: the bins corresponded to genomic segments of length $10$ kb (kilobase) up to $300$ kb (i.e., the categories were $0-10$ kb, $10-20$ kb, $\dotsc$ , $290-300$ kb), to segments of length $100$ kb up to $1$ Mb and finally to segments of length $10$ Mb for the rest of the hypotheses. The longest genomic distance between SNPs and H3K27ac was approximately equal to $24$ Mb.
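A minimal sketch of this binning, mapping a distance in base pairs to a zero-based bin index, is given below. Several details are assumptions made for illustration: distances are non-negative integers, intervals are half-open, and the $10$ Mb bins are anchored at $1$ Mb. The function name is hypothetical.

```python
KB = 1_000
MB = 1_000_000


def distance_bin(distance_bp):
    """Zero-based bin index for a SNP-to-peak distance given in base pairs."""
    if distance_bp < 0:
        raise ValueError("distance must be non-negative")
    if distance_bp < 300 * KB:
        # 30 bins of width 10 kb: [0, 10 kb), ..., [290 kb, 300 kb)
        return distance_bp // (10 * KB)
    if distance_bp < MB:
        # 7 bins of width 100 kb: [300 kb, 400 kb), ..., [900 kb, 1 Mb)
        return 30 + (distance_bp - 300 * KB) // (100 * KB)
    # Bins of width 10 Mb for all remaining distances (anchored at 1 Mb by assumption).
    return 37 + (distance_bp - MB) // (10 * MB)
```

Under these assumptions, a maximal distance of about $24$ Mb gives $40$ bins in total.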

For the application of IHW-BY (cf. Theorem 4), we split hypotheses into two folds corresponding to the two chromosomes. Honest weights are learned within each fold with the strategy described in Section 4.2 and Supplement S4.1 based on the Grenander estimator. Note that we set $\hat{\pi}_{0}^{-\ell}=1$ . Furthermore, we apply a mild constraint on the total variation of the learned weights, i.e., we include the constraint (49) with $\lambda=2000$ in the linear programming problem (48).

Supplement S6: Choice and examples of informative covariates

Covariates that can take the role of $X_{i}$ in the conditional two-groups model (6) are available in many multiple testing applications of practical interest, and in this section we discuss a range of examples. We will group them into domain-specific and statistical covariates. Whereas the former derive from an understanding of the data-generating process, the latter reflect mathematical properties of the specific test procedure used to compute the p-values. Domain-specific covariates are often informative about prior probabilities (i.e., the function $\pi_{0}(x)$ depends on $x$ ), statistical covariates about the power of the test and thus the shape of the alternative distribution function $F_{\text{alt}}(\cdot\mid X_{i}=x)$ . The categorization is informal, loose and partially overlapping.

For a given application, there will often be more than one possible choice of covariate. In our formulation of the conditional two-groups model (6), we assume for simplicity of notation that $X_{i}$ is either one particular choice, or the combination of several original covariates into a single “effective” covariate, e.g., by taking the Cartesian product. The details of how to select or combine will depend on the application and the data and are beyond the scope of this paper.

S6.1 Domain-specific covariates

In many scientific applications, informative covariates are apparent to domain scientists due to mechanistic insight or prior experience. Examples include:

• Genomic distance between SNPs and peaks. This is the covariate in our motivating example in Figure 1 and Section 6. The p-values are from testing the association between SNPs and H3K27ac peak heights across different individuals from the human population. The choice of covariate is motivated by the expectation that many of the true instances where a DNA polymorphism affects a H3K27ac peak are short-range, so that $\pi_{0}$ for hypotheses with a short distance is smaller than for those where SNP and peak are far apart.

• Physical distance between pairs of firing neurons. It is now possible to simultaneously measure the activity of many neurons, and there is interest in determining whether two neurons are firing in synchrony [Scott et al., 2015]. We know that neurons in close proximity are a priori more likely to be interacting; thus, the distance between neurons can be used as a covariate for association tests between pairs of neurons.

• Gene expression patterns in nearby genetic variants. Genome-wide association studies (GWAS) look for statistical associations between genetic variants in a population and the prevalence of a disease. Once discovered, such an association can be the basis for a follow-up mechanistic study. Sample size and power tend to be limiting bottlenecks of many GWAS due to multiple testing and to the study’s expense. Power can be increased by considering (phenotype-unrelated) gene expression patterns around the loci of the genetic variants [Baillie et al., 2018].

• P-values from a distinct but related experiment. For example, Fortney et al. [2015] used data from previous, independent GWAS for related diseases to increase the power of a GWAS of a longevity phenotype.

In a different context—multivariate regression rather than hypothesis testing—the widespread existence of such covariates was observed by van De Wiel et al. [2016], who used the term “co-data” for them and developed a weighted ridge regression procedure, with data-driven penalization weights.

S6.2 Statistical covariates

In single hypothesis testing, classical theory [Lehmann and Romano, 2005] dictates that the whole dataset should be reduced to a sufficient statistic, which in turn can be used to derive the best test statistic under optimality considerations. Everything else can be discarded or should be conditioned on. This data compression comes without any loss of statistical power.

However, the $m$ resulting p-values for the individual tests are in general not able to capture how one should weigh the hypotheses relative to each other to arrive at an optimal multiple testing protocol [Storey, 2007]. The consequence is that information irrelevant for single hypothesis testing can be embedded in the conditional two-groups framework and can help increase the power of the resulting multiple testing procedure; sometimes dramatically so.

S6.2.1 Sample size

A generic covariate, likely to be useful whenever it differs across tests, is the sample size $N_{i}$ . Note that if the test statistic is continuous and the null hypothesis is simple, then the p-value $P_{i}$ under the null is uniformly distributed independently of $N_{i}$ . Often, there is no reason to expect that the prior probability of a hypothesis being true depends on $N_{i}$ . However, the alternative distribution will depend on $N_{i}$ : for higher sample size, we have more power.

A simple, but generic and instructive example is as follows: consider a series of one-sided $z$ -tests in which we observe independent $Y_{1}^{i},\dotsc,Y_{N_{i}}^{i}\sim\mathcal{N}(\mu_{i},1)$ , where $\mu_{i}>0$ if $H_{i}=1$ and $\mu_{i}=0$ otherwise. We can use $P_{i}=1-\Phi\left(N_{i}^{1/2}\;\overline{Y^{i}}\right)$ as our statistic, where $\overline{Y^{i}}$ is the sample average of $Y_{1}^{i},\dotsc,Y_{N_{i}}^{i}$ . Then the alternative distribution of the $i$ -th test is

[TABLE]

Now consider the case in which $\pi_{0,i}=\pi_{0}$ and $\mu_{i}=\mu H_{i}\;\forall i$ , i.e., a common prior probability and a common effect size. In this case, Equation (52) leads to the conditional two-groups model with covariate $N_{i}$ and $F_{\text{alt},i}(t)=F_{\text{alt}}(t\mid N_{i})$ . Then, to maximize discoveries and thus power, hypotheses with large sample sizes $N_{i}$ should be prioritized. The methods described here are able to accomplish this automatically.
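To see numerically how the alternative distribution depends on $N_{i}$ , note that $P_{i}\leq t$ exactly when $N_{i}^{1/2}\,\overline{Y^{i}}\geq\Phi^{-1}(1-t)$ , and that $N_{i}^{1/2}\,\overline{Y^{i}}\sim\mathcal{N}(N_{i}^{1/2}\mu_{i},1)$ . This gives $F_{\text{alt},i}(t)=1-\Phi\left(\Phi^{-1}(1-t)-N_{i}^{1/2}\mu_{i}\right)$ , which should agree with Equation (52) up to notation (the display itself is not reproduced here). The sketch below evaluates this expression with Python's standard library; the function name is hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def alt_cdf(t, mu, n):
    """P[P_i <= t] for a one-sided z-test with effect size mu and sample size n.

    Requires 0 < t < 1.
    """
    critical_value = _STD_NORMAL.inv_cdf(1.0 - t)
    return 1.0 - _STD_NORMAL.cdf(critical_value - (n ** 0.5) * mu)
```

With $\mu=0$ the function returns $t$ , matching the uniform null distribution, and for fixed $\mu>0$ it increases with $n$ , which is why hypotheses with larger $N_{i}$ are easier to reject at a given threshold.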

Remark 1.

At this point, readers might ask themselves whether this is desirable – since, in practice, different effect sizes $\mu_{i}$ may be present. Prioritizing hypotheses with large sample sizes $N_{i}$ will lead to a trade-off where some discoveries with smaller $N_{i}$ but higher $\mu_{i}$ are missed, for the benefit of making more discoveries with larger $N_{i}$ but smaller $\mu_{i}$ . Yet, the former might be more valuable to us. In a way, one can draw analogies to the streetlight effect: if we have lost our keys during a walk at night and have no idea where it happened, it makes sense to start searching under the streetlight, where it is easiest to see. However, if we do have guesses where we might have dropped them, it makes sense to combine these guesses with the ease of seeing in each place to arrive at an optimal search schedule.

Remark 2.

The optimal weights are not necessarily a monotonic function of the sample size. With IHW, it is possible that hypotheses with covariates associated with very large sample size (or effect size) are down-weighted relative to more intermediate hypotheses. This phenomenon is called size-investing [Roeder et al., 2007, Peña et al., 2011, Ignatiadis et al., 2016]. The intuition is that higher weights should be preferentially allocated where they make most difference – and little to hypotheses that are anyway exceedingly easy or hard to reject.

S6.2.2 Overall variance (independent of label) in ANOVA tests

In Section 5.3 we demonstrated a covariate that can be used to improve power in the simultaneous two-sample testing problem for equality of means in the case of known variances. Here we extend the discussion to the case of unknown variances; cf. Cai et al. [2019] for a comprehensive treatment of more general forms of this problem.

Our data is drawn from model (15). We are interested in testing $H_{i}:\mu_{Y,i}=\mu_{V,i}$ and do not know $\sigma_{i}$ . The optimal test statistic for this situation is the two-sample $t$ -statistic:

[TABLE]

where $\overline{Y_{i}}$ and $\overline{V_{i}}$ are the sample means and $S_{Y,i}^{2}$ and $S_{V,i}^{2}$ the sample variances. In addition, denote by $\hat{\mu}_{i}\coloneqq\frac{1}{2}\left(\overline{Y_{i}}+\overline{V_{i}}\right)$ and $S_{i}^{2}$ the sample mean and sample variance after pooling all observations ( $Y_{i,1},\dotsc,Y_{i,n},V_{i,1},\dotsc,V_{i,n}$ ) and forgetting their labels.

Now note that under the null hypothesis, $\mu_{Y,i}=\mu_{V,i}=\mu_{i}$ and $Y_{i,1},\dotsc,Y_{i,n},V_{i,1},\dotsc,V_{i,n}\sim\mathcal{N}(\mu_{i},\sigma_{i}^{2})$ i.i.d. Then, $(\hat{\mu}_{i},S_{i}^{2})$ is a complete sufficient statistic for the experiment, while $T_{i}$ is ancillary for $(\mu_{i},\sigma_{i}^{2})$, because it is unchanged when all observations are shifted by a common constant or rescaled by a common positive constant. Thus, by Basu’s theorem, $(\hat{\mu}_{i},S_{i}^{2})$ is independent of $T_{i}$ and we can use it as a covariate.
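The independence claim can be checked numerically. The following is a minimal simulation sketch, not code from the paper: it assumes equal group sizes $n$, unit variance, and the standard equal-variance two-sample $t$-statistic, and the function name `simulate_t_and_pooled_variance` is hypothetical. Under the null, the sample correlation between $|T_{i}|$ and $S_{i}^{2}$ should be close to zero, which is a necessary consequence of independence (not a proof of it).

```python
import numpy as np


def simulate_t_and_pooled_variance(m, n, delta, rng):
    """Simulate m two-sample problems with n observations per group.

    delta is the mean difference mu_Y - mu_V (delta = 0 gives the null).
    Returns the two-sample t-statistics and the label-free pooled variances.
    """
    y = rng.normal(loc=delta, scale=1.0, size=(m, n))
    v = rng.normal(loc=0.0, scale=1.0, size=(m, n))
    # Equal-variance two-sample t-statistic for equal group sizes.
    t = np.sqrt(n) * (y.mean(axis=1) - v.mean(axis=1)) / np.sqrt(
        y.var(axis=1, ddof=1) + v.var(axis=1, ddof=1)
    )
    # Sample variance of all 2n observations, ignoring the group labels.
    s2 = np.concatenate([y, v], axis=1).var(axis=1, ddof=1)
    return t, s2


rng = np.random.default_rng(0)
t_null, s2_null = simulate_t_and_pooled_variance(m=20000, n=5, delta=0.0, rng=rng)

# Under the null, |T_i| and S_i^2 are independent, so this should be near zero.
corr_null = np.corrcoef(np.abs(t_null), s2_null)[0, 1]
print(f"null correlation between |T| and pooled variance: {corr_null:.4f}")
```

A near-zero correlation here is only a sanity check; the formal argument remains Basu's theorem.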

Now consider $S_{i}^{2}$ in particular and note that under the null it follows a scaled $\chi^{2}$-distribution. Under the alternative, on the other hand, we expect $S_{i}^{2}$ to take larger values with high probability, especially if $|\mu_{Y,i}-\mu_{V,i}|$ is large. Therefore, if we conduct $m$ $t$-tests, each with unknown variance $\sigma_{i}^{2}$, and if we assume $\sigma_{i}\sim G$ for a concentrated distribution $G$, then hypotheses with high $S_{i}^{2}$ are more likely to be true alternatives (and also likely to be alternatives with high power). Thus, the overall variance (ignoring sample labels) is not only independent of the p-values under the null hypothesis, but also informative about the alternatives. Using it as a covariate can lead to a large power increase in simultaneous two-sample $t$-tests [Bourgon et al., 2010, Ignatiadis et al., 2016]. The result extends to more complex ANOVA settings.
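To make the claim about larger values under the alternative concrete, here is a short worked calculation. It assumes that $S_{i}^{2}$ is computed with the usual denominator $2n-1$ and writes $\delta_{i}\coloneqq\mu_{Y,i}-\mu_{V,i}$ (notation introduced here for illustration only). Decomposing the pooled sum of squares into its within-group and between-group parts gives

$$(2n-1)\,S_{i}^{2}=(n-1)\left(S_{Y,i}^{2}+S_{V,i}^{2}\right)+\frac{n}{2}\left(\overline{Y_{i}}-\overline{V_{i}}\right)^{2}.$$

Since $\mathbb{E}[S_{Y,i}^{2}]=\mathbb{E}[S_{V,i}^{2}]=\sigma_{i}^{2}$ and $\mathbb{E}\big[(\overline{Y_{i}}-\overline{V_{i}})^{2}\big]=\delta_{i}^{2}+2\sigma_{i}^{2}/n$, taking expectations yields

$$\mathbb{E}\left[S_{i}^{2}\right]=\sigma_{i}^{2}+\frac{n}{2(2n-1)}\,\delta_{i}^{2}.$$

Under these assumptions, the expected pooled variance equals $\sigma_{i}^{2}$ under the null and exceeds it under the alternative by a term that grows quadratically in the mean difference.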

For a second example of the usefulness of $(\hat{\mu}_{i},S_{i}^{2})$ in this setting, consider the screening statistic $|\hat{\mu}_{i}|/S_{i}$. This can be interpreted as a statistic for the null hypothesis $\mu_{Y,i}=\mu_{V,i}=0$. If we believe a priori that for many of the hypotheses $i$ with $\mu_{Y,i}=\mu_{V,i}$ a sparsity condition holds, so that in fact $\mu_{Y,i}=\mu_{V,i}=0$ [Liu, 2014], then large values of this statistic are more likely to correspond to alternatives, cf. Section 5.3.

Remark 3.

In single hypothesis testing, there is nothing to be gained from $(\hat{\mu}_{i},S_{i}^{2})$. Its usefulness only emerges in the multiple testing setup.

S6.2.3 Ratio of number of observations in each group in two-sample tests

For yet another example, revisit the two-sample situation, but now assume that for the $i$-th hypothesis, we have $n_{1,i}$ observations from the first population and $n_{2,i}$ observations from the second population, such that $n_{1,i}+n_{2,i}=n_{i}$. Then $n_{1,i}\,n_{2,i}/n_{i}^{2}$ is a statistic that is related to the alternative distribution, with values close to $\frac{1}{4}$ (balanced groups) implying higher power [Roquain and Van De Wiel, 2009]. This statistic is also related to the Minor Allele Frequency (MAF) in genome-wide association studies [Boca and Leek, 2018].
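The link between this ratio and power can be illustrated with a small calculation. The sketch below is an assumption-laden simplification rather than the paper's setup: it uses a two-sided $z$-test with known unit variance (instead of a $t$-test), a standardized mean difference of 0.5, and hypothetical helper names `normal_cdf` and `z_test_power`. In that setting the variance of the difference of sample means is $n_{i}/(n_{1,i}\,n_{2,i})$, so the noncentrality is proportional to $\sqrt{n_{i}\cdot n_{1,i}n_{2,i}/n_{i}^{2}}$.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def z_test_power(n1, n2, effect=0.5, z_crit=1.959964):
    """Balance ratio n1*n2/n^2 and power of a two-sided z-test at level 0.05.

    effect is the standardized mean difference (an illustrative assumption).
    """
    n = n1 + n2
    balance = n1 * n2 / n**2
    # Noncentrality: effect divided by the standard error sqrt(1/n1 + 1/n2).
    ncp = effect * sqrt(n * balance)
    power = normal_cdf(ncp - z_crit) + normal_cdf(-ncp - z_crit)
    return balance, power


# Same total sample size, increasingly unbalanced allocations.
for n1, n2 in [(50, 50), (80, 20), (95, 5)]:
    balance, power = z_test_power(n1, n2)
    print(f"n1={n1:3d} n2={n2:3d} balance={balance:.4f} power={power:.3f}")
```

With the total sample size held fixed, the computed power decreases as the allocation moves away from the balanced case, which is why the ratio carries information about the alternative distribution.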

S6.2.4 Sign of estimated effect size

As a final example of a statistical covariate, consider a two-sided test where the null distribution is symmetric and the test statistic is the absolute value of a symmetric statistic $T_{i}$. Then, the sign of $T_{i}$ is independent of the p-value under the null hypothesis. However, we might believe a priori that, among the alternatives, one sign of the effect size is more common than the other. Thus, the sign can be used as an informative covariate. Previous uses of stratification by sign to improve power include the SAM (significance analysis of microarrays) procedure [Tusher et al., 2001].
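A small simulation can illustrate both properties of the sign covariate. This is a sketch under assumptions that are not taken from the paper: standard normal test statistics under the null, 10% alternatives that all have mean $+3$, and a hypothetical helper `two_sided_pvalues`. Among nulls, the average p-value should be about the same for both signs, whereas small p-values should be concentrated among positive signs overall.

```python
import numpy as np
from math import erfc, sqrt


def two_sided_pvalues(z):
    """Two-sided normal p-values: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))."""
    return np.array([erfc(abs(value) / sqrt(2.0)) for value in z])


rng = np.random.default_rng(1)
m = 20000
is_alt = rng.random(m) < 0.1                       # roughly 10% alternatives
z = rng.normal(size=m) + np.where(is_alt, 3.0, 0.0)  # alternatives shifted by +3
p = two_sided_pvalues(z)
sign = np.sign(z)

# Among nulls, the p-value distribution should not depend on the sign.
null_mean_p_pos = p[(~is_alt) & (sign > 0)].mean()
null_mean_p_neg = p[(~is_alt) & (sign < 0)].mean()

# Across all hypotheses, small p-values should be enriched for positive signs.
frac_small_pos = (p[sign > 0] < 0.01).mean()
frac_small_neg = (p[sign < 0] < 0.01).mean()

print(f"null mean p-value: sign>0 {null_mean_p_pos:.3f}, sign<0 {null_mean_p_neg:.3f}")
print(f"fraction p<0.01:   sign>0 {frac_small_pos:.3f}, sign<0 {frac_small_neg:.3f}")
```

In a weighting procedure, such an enrichment would justify assigning larger weights to the stratum with positive sign.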

Bibliography

1. Allison et al. [2002] David B Allison, Gary L Gadbury, Moonseong Heo, José R Fernández, Cheol-Koo Lee, Tomas A Prolla, and Richard Weindruch. A mixture model approach for the analysis of microarray gene expression data. Computational Statistics & Data Analysis, 39(1):1–20, 2002.
2. Arias-Castro and Chen [2017] Ery Arias-Castro and Shiyun Chen. Distribution-free multiple testing. Electronic Journal of Statistics, 11(1):1983–2001, 2017.
3. Baillie et al. [2018] J. Kenneth Baillie, Andrew Bretherick, Christopher S. Haley, Sara Clohisey, Alan Gray, Lucile P. A. Neyton, Jeffrey Barrett, Eli A. Stahl, Albert Tenesa, Robin Andersson, J. Ben Brown, Geoffrey J. Faulkner, Marina Lizio, Ulf Schaefer, Carsten Daub, Masayoshi Itoh, Naoto Kondo, Timo Lassmann, Jun Kawai, Damian Mole, Vladimir B. Bajic, Peter Heutink, Michael Rehli, Hideya Kawaji, Albin Sandelin, Harukazu Suzuki, Jack Satsangi, Christine A. Wells, Nir Hacohen, Thomas C. Fre
4. Barber and Candès [2015] Rina Foygel Barber and Emmanuel J Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
5. Basu et al. [2018] Pallavi Basu, T Tony Cai, Kiranmoy Das, and Wenguang Sun. Weighted false discovery rate control in large-scale multiple testing. Journal of the American Statistical Association, 113(523):1172–1183, 2018.
6. Benjamini [2008] Yoav Benjamini. Comment: Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):23–28, 2008.
7. Benjamini and Hochberg [1995] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 289–300, 1995.
8. Benjamini and Hochberg [1997] Yoav Benjamini and Yosef Hochberg. Multiple hypotheses testing with weights. Scandinavian Journal of Statistics, 24(3):407–418, 1997.
