Powerful large scale inference in high dimensional mediation analysis

Asmita Roy; Xianyang Zhang

PMC · DOI:10.1371/journal.pcbi.1013880·January 14, 2026

Powerful large scale inference in high dimensional mediation analysis

Asmita Roy, Xianyang Zhang

PDF

Open Access

TL;DR

This paper introduces MLFDR, a new method for high-dimensional mediation analysis that improves statistical power and identifies more significant mediators in genome-wide studies.

Contribution

The novel MLFDR method extends local false discovery rate principles to composite null hypotheses in high-dimensional mediation analysis.

Findings

01

MLFDR asymptotically controls false discovery rate and achieves superior statistical power in simulations.

02

MLFDR identified 20%–50% more significant mediators in real data compared to existing methods.

03

The method includes a two-step global latent factor adjustment using surrogate variable analysis.

Abstract

In genome-wide epigenetic studies, determining how exposures (e.g., Single Nucleotide Polymorphisms) affect outcomes (e.g., gene expression) through intermediate variables, such as DNA methylation, is a key challenge. Mediation analysis provides a framework to identify these causal pathways; however, testing for mediation effects involves a complex composite null hypothesis. Existing methods, such as Sobel’s test or the Max-P test, are often underpowered in this context because they rely on null distributions determined under only a subset of the null space and are not optimized for the multiple testing burden inherent in high-dimensional data. To address these limitations, we introduce MLFDR (Mediation Analysis using Local False Discovery Rates), a novel method for high-dimensional mediation analysis. MLFDR leverages local false discovery rates, calculated from the coefficients of…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes6

TRIM26 IRX4 LY6K MAPT TCIRG1 CFAP251

Proteins6

Species1

Homo sapiens(human · species)

Chemicals1

MLFDR

Mutations5

rs3129859 rs5945619 rs7767188 rs12653946 rs3096702

Figures11

Click any figure to enlarge with its caption.

Fig 2 — FDR and power comparison for the linear model (Setting 1).Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

Fig 3 — FDR and power comparison for the linear model with measured confounders (Setting 2).Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

Fig 4 — FDR and power comparison for the linear model with binary outcomes (Setting 3).Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

Fig 5 — FDR and power comparison for the linear model with unknown mediator-exposure interactions.Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

Fig 6 — FDR and power comparison for the linear model with unmeasured confounders.Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

Fig 7 — FDR and power comparison for composite alternatives.Results are displayed for varying degrees of mediation (τ). Gray ribbons indicate error margins.

Fig 8 — AIC for methylation coefficients when a d2 + 1 component Gaussian mixture model was fit.

Fig 9 — SNP-CpG-Gene triplets identified by MDACT, MLFDR, and HDMT out of 69,602 tests.The Venn diagram (left) corresponds to an FDR cutoff of 0.01.

Fig 10 — Smoking-CpG-Gene pathways detected by MDACT, MLFDR, and HDMT out of 319,761 tests.The Venn diagram (left) displays overlaps at an FDR cutoff of 0.01, while the plot (right) shows detection counts across varying FDR thresholds.

Fig 11 — Smoking-CpG-gene pathways detected by all methods under the global null out of m=1,000 tests.The proportion of cases reporting positive detection counts is reported across varying FDR thresholds. The gray solid line indicates target FDR levels.

Equations43

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsBayesian Modeling and Causal Inference · Genetic Associations and Epidemiology · Advanced Causal Inference Techniques

Full text

1 Introduction

Mediation analysis serves as a critical tool for deciphering the biological mechanisms underlying genetic associations with diseases identified in Genome-Wide Association Studies (GWAS). By bridging the gap between genetic variants and clinical outcomes, mediation analysis reveals intermediate pathways and elucidates causal relationships. As GWAS continues to uncover a vast number of genetic associations, translating these findings into actionable insights for precision medicine and therapeutic development becomes increasingly important. For instance, cigarette smoking is known to alter DNA methylation and gene expression [12]; concurrently, DNA methylation often regulates gene expression directly [4,14]. Investigating the mediating effect of DNA methylation on gene expression—particularly in the presence of environmental exposures like smoking—is therefore essential. However, these analyses are complicated by high-dimensional outcomes and clinical confounders, such as patient age, which influences both gene expression and DNA methylation heterogeneity [7,19]. This article addresses the statistical challenges inherent in such high-dimensional mediation problems.

Historically, [1] introduced the regression-based definition of mediation analysis, often referred to as the “product of coefficients method,” which examines the significance of the product of the exposure-mediator and mediator-outcome coefficients. More recently, the literature has expanded through the “counterfactual framework” [8,15,17,23–26], which provides a causal interpretation for natural direct and indirect effects across various models, including those with non-linearities and binary or survival outcomes.

Let X denote the exposure, Mi the ith mediator, and Y the outcome. Under the product of coefficients approach, mediation analysis tests the null hypothesis $[eqn]$ , where $[eqn]$ represents the effect of X on Mi, and $[eqn]$ represents the effect of Mi on Y. This creates a composite null hypothesis comprising three distinct cases: (i) $[eqn]$ ; (ii) $[eqn]$ ; or (iii) $[eqn]$ . Assuming no unmeasured confounders, classical tests like Max-P [13] and Sobel’s test [18] are known to be conservative under case (iii), as statistical inference is typically derived from distributions determined by cases (i) and (ii). In genome-wide studies, however, the sparse nature of omics data implies that $[eqn]$ and $[eqn]$ hold for the majority of markers. Recent methods such as JS-mixture (HDMT) [3] and DACT [11] attempt to address this by explicitly modeling the composite nature of the null. JS-mixture improves power by using a mixture-null distribution of maximum p-values, adapting [20]’s procedure to estimate component proportions. DACT estimates the proportions of null $[eqn]$ and $[eqn]$ separately to combine case-specific p-values. However, [30] recently demonstrated that DACT suffers from False Discovery Rate (FDR) inflation under dense alternatives and proposed a modified version (MDACT) that computes the statistic’s distribution via numerical integration to improve p-value accuracy.

While JS-mixture and MDACT offer improvements over classical methods, they are not theoretically optimal regarding power. The FDR literature is broadly divided into p-value-based and local FDR-based rejection regions. Local FDR, a Bayesian approach, ranks hypotheses by the posterior probability that a case is null given the observed statistics; this ranking often differs from that based on p-values. [22] demonstrated that, except in cases of symmetric alternatives, local FDR and p-value-based orderings diverge. Furthermore, [22] proved that the local FDR-based oracle procedure is optimal: among all methods controlling the marginal FDR (mFDR), the local FDR approach yields the highest number of rejections. While the power advantage is negligible for symmetric alternatives, it becomes significant when the alternative distribution is asymmetric. Motivated by these theoretical properties, we propose MLFDR, a local FDR-based screening algorithm designed specifically for high-dimensional mediation analysis. Our contributions to the literature are as follows:

We extend the concept of local FDR to the composite null hypothesis setting, deriving a screening rule with a closed-form expression for the corresponding false discovery proportion (FDP).We validate the method across a diverse array of data types—including continuous and binary variables, and scenarios with exposure-mediator interactions—demonstrating robust performance across various model specifications. We specifically incorporate Surrogate Variable Analysis (SVA) to adjust for latent confounding and illustrate the method’s efficacy in multiple mediator setups with univariate or clinical outcomes.MLFDR offers optimal power improvement over existing methods while maintaining asymptotic FDR control. Extensive simulations confirm its superiority over MDACT and HDMT in terms of power and error rate control.We provide theoretical guarantees for the identifiability and global optimality of our model under relatively mild assumptions, proving FDR control for both the oracle and adaptive procedures.

The remainder of this paper is organized as follows. Sect 2 contains the main results. Specifically, Sect 2.1 outlines the screening procedure for detecting significant mediators. Sect 2.2 presents simulation studies. Sect 2.3 discusses extensions of MLFDR to composite alternatives and latent factor models, which can account for unmeasured confounding and pleiotropy. Sect 3 provides an in-depth analysis of Prostate Cancer data and Lung cancer data from The Cancer Genome Atlas (TCGA), exploring SNP-CpG-gene expression pathways and causal pathways between smoking habits and gene expression, respectively. Sect 5 details the methodology. An R package implementing the method is available at https://github.com/asmita112358/MLFDR as well as in CRAN. Theorems proving the large-sample FDR control of MLFDR are provided in Section E of S1 Text.

Finally, we distinguish our approach from other recent efforts in high-dimensional mediation analysis. [33], [32], and [16] address the multiple mediator problem specifically for survival outcomes. Closer to our framework, [21] and [5] utilize local FDR-based rejection regions; the former approximates the alternative as a mixture of Gaussian distributions, while the latter constructs regions based on p-values. Our work advances this domain in two specific aspects: (i) we incorporate a general prior for the coefficients α and β to estimate the exact posterior density for computing the local FDR, rather than relying on approximations; and (ii) we offer theoretical guarantees for the local FDR estimates obtained via the EM algorithm, a property not previously established in this context.

2 Results

2.1 Method overview

This section outlines the workflow of MLFDR; a schematic representation of the framework is provided in Fig 1. Consider a study involving n independent samples. For each testing unit $[eqn]$ , we observe an exposure variable Xi, a mediator Mi, and an outcome Yi. Biologically, these variables may represent distinct contexts: for example, X may denote a patient’s smoking history (shared across i), with $[eqn]$ representing CpG methylation sites and $[eqn]$ representing gene expression levels. Alternatively, the analysis may focus on the functional impact of Single Nucleotide Polymorphisms (Xi) on gene expression (Yi) as mediated by CpG methylation (Mi) [3].

Schematic diagram of MLFDR.

The mediation model posits that the exposure Xi influences the outcome Yi through the intermediate variable Mi, rather than solely through a direct relationship. We denote the coefficient for the exposure-mediator relationship ( $[eqn]$ ) as $[eqn]$ , and the coefficient for the mediator-outcome relationship ( $[eqn]$ ) as $[eqn]$ . In Fig 1, solid arrows indicate these direct effects.

We aim to test the composite null hypothesis against the alternative for each unit i:

[eqn]

The composite null hypothesis H0,i can be decomposed into three disjoint component nulls, $[eqn]$ , defined as:

[eqn]

for $[eqn]$

We consider a mixture prior for $[eqn]$ , where the probability of each disjoint component nulls $[eqn]$ and H01 occur with probability $[eqn]$ and $[eqn]$ respectively. The marginal prior distributions of $[eqn]$ and $[eqn]$ , respectively, are degenerate zero under the null and follow a normal prior with an unknown mean and variance under the alternative. The marginal distribution of the least squares coefficient estimates $[eqn]$ given the latent states is computed, and the unknown parameters including the null proportions are estimated using EM algorithm.

Using these estimates, we compute the local false discovery rate (local FDR) for each coefficient pair, denoted as $[eqn]$ for $[eqn]$ . As the local FDR represents the posterior probability that the i-th hypothesis is null given the observed statistics, a lower value indicates stronger evidence against the null. Consequently, we define the rejection region for the composite null hypothesis as $[eqn]$ . The threshold δ is determined adaptively using the step-up procedure proposed by [22]. The complete algorithm is detailed in the Methods section.

Additionally, we introduce an extended algorithm (MLFDR2) which can deal with scenarios where the marginal priors of $[eqn]$ and $[eqn]$ follow mixture normal distributions under the alternative. A composite alternative leads to a joint distribution of $[eqn]$ with more than 4 mixture components, which can often be computationally burdensome. We introduce a two-step EM alogrithm that estimates the parameters of the marginal distributions of $[eqn]$ and $[eqn]$ in the first step, then uses these estimates to run another EM algorithm that computes the probabilities of each mixture component.

We also discuss another extension using Surrogate Variable Analysis [10] which can account for unmeasured confounders and pleiotropy in the model. Details are presented in Sect 2.3.

2.2 Simulation studies

We evaluate the performance of MLFDR through extensive simulations under two distinct mixture proportion scenarios: a dense alternative and a sparse alternative. Following the setups in [3], the latent class probabilities are defined as:

Dense alternative: $[eqn]$ Sparse alternative: $[eqn]$

We consider sample sizes of $[eqn]$ and fix the number of mediators at m = 1000. The parameter controlling the signal strength of mediation, τ, varies from 0.1 to 1.9 in increments of 0.2. The non-zero coefficients are generated as $[eqn]$ (under H10 and H11) and $[eqn]$ (under H01 and H11), where the noise terms follow $[eqn]$ and $[eqn]$ .

We compare the empirical FDR and power of MLFDR against two competing methods: MDACT [30] and HDMT (JS-mixture) [3]. The simulation settings are detailed below.

Linear Model. The exposure is univariate with $[eqn]$ .

[eqn]

where $[eqn]$ , and error terms $[eqn]$ . The results for this setting are summarized in Fig 2.Linear Model with measured Confounder. The exposure is univariate with $[eqn]$ . We introduce a confounder $[eqn]$ :

[eqn]

where the confounder effects are drawn independently from $[eqn]$ . The results are presented in Fig 3.Binary Outcome. The exposure is univariate with $[eqn]$ . The outcome Yi is binary:

[eqn]

The results are displayed in Fig 4.

FDR and power comparison for the linear model (Setting 1).Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

FDR and power comparison for the linear model with measured confounders (Setting 2).Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

FDR and power comparison for the linear model with binary outcomes (Setting 3).Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

Across all settings, the three methods demonstrated satisfactory FDR control. However, MLFDR consistently exhibited the highest power. Specifically, MLFDR achieved an average power improvement of 10.83% over MDACT and 12.23% over HDMT under dense alternatives. Under sparse alternatives, MLFDR maintained its advantage with an average improvement of 7.47% over MDACT and 8.51% over HDMT.

2.3 Extensions

2.3.1 Latent factors.

Unmeasured latent factors may be addressed by surrogate variable analysis [10]. Briefly, surrogate variable analysis considers the following model:

[eqn]

where X, Z are measured covariates, and $[eqn]$ are unmeasured latent factors. Surrogate variable analysis produces a set of K mutually orthogonal vectors $[eqn]$ (where $[eqn]$ ), which span the same linear space as the latent factors. Thus, the original equation may be re-written as:

[eqn]

These estimated factors, collected into a matrix $[eqn]$ , account for unmeasured confounding in the exposure-mediator relationship.

For the mediator-outcome relationship, the latent factors may be modeled as follows:

[eqn]

In this setting, the latent terms may account for: 1) unmeasured confounding; 2) measured confounders with unknown relationships to the outcome (e.g., global batch effects); and 3) pleiotropy, where Yi is influenced by mediators other than Mi.

Direct application of SVA to model (7) would require estimating surrogate variables for each mediator-outcome pair iteratively, leading to a computational bottleneck. To address this, we propose a global factor adjustment. We estimate a second set of surrogate variables, $[eqn]$ , based on the outcome null model (excluding mediators):

[eqn]

We then use the combined set of latent factors $[eqn]$ (derived from mediators) and $[eqn]$ (derived from outcome residuals) to model the mediator-outcome relationship. The validity of using $[eqn]$ from (8) relies on the assumption that any single mediator Mi contributes a relatively small amount of variance to the global outcome matrix Y.

The full procedure is summarized in Algorithm 1. We implemented surrogate variable analysis via the Bioconductor package sva [9].

Algorithm 1 Two-step global factor adjustment for high-dimensional mediation

Require: Outcome matrix $[eqn]$ , Mediator matrix $[eqn]$ , Exposure vector $[eqn]$ , Covariates $[eqn]$ .

Ensure: Adjusted estimates for $[eqn]$ and $[eqn]$ for $[eqn]$ .

Step 1: Estimate Latent Factors for Mediators ( $[eqn]$ )

1: Regress M on X and Z to obtain the residual matrix $[eqn]$ :

[eqn]

2: Perform SVA on $[eqn]$ to extract the top kM principal components:

[eqn]

Step 2: Estimate Latent Factors for Outcomes ( $[eqn]$ )

3: Regress Y on X and Z (excluding M) to obtain the residual matrix $[eqn]$ :

[eqn]

4: Perform SVA on $[eqn]$ to extract the top kY principal components:

[eqn]

Step 3: Mediation Analysis with Global Adjustment

5: for $[eqn]$ do

6: Mediator Model: Regress Mi on X, Z, and $[eqn]$ :

[eqn]

7: Outcome Model: Regress Yi on Mi, X, Z, $[eqn]$ , and $[eqn]$ :

[eqn]

8: Extract coefficients $[eqn]$ and their standard errors for MLFDR.

9: end for

We apply this method to two data generating scenarios to demonstrate its performance.

Unknown mediator-exposure interactions.

[eqn]

where S is a randomly selected subset of indices $[eqn]$ with $[eqn]$ , treated as unknown during model fitting. The results are shown in Fig 5.

FDR and power comparison for the linear model with unknown mediator-exposure interactions.Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

Unmeasured confounding and pleiotropy.

[eqn]

where Z1 and Z2 represent unmeasured confounders not included in the model fitting. S is a randomly selected subset of indices $[eqn]$ with $[eqn]$ . The term $[eqn]$ represents dense pleiotropy (the effect of other mediators on Yi), which acts as an additional source of unmeasured variation. The results are shown in Fig 6.

FDR and power comparison for the linear model with unmeasured confounders.Results are displayed for both sparse and dense alternatives. Gray ribbons indicate error margins.

2.3.2 Composite alternatives.

In this setting, the coefficients $[eqn]$ follow a Gaussian mixture distribution. The posterior distribution of the estimated coefficients is given by:

[eqn]

Under this framework, the null hypothesis corresponds to a mixture of 5 bivariate Gaussian distributions, while the alternative hypothesis comprises a mixture of 4 bivariate Gaussian distributions. The parameters are estimated using a two-step EM algorithm, the details of which are provided in Section C of S1 Text.

In our simulations, we set the mixture weights to $[eqn]$ and $[eqn]$ . The variance parameters were fixed at $[eqn]$ . The mean parameters were defined as functions of the mediation signal strength τ: $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ , where τ ranges from 0.1 to 1.9 in increments of 0.2.

Fig 7 presents the results for $[eqn]$ with m = 1000. All three methods maintained satisfactory FDR control, with MLFDR demonstrating the highest power.

FDR and power comparison for composite alternatives.Results are displayed for varying degrees of mediation (τ). Gray ribbons indicate error margins.

All methods appear to have satisfactory FDR control. MLFDR is uniformly more powerful than HDMT and MDACT in all cases, with better margin of improvement in dense alternatives.

3 Real data analysis

3.1 TCGA prostate cancer data

We apply MLFDR to the Prostate Cancer dataset from The Cancer Genome Atlas (TCGA), previously analyzed by [3]. The study involves mediation analysis for 147 prostate cancer risk SNPs, integrated with DNA methylation and gene expression data from 495 samples. For each risk SNP, we identified CpG methylation probes within a 500 kb window and recorded the gene expression levels for the corresponding probes. This resulted in $[eqn]$ SNP-CpG-Gene triplets for mediation testing.

In the first stage, we regressed CpG methylation on the SNPs, adjusting for the top 3 principal components (PCs) of genotypes, the top 15 PCs of CpG methylation, age at diagnosis, and pathological stage. From this, we obtained the slope estimates, variances, and p-values for the SNPs. In the second stage, gene expression was regressed on CpG methylation, conditional on the same set of covariates.

The estimated null proportion components $[eqn]$ were (0.51,0.033,0.41) for HDMT, compared to (0.39,0.004,0.59) obtained via the EM algorithm in MLFDR.

Due to the wide spread of the methylation coefficients (β), we fitted a composite alternative Gaussian mixture model. The number of components, d2 = 8, was selected based on the Akaike Information Criterion (AIC) (Fig 8). Conversely, the SNP coefficients (α) exhibited a narrower range (–0.2 to 0.4) and were adequately modeled using a d1 = 2 component Gaussian mixture.

AIC for methylation coefficients when a d2 + 1 component Gaussian mixture model was fit.

At an FDR threshold of 0.01, HDMT identified 137 triplets, MLFDR identified 187, and MDACT identified 180. Fig 9 displays a Venn diagram of the overlapping discoveries, along with the number of rejections across FDR cutoffs ranging from 0.001 to 0.05. MLFDR consistently detected more pathways than the competing methods.

SNP-CpG-Gene triplets identified by MDACT, MLFDR, and HDMT out of 69,602 tests.The Venn diagram (left) corresponds to an FDR cutoff of 0.01.

Table 1 lists 10 additional pathways identified by MLFDR (but missed by HDMT or MDACT), ranked by local FDR. Notably, six of these triplets involve rs12653946, a known prostate cancer risk variant that influences the expression of IRX4, a tumor suppressor gene in the prostate [6]. Another pathway involves rs7767188, a risk SNP associated with prostate cancer through the expression of TRIM26 [29].

Table 1: Top 10 additional pathways detected by MLFDR in TCGA Prostate Cancer Data (ranked by pmax).These pathways were not detected by HDMT or MDACT.

These empirical results align with our simulation studies: HDMT yielded the fewest discoveries, followed by MDACT, with MLFDR providing the highest detection rate. We note that the improvement of MLFDR over MDACT is modest in this application. This is likely attributable to the symmetric distribution of α and β; as noted by [22], local FDR-based tests offer limited power gains over p-value-based methods when the alternative distribution is symmetric around the null.

3.2 TCGA lung squamous cell carcinoma

We further extend our analysis to the TCGA Lung Squamous Cell Carcinoma (LUSC) dataset to investigate the mediating role of CpG methylation in the relationship between smoking history and gene expression. The data were acquired using the R package UCSCXenaTools [28]. We restricted the analysis to primary tumor samples, resulting in a sample size of n = 379 after preprocessing.

Smoking history was quantified by pack-years. Using the publicly available probe map data from the TCGA website, each CpG probe was mapped to the expression profiles of potentially multiple genes; each unique CpG-gene pair was treated as a distinct candidate mediation pathway. The analysis proceeded in two stages. In the first stage, CpG methylation beta values were regressed on smoking history (pack-years). In the second stage, a multiple linear regression was fitted for each gene, including all CpG probes mapped to that gene as predictors. The coefficients and p-values for each specific CpG-gene pair were then extracted to form the mediation hypotheses. All models were adjusted for potential confounders, including sex and the age at initial diagnosis. In total, 319,761 CpG-gene pathways were evaluated.

At an FDR threshold of 0.01, HDMT identified 13 pathways, MDACT identified 25 pathways, and MLFDR identified 44 pathways (Fig 10). The results highlight the increased power of MLFDR in detecting subtle mediation signals. Table 2 details the top 10 additional pathways detected by MLFDR (ranked by local FDR) that were not identified by the competing methods. Several of these findings are supported by existing literature. For instance, [27] discuss the relevance of WDR66 in lung cancer progression, while [2] highlights the role of LY6K as a potential therapeutic target in lung squamous cell carcinoma. [31] links TCIRG1 to metastatic potential of hepatocellular carcinoma.

Smoking-CpG-Gene pathways detected by MDACT, MLFDR, and HDMT out of 319,761 tests.The Venn diagram (left) displays overlaps at an FDR cutoff of 0.01, while the plot (right) shows detection counts across varying FDR thresholds.

Table 2: Top 10 additional pathways detected by MLFDR in the TCGA LUSC dataset (ranked by pmax).These pathways were not detected by HDMT or MDACT.

To evaluate the FDR of the three methods, we selected a subset of the data with $[eqn]$ tests, and permuted the samples to create a “global null” scenario. For each permutation, the smoking-Cpg and CpG-gene expression model is fit, and the three methods are implemented. At a given permutation, if any method has non-zero rejections, the False Positive rate for that permutation is recorded as 1 for that method. This procedure is repeated over 100 permutations, and the number of rejections is recorded for FDR levels varying from 0.0001 to 0.2. Fig 11 presents the proportion of cases in which each method came up with significant rejections under the global null. HDMT and MLFDR appear to have satisfactory control of the FDR, while MDACT appears to have some FDR inflation at low FDR levels.

Smoking-CpG-gene pathways detected by all methods under the global null out of m=1,000 tests.The proportion of cases reporting positive detection counts is reported across varying FDR thresholds. The gray solid line indicates target FDR levels.

4 Discussion

We have presented a flexible and powerful screening algorithm for detecting causal pathways in high-dimensional mediation analysis. MLFDR is capable of dealing with a wide array of outcome variables, confounders, and mediation exposure interactions. It can work in tandem with surrogate variable analysis to address complex dependence structures like pleiotropy and unmeasured confounding. Through simulations and theoretical analysis, we have shown that the proposed method is a viable alternative to p-value-based screening methods, which may not produce optimal results depending on the structure of the alternative hypothesis. We have also shown theoretical guarantees of FDR control for MLFDR. Overall, we hope that MLFDR will be used as an alternative to p-value based methods for mediator screening in the future.

5 Methods

5.1 The mediation model

Consider a univariate exposure X, a set of mediators $[eqn]$ , and m outcomes $[eqn]$ . We assume the following structural equation model:

[eqn]

where the error terms satisfy $[eqn]$ and are distributed as:

[eqn]

In settings with high-dimensional exposures, such as Single Nucleotide Polymorphisms (SNPs), we define the hypothesis based on the Exposure-Mediator-Outcome triplet $[eqn]$ . In such cases, the common exposure X in Eq (11) is replaced by the specific exposure Xi.

To describe the methodology, we focus on the formulation in Model (11). Given n independent samples $[eqn]$ , our goal is to test the joint significance of $[eqn]$ and $[eqn]$ for $[eqn]$ . Let $[eqn]$ , $[eqn]$ , $[eqn]$ , and define the projection matrix $[eqn]$ . The ordinary least squares (OLS) estimators are given by:

[eqn]

Conditional on X and $[eqn]$ , the estimators follow the distribution:

[eqn]

where $[eqn]$ and $[eqn]$ . Note that $[eqn]$ and $[eqn]$ are independent, as $[eqn]$ . We test the composite null hypothesis against the alternative:

[eqn]

Denote by $[eqn]$ the latent vector indicating the underlying truth of the i-th hypothesis, where $[eqn]$ and $[eqn]$ . The vector $[eqn]$ takes values in the set $[eqn]$ . Let $[eqn]$ represent the count of hypotheses in each state for $[eqn]$ .

We assume a prior distribution $[eqn]$ , where $[eqn]$ and $[eqn]$ . Conditional on the latent states, the non-zero effect sizes are modeled as Gaussian:

[eqn]

When $[eqn]$ , $[eqn]$ (and analogously for $[eqn]$ ). Assuming that $[eqn]$ and $[eqn]$ are independent conditional on $[eqn]$ , the marginal distribution of the coefficient estimates $[eqn]$ given the latent states is:

[eqn]

where $[eqn]$ denote the variances of the estimates $[eqn]$ conditional on $[eqn]$ .

We employ an Expectation-Maximization (EM) algorithm to estimate the unknown parameters $[eqn]$ ; details are provided in Section B of the S1 Text. Although MLFDR is derived under the framework of Model (11), it is applicable to a broader range of settings, including binary outcomes, mediator-exposure interactions, and models with confounders. The extensive simulations presented earlier demonstrate the method’s robustness across these varied scenarios.

5.2 Extension: Composite alternative

In the last section we introduced the latent variable $[eqn]$ to characterize the underlying state of the hypothesis. We now extend this framework to a composite alternative setting, where the non-zero effects are drawn from mixture distributions. Specifically, we assume $[eqn]$ are generated as follows:

[eqn]

[eqn]

Here, the index 0 denotes the null state, such that $[eqn]$ (i.e., a degenerate distribution at zero). As before, $[eqn]$ and $[eqn]$ are assumed independent conditional on the latent state $[eqn]$ .

Marginalizing over the prior distribution, the joint distribution of the estimators $[eqn]$ follows a Gaussian Mixture Model (GMM):

[eqn]

where $[eqn]$ . In this context, the component null hypotheses map to specific index combinations: H01,i corresponds to $[eqn]$ , while the alternative H11,i corresponds to $[eqn]$ .

The corresponding marginal distributions are given by:

[eqn]

where $[eqn]$ and $[eqn]$ .

Two-step EM algorithm.

Directly fitting a (d1 + 1)(d2 + $[eqn]$ -component bivariate GMM to estimate the joint probability matrix $[eqn]$ is computationally intensive. To mitigate this burden, we propose a two-step Expectation-Maximization (EM) algorithm.

In the first step, we estimate the parameters $[eqn]$ and $[eqn]$ using the univariate marginal distributions described in (Eq 16). While the marginal mixing proportions $[eqn]$ and $[eqn]$ are insufficient to recover the joint distribution $[eqn]$ (without assuming independence), the moment estimates remain valid.

In the second step, we fix these mean and variance estimates and fit a constrained bivariate GMM to the joint data solely to estimate the joint mixing proportions $[eqn]$ . This approach significantly reduces the dimensionality of the optimization problem. Details of this algorithm are provided in Section C of S1 Text, and a supporting simulation is presented in Sect 2.3.

This two-step approach is also applicable to the standard case where $[eqn]$ . In Section D of S1 Text, we compare the standard bivariate EM against this two-step variant. The results indicate that while the two-step method offers substantial computational speedups, it incurs a slight reduction in power. Therefore, we recommend using the standard bivariate EM for simple cases ( $[eqn]$ ) and reserving the two-step EM for complex composite alternatives where computational efficiency is paramount.

5.3 Step-up procedure based on local FDR

[22] demonstrated that for a simple null hypothesis, the oracle local FDR-based rejection region outperforms p-value-based thresholding. Specifically, it achieves a lower marginal False Non-discovery Rate (mFNR) while controlling the marginal False Discovery Rate (mFDR) at the same level. This advantage is particularly pronounced when the alternative distribution is asymmetric about the null. Motivated by these findings, we implement a local FDR-based step-up procedure to identify significant mediation pathways.

We focus our discussion on the Gaussian Mixture Model described in (14). The extension to the more general mixture model in (15) is straightforward. Conditional on the latent variables, the density function of the transformed statistics $[eqn]$ under state Hjk,i is given by:

[eqn]

where $[eqn]$ , $[eqn]$ , and $[eqn]$ denotes the bivariate normal density with mean vector $[eqn]$ , variances $[eqn]$ , and zero correlation.

For notational simplicity, let $[eqn]$ and $[eqn]$ . Under the composite null hypothesis, the joint local FDR is defined as:

[eqn]

This quantity can be estimated by substituting the parameter estimates obtained from the EM algorithm; we denote this estimate by $[eqn]$ . As the local FDR represents the posterior probability of the null hypothesis, a lower value indicates stronger evidence against the null. Consequently, we define the rejection region as $[eqn]$ , where the threshold δ must be determined to control the error rate.

We assume the cumulative distribution function (CDF) of the joint local FDR is given by:

[eqn]

where Gjk(t) is the conditional CDF of $[eqn]$ under hypothesis Hjk:

[eqn]

[eqn]

Oracle procedure.

The oracle procedure assumes that all parameters $[eqn]$ and densities fjk are known. For a given threshold δ, we define the following counting processes:

[eqn]

[eqn]

[eqn]

[eqn]

where $[eqn]$ represents the number of false rejections, $[eqn]$ the total number of rejections, and $[eqn]$ the number of missed discoveries (false negatives). We can express $[eqn]$ as:

[eqn]

Taking the expectation, we obtain:

[eqn]

The mFDR at threshold δ is defined as:

[eqn]

This quantity can be empirically estimated by:

[eqn]

In Section A of the S1 Text, we prove that for a fixed δ, the numerator and denominator of $[eqn]$ are unbiased estimators of the numerator and denominator of $[eqn]$ , respectively. The oracle rejection region for the composite null is defined as $[eqn]$ , where the threshold is selected as:

[eqn]

Adaptive procedure.

In practical applications, the true parameters $[eqn]$ and the component densities fjk are unknown. Consequently, the true local FDR values in Equation (28) are inaccessible. To address this, we substitute the unknown quantities with their estimates obtained via the EM algorithm. By plugging in these estimates, we obtain the estimated local FDR, denoted as $[eqn]$ , which yields an empirical estimate of $[eqn]$ :

[eqn]

Accordingly, the data-adaptive threshold $[eqn]$ is defined as:

[eqn]

The resulting procedure, which rejects the i-th hypothesis if $[eqn]$ , is operationally equivalent to the step-up algorithm detailed in Algorithm 2.

Algorithm 2 A data-adaptive procedure for finding the cutoffs in MLFDR.

Require: EM estimates for the parameters $[eqn]$

1: for $[eqn]$ do

2: Compute $[eqn]$

3: end for

4: Sort the estimated local FDR values in ascending order to obtain the set of order statistics $[eqn]$ , where ties are broken at random.

5: For each $[eqn]$ , compute the average estimated local FDR for the top k statistics:

[eqn]

6: Determine the cutoff index $[eqn]$

7: Reject the null hypotheses corresponding to the smallest k local FDR values, denoted as $[eqn]$

Supporting information

S1 FigComparison of empirical FDR and power across different methods for MLFDR and two-step MLFDR.(TIFF)

S1 TextSupplementary file detailing the methods and the theory.(PDF)

Bibliography33

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J Pers Soc Psychol. 1986;51(6):1173–82. doi: 10.1037//0022-3514.51.6.1173 3806354 · doi ↗ · pubmed ↗
2Chen Y, Zhou C, Zhang X, Chen M, Wang M, Zhang L, et al. Construction of a novel radioresistance-related signature for prediction of prognosis, immune microenvironment and anti-tumour drug sensitivity in non-small cell lung cancer. Ann Med. 2025;57(1):2447930. doi: 10.1080/07853890.2024.2447930 39797413 PMC 11727174 · doi ↗ · pubmed ↗
3Dai JY, Stanford JL, Le Blanc M. A multiple-testing procedure for high-dimensional mediation hypotheses. J Am Stat Assoc. 2022;117(537):198–213. doi: 10.1080/01621459.2020.1765785 35400115 PMC 8991388 · doi ↗ · pubmed ↗
4Dhar GA, Saha S, Mitra P, Nag Chaudhuri R. DNA methylation and regulation of gene expression: Guardian of our health. Nucleus (Calcutta). 2021;64(3):259–70. doi: 10.1007/s 13237-021-00367-y 34421129 PMC 8366481 · doi ↗ · pubmed ↗
5Ding J, Zhu X. Amdp: an adaptive detection procedure for false discovery rate control in high-dimensional mediation analysis. Advances in Neural Information Processing Systems. 2024;36.
6Nguyen HH, Takata R, Akamatsu S, Shigemizu D, Tsunoda T, Furihata M, et al. IRX 4 at 5p 15 suppresses prostate cancer growth through the interaction with vitamin D receptor, conferring prostate cancer susceptibility. Hum Mol Genet. 2012;21(9):2076–85. doi: 10.1093/hmg/dds 025 22323358 · doi ↗ · pubmed ↗
7Harris SE, Riggio V, Evenden L, Gilchrist T, Mc Cafferty S, Murphy L, et al. Age-related gene expression changes,, transcriptome wide association study of physical, cognitive aging traits and in the Lothian Birth Cohort 1936. Aging (Albany NY). 2017;9(12):2489–503. doi: 10.18632/aging.101333 29207374 PMC 5764388 · doi ↗ · pubmed ↗
8Lange T, Hansen JV. Direct and indirect effects in a survival context. Epidemiology. 2011;22(4):575–81. doi: 10.1097/EDE.0b 013e 31821 c 680c 21552129 · doi ↗ · pubmed ↗