A powerful framework for differential co-expression analysis of general risk factors
Andrew J Bass, David J Cutler, Michael P Epstein

TL;DR
This paper introduces a new method called KDCA to better detect gene expression changes linked to various disease risk factors.
Contribution
KDCA is a novel framework for differential co-expression analysis that works with all types of risk factors and reduces bias.
Findings
KDCA controls type I error and increases power over existing methods in simulations.
Applied to thyroid cancer data, KDCA found pathways related to age and BRAF mutations missed by other methods.
Abstract
Differential co-expression analysis (DCA) aims to identify genes in a pathway whose shared expression depends on a risk factor. While DCA provides insights into the biological activity of diseases, existing methods are limited to categorical risk factors and/or suffer from bias due to batch and variance-specific effects. We propose a new framework, Kernel-based DCA (KDCA), that harnesses correlation patterns between genes in a pathway to detect differential co-expression arising from general (i.e. continuous, discrete, or categorical) risk factors. Using various simulated pathway architectures, we find that KDCA accounts for common sources of bias to control the type I error rate while substantially increasing the power compared to the standard eigengene approach. We then applied KDCA to The Cancer Genome Atlas thyroid data set and found several differentially co-expressed pathways by…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3| Risk factor | Pathway | Tested size | KDCA | Eigengene |
|---|---|---|---|---|
| Age of diagnosis | IL2RB | 20 |
|
|
| RAS | 6 |
|
| |
| CTCF | 11 |
|
| |
| CDMAC | 6 |
|
| |
| GLEEVEC | 10 |
|
| |
| TFF | 12 |
|
| |
|
| PRION | 7 |
|
|
| EDG1 | 10 |
|
| |
| ECM | 12 |
|
|
- —National Institutes of Health10.13039/100000002
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBioinformatics and Genomic Networks · Gene expression and cancer classification · Biotin and Related Studies
1 Introduction
A transcriptomic study measures the expression of thousands of genes simultaneously to help understand the molecular basis of diseases. In such studies, analyzing genes in the context of their associated pathways can reveal the underlying biological activity causing disease. In particular, a pathway that contributes to disease will co-regulate in normal conditions but, in the presence of an environmental or genetic risk factor, will result in a homeostatic breakdown. Such dysregulated pathways relate to the concept of decanalization (Gibson 2009, Lea et al. 2019), where pathway co-regulation evolves over many generations under stabilizing selection but is then disrupted by recent risk factors. Thus, discovering genes in a pathway whose shared expression (i.e. co-expression) differs across a risk factor can help uncover underlying biological mechanisms that induce dysregulation and cause disease susceptibility. As a result, differential co-expression analysis (DCA) of pathways has proven valuable in studies of Alzheimer’s disease (Ray and Zhang 2010, Seyfried et al. 2017), cancer (Grechkin et al. 2016, Wu et al. 2016), and type 2 diabetes (Murphy et al. 2008).
While identifying dysregulated pathways provides important biological insights, there are several inherent limitations to existing methods that limit their practical utility. In particular, there are methods to test for differential co-expression by assessing whether the correlation between pairs of genes in a pathway vary as a function of the risk factor [e.g. DiffCorr (Fukushima 2013), DGCA (McKenzie et al. 2016), DiffCoEx (Tesson et al. 2010), and GSCA (Choi and Kendziorski 2009)]. However, such approaches are unable to analyze continuous risk factors (e.g. age of diagnosis, body mass index, and/or antigen levels) as well as multiple risk factors simultaneously. These methods also have difficulty accounting for batch and variance-specific effects which can increase the number of false discoveries in a study.
Lea et al. (2019) developed a method, CILP, that addresses some of these limitations but only considered a pair of genes for DCA rather than an entire pathway. While CILP could theoretically be applied to all pairwise combinations of genes in a pathway, it does not leverage shared information across genes and requires a multiple testing correction. These factors reduce the power to detect differentially co-expressed pathways. An alternative method that leverages shared information across genes is the standard eigengene approach (Langfelder and Horvath 2007), which tests the top principal component of an appropriately adjusted gene expression correlation matrix. However, there is no reason to assume a priori that the top principal component captures the desired biological signal, as it depends on many factors, including signal sparsity, pathway size, and correlation structure (Aschard et al. 2014, Bass et al. 2024). As such, ignoring the lower-variance components can lead to power loss and the failure to detect a signal. Collectively, these limitations have restricted the application of DCA in genomic studies.
We develop a flexible kernel framework to test for differential co-expression in a pathway. Our proposed method, Kernel-based DCA (KDCA), is motivated by the observation that pathways are differentially co-expressed when the correlation patterns between genes are associated with a risk factor. KDCA harnesses such correlation patterns across all genes in a pathway while handling pathways of arbitrary size, multivariate sets of general (i.e. continuous, discrete, or categorical) risk factors, and sources of bias due to batch and variance-specific effects. Compared to the eigengene approach, KDCA does not ignore lower-variance components and leverages all information in a pathway to increase power when testing for differential co-expression.
We evaluate KDCA using comprehensive simulations that demonstrate type I error rate control under the presence of bias-inducing sources. We then show that, depending on the pathway architecture, KDCA can substantially increase power compared to the eigengene approach. Finally, we apply KDCA to The Cancer Genome Atlas (TCGA) thyroid data set (Agrawal et al. 2014) and discover differentially co-expressed pathways with respect to age of diagnosis and BRAF mutation status that were undetected by the eigengene approach. Overall, our results indicate that KDCA can flexibly test a pathway for differential co-expression using one or many risk factors while accounting for common sources of bias.
2 Materials and methods
2.1 Overview
Differential co-expression is a phenomenon that occurs when a risk factor influences the correlation between two or more genes (Fig. 1a; Supplementary Methods, available as supplementary data at Bioinformatics online). Intuitively, the existence of such differential correlation patterns stems from unobserved (or latent) co-expression involving the risk factor. For example, while the presence of observable factors (e.g. age or sex) can influence expression levels in a pathway, so can other latent factors that are difficult to identify and measure such as substrate abundance. The interaction between the risk factor and the latent factor(s) creates complex non-linear behavior that can be overlooked by only considering the main effect, but can be observed by studying the correlation patterns between genes in a pathway.
Overview of the KDCA framework. (a) The general gene expression model for a pathway with latent co-expression. (b) KDCA first calculates standardized residuals while accounting for the mean and variance effects from the risk factor, covariates, and batch effects. Kernel matrices are calculated using the risk factor(s) and estimates of the IGC to construct the test statistic. A permutation algorithm is then implemented to approximate the null distribution and construct empirical P-values while accounting for mean and variance effects.
KDCA is based on our earlier method LIT (Bass et al. 2024), which leveraged differences in cross-trait correlation patterns to identify single-nucleotide polymorphisms with latent interactive effects in biobank-scale genome-wide association studies. In the genomics setting, KDCA instead assesses differential co-expression by relating an individual’s risk factor(s) to individual-specific gene correlations (IGC) between genes in a pathway (Fig. 1b). We first estimate IGCs between pairs of genes by multiplying the residuals of different pairs of genes together, after adjusting for covariates, batch effects, and variance effects. We then test whether the elements of a matrix comprised of pairwise similarity of IGC terms is independent of the elements of a second matrix comprised of pairwise risk factor(s) similarity. The similarity between variables is measured with a user-specified kernel function, such as a linear kernel (analogous to scaled covariance), a projection kernel, or a Gaussian kernel. Since the optimal choice depends on pathway architecture (shown below), KDCA integrates multiple kernels to maximize discovery for general risk factors.
The KDCA framework differs from the LIT framework in two important ways. Firstly, LIT was developed for biobank-scale data sets rather than the modest sample sizes of gene expression data sets. Consequently, the resulting LIT test statistics rely on asymptotic theory that does not hold for these smaller data sets. In this work, we develop a novel permutation algorithm to calculate valid P-values for data sets of modest sample size. Secondly, LIT detected any changes in covariance from the risk factor, including from variance-specific effects. As variance-specific effects are not of interest in DCA, KDCA constructs the gene-by-gene correlations to avoid variance-specific false discoveries.
2.2 KDCA framework
Consider a pathway with expression value for individuals and genes. We assume that the expression values are influenced by a risk factor of interest, a set of measured covariates and batch effects, and a set of unobserved (or latent) factors (Fig. 1a). More formally, let be a vector containing a general (i.e. continuous, discrete, or categorical) risk factor, be a matrix of measured covariates, be a matrix of batch effects, and be a matrix of latent (unobserved) factors. For simplicity, we present our algorithm with respect to a single risk factor, but the framework can easily be extended to handle multiple risk factors (see simulation results). We assume that the expression matrix, , is approximately normally distributed: for RNA-seq data, this can be log counts per million (logCPM) or the logarithm of the expression value while adjusting for library size as a fixed effect. Our primary objective is to assess whether the risk factor interacts with the latent factors to influence expression in a pathway.
A risk factor that causes dysregulation in a pathway can be detected using estimates of the IGC. The gene-by-gene correlation estimates for each individual (i.e. the IGC) are constructed while accounting for the mean and variance effects from the risk factor, covariates, and batch effects. Let the standardized residuals be denoted by , where are the residuals adjusted (via least squares or double GLM) for the mean effects from the risk factor, covariates, and batch effects, and is the estimated conditional variance. Note that the subscript “ ” is a placeholder to represent all of the columns (or rows) of a matrix. The conditional variance can be estimated within each group (for a categorical risk factor) or modeled with a double GLM (for a continuous or discrete risk factor) (Smyth 1989). The variance effects of other variables can also be included within the double GLM procedure. Importantly, standardizing the residuals prevents false discoveries occurring from a risk factor with only variance effects. The pathways IGC terms are estimated by taking the cross products of the standardized residuals, i.e. .
With the cross products and risk factor, our inference goal can be summarized as follows. Under the null hypothesis of no differential co-expression, the estimates of the IGCs are independent of the risk factor:
where is the cross product of the ith pair of genes and “ ” denotes statistical independence. Intuitively, if the cross products are not independent of the risk factor, then it implies that there is some latent factor(s) that interacts with the risk factor to influence expression. While a promising strategy is to test each cross product separately (Lea et al. 2019), such an approach does not leverage shared information across tests and suffers a power loss by correcting for multiple testing.
To address these shortcomings, we implement a kernel distance covariance framework to test for differential co-expression in a pathway (Gretton et al. 2007, Székely et al. 2007, Zhang et al. 2011). The KDCA framework uses a kernel-based independence test called the Hilbert–Schmidt independence criterion (HSIC) (Gretton et al. 2007). The HSIC generalizes well-known testing procedures in statistics [e.g. the RV coefficient (Josse and Holmes 2016), distance covariance (Sejdinovic et al. 2013), and multivariate distance matrix regression (MDMR) (Hua and Ghosh 2015)] and has been used extensively in genomics (Kwee et al. 2008, Wu et al. 2010, 2011, Broadaway et al. 2016, Bass et al. 2024). We summarize the KDCA framework for general gene expression studies below.
2.2.1 Step 1: Constructing kernel matrices
The HSIC constructs a test statistic that measures the amount of shared signal between a similarity matrix for the risk factor and a similarity matrix for the cross products. To calculate a similarity matrix, a kernel function is specified to measure the similarity between the jth and *j′*th individual. There are many options for the kernel function, such as the linear kernel (equivalent to a scaled covariance) , the projection kernel , where is the left singular vectors of , and the Gaussian kernel , where is a tuning parameter (Schaid 2010, Broadaway et al. 2016). The optimal kernel depends on the structure (or architecture) of the pathway. For example, the Gaussian kernel is suitable for complex non-linear co-expression patterns while the linear and projection kernels are more suitable for linear co-expression patterns. However, even when linear relationships are expected, the choice between the linear and projection kernels is not clear [see Bass et al. (2024) for a thorough investigation]. Since the optimal kernel choice is unknown a priori in our setting, we evaluate the linear kernel, the projection kernel, and the Gaussian kernel for the cross products. For the risk factor, we only consider the linear kernel . We denote the similarity matrix of the cross products and risk factor as and , respectively. In the Gaussian kernel implementation, we set as it performed well across all simulation settings.
2.2.2 Step 2: Test statistic and inference
With the similarity matrices and , the HSIC test statistic measures the “overlap” between the two matrices as
where small values of T suggest independent matrices (no differential co-expression) and large values of T suggest that the pairwise elements of the matrices are dependent (differential co-expression). Under the null hypothesis, the HSIC test statistic follows a weighted sum of Chi-squared random variables, i.e. , where and are the ordered non-zero eigenvalues of the respective matrices and . For biobanks or other studies with large sample sizes, we can use a Gamma distribution (Gretton et al. 2007) or Davies’ method (Davies 1980) to approximate the asymptotic null distribution in a computationally efficient manner. However, there are not many expression studies that have sample sizes sufficiently large for this approximation to hold. Although there exists a small sample size approximation for a related test statistic that estimates the null distribution by matching the moments of the permutation statistics to a Pearson type III distribution (Zhan et al. 2017), the underlying assumptions do not hold in our setting since we remove the mean and variance effects of the risk factor from the gene expression values (shown in simulations).
We instead develop a permutation algorithm to estimate the empirical null distribution of the HSIC test statistic. For permutations, let denote a matrix that permutes the rows of the standardized residuals, i.e. is a permuted set of residuals that removes any differential co-expression but conserves the gene-by-gene correlations. Intuitively, since the permutation shuffles the expression values for each gene in the same order (row-wise), the correlation between genes does not change. Instead, the row-wise permutation removes any associations between the risk factor and the latent factors (i.e. co-expression) to generate a pathway without differential co-expression. Our permutation algorithm is summarized as follows.
For permutations, generate the pathway under the null hypothesis:
where .
Calculate the cross products and construct the corresponding kernel matrix . The null statistic is calculated as for .Using the observed statistic T and null statistic , calculate the empirical P-value according to
An important feature of the above algorithm is that the mean effects from the risk factor, covariates, and batch effects are preserved when generating the null pathway. Furthermore, we account for the influence of the risk factor on expression variance to avoid false discoveries (i.e. not differentially co-expressed) from factors with strong variance effects. While we emphasize a single risk factor here, the algorithm can be extended to incorporate multiple risk factors (whether continuous, discrete, categorical, or a mixture of all three), and also adjust for variance-specific effects from such factors.
2.3 Combining information across kernels to maximize power
KDCA requires choosing a kernel function for the cross products and risk factor. However, it is unclear which kernel is optimal a priori given that they capture different measures of co-expression similarity. Since one kernel may outperform the others, we consider an aggregate implementation to maximize the statistical power.
We first calculate the observed test statistic using the linear, projection, and Gaussian kernels (other kernels could be added). We then implement the permutation algorithm such that the null statistics and corresponding empirical P-values are calculated separately for each kernel. Calculating the empirical P-values before combining the kernels standardizes the null distributions of each kernel. We then construct a test statistic that combines the empirical P-values using Fisher’s method, i.e. , where is the empirical P-value for the dth kernel. To estimate the null distribution of this statistic, we combine the empirical P-values in permutation b as . Finally, an empirical P-value for the observed statistic is calculated using the combined null statistics .
By using the empirical null distribution of the combined statistic to construct P-values, our procedure leverages information across kernels while also being robust to any dependence between kernels. It is important to note that any P-value combination approach can be used, although we found Fisher’s method performed well in practice.
2.4 Simulation study
We evaluated the performance of KDCA using simulated data. We considered a categorical and continuous risk factor: the categorical variable was generated as and the continuous variable was generated as . For the pathway, we assumed the risk factor had the following effect on the mean and variance of each gene, and on the covariance between each pair of genes:
where is the intercept value; is an observed covariate; is the baseline correlation between each pair of genes; is the variance-specific effect size of the risk factor; and is the correlation-specific effect size of the risk factor.
We considered various configurations of differential co-expression due to the risk factor. We first set the sample size to be and pathway size to be or 50 genes. Note that the average pathway size of the BioCarta pathway database (Nishimura 2001) in our applied example is 8.3, and so reflects larger pathways. We then varied the proportion of genes in a pathway with no differential co-expression as 0, 0.20, 0.40, 0.60, and 0.80. This represents different sparsity levels of differential co-expression. We also considered a more complex pathway correlation pattern where the direction of the effect sizes was randomly assigned to be positive or negative (i.e. ). In total, we simulated 200 replicates at each configuration and the empirical power was calculated at a significance threshold of 0.05.
To evaluate type I error rate control under the null hypothesis, we considered two different settings. The first setting evaluated the null in the above simulations by assigning for all pathways. We also evaluated the performance of KDCA to model the variance-specific effects under the null by varying the effect size as where and 0.5. Note that we modeled the variance-specific effect within each group for the categorical variable and using a double GLM for the continuous variable. The second setting simulated 10 000 genes from a negative binomial distribution with the following parameters:
where is the baseline count size for the kth gene; is the effect size of the risk factor; is the effect size of the covariate; and is an interaction with latent factor and effect size . We scaled the counts to reflect varying library sizes where the total counts for each sample was randomly chosen to be or . The dispersion parameter of the negative binomial distribution was assigned to be 0.25 for each gene in order to reflect human data sets. We applied a log2 transformation to the counts with an offset of 2 and treated the estimated (log-transformed) library size as a fixed covariate. We note that the latent interaction is not shared across genes in the pathway and thus induces a variance effect from the risk factor when the data are transformed. In both settings, we simulated 20 data sets with 1000 pathways of size or 50. The type I error rate of KDCA was evaluated at a threshold of 0.05 with B = 1000 permutations.
To assess the performance of KDCA with multiple risk factors, we simulated risk factors as for factors. The expression values were generated by incorporating the risk factors into the above model as . We then applied KDCA to the multiple risk factors to assess the performance under a multivariate setting.
2.5 TCGA study
TCGA is a large-scale genomic program to characterize various cancer types in humans. For our analysis, we used the RNA-seq data from thyroid carcinoma (THCA) samples downloaded from recount3 (Wilks et al. 2021). We tested for evidence of differential co-expression using BRAF mutation status [extracted from TCGAbiolinks (Colaprico et al. 2016)] and age of diagnosis. The BRAF mutation is a prognostic marker for papillary thyroid cancer (PTC) and associated with overall worse prognosis (Tufano et al. 2012), while a patient’s age of diagnosis is a known risk factor for many cancers as mortality increases with age (de Magalhães 2013, Chatsirisupachai et al. 2021). We restricted our analysis to primary tumor samples and removed any formalin-fixed paraffin-embedded (FFPE) samples. We then removed non-protein coding genes and any genes with log2 counts per million (logCPM) values less than 1 in 50% of the samples. We also ensured that there were no overlapping genes in the data set. After filtering, we selected 5000 genes with the largest variation (logCPM) across 494 samples.
Following previous recommendations (Parsana et al. 2019), we estimated latent batch effects using the log2-transformed counts with an offset of 2 and found 28 latent variables in our data set (treated as adjustment variables). This is a conservative approach to differential co-expression testing as it may remove the biological signal of interest. However, it was previously shown to outperform models that explicitly modeled sources of confounding (e.g. RIN, exonic rate, or GC bias) in gene network reconstruction (Parsana et al. 2019), suggesting that there are diverse sources of confounding in expression data that are challenging to account for. We then used the canonical BioCarta pathways (Nishimura 2001) from the Molecular Signatures Database (Subramanian et al. 2005) where pathway sizes less than 5 were removed. In total, there were 162 pathways with an average tested set size is 8.3 (Fig. S1, available as supplementary data at Bioinformatics online). Finally, we implemented KDCA and the eigengene approach using 10 000 permutations.
3 Results
3.1 KDCA controls the type I error rate
We performed comprehensive simulations to assess the type I error rate of KDCA (Section 2.4). In particular, we simulated pathways of size 10 and 50 with a sample size of 300 under two settings. The first setting generated pathways assuming the data followed a multivariate normal data while the second setting generated pathways assuming the data followed a negative binomial distribution. In the latter setting, we applied a log2 transformation to the counts and treated library size as a fixed effect. In both cases, the continuous and categorical risk factor impacted the mean and variance of the expression values.
We find that our permutation algorithm provides type I error rate control in either the normal or negative binomial simulation cases, regardless of the kernel choice or risk factor type (Figs S2 and S3, available as supplementary data at Bioinformatics online). Additionally, our implementation that combines information across the different kernels also provided type I error rate control. Furthermore, our results show that adjusting for the variance-specific effect of the primary variable provides well-behaved P-values under the null hypothesis for the continuous and categorical variables (Fig. S4, available as supplementary data at Bioinformatics online). Alternatively, as expected, not adjusting for the variance effects can lead to an inflated type I error rate. Finally, when there are multiple risk factors included in the model, we find that all of the KDCA implementations control the type I error rate (Fig. S5, available as supplementary data at Bioinformatics online). Thus, our permutation approach, which models the mean and variance effects from the risk factor, controls the type I error rate in the presence of variance-specific effects for continuous or categorical risk factors.
The empirical power of KDCA using the linear (black), Gaussian (green), and projection (blue) kernels when the primary variable is (a) continuous and (b) categorical. We also compared an implementation of KDCA that maximized power across all kernels (combined; red) to a standard eigengene (grey) approach. Our simulation study varied the pathway size (columns) and the type of differential co-expression (rows). Each point is the empirical power from 200 simulations at a significance threshold of 0.05.
The number of detections as a function of FDR threshold using the BioCarta pathway database in the TCGA thyroid cancer data set. We compared an aggregate version of KDCA (red) to a standard eigengene approach (black) when the primary variable was age of diagnosis (left) and BRAF mutation status (right). There were a total of 162 pathways tested for differential co-expression.
We also evaluated an approximation of the theoretical null distribution for a related test statistic (Zhan et al. 2017) in the normal setting. For categorical risk factors, we find that the type I error rate is very conservative for small and large pathway sizes (Fig. S6, available as supplementary data at Bioinformatics online). Alternatively, for continuous risk factors, the type I error rate is not controlled. Note that the mean and variance effects of the risk factor are adjusted from the genes before kernel construction. This violates a key assumption of the approximation, i.e. permuting one kernel does not impact the other kernel under the null. While there is a large sample size approximation to the KDCA test statistic (Davies 1980), most expression studies do not have large enough sample sizes for this approximation to hold. As such, we find that our permutation-based approach for calculating valid P-values is required in this setting, even though it increases the computational time (see Fig. S7, available as supplementary data at Bioinformatics online, for computational time comparisons).
3.2 The optimal kernel choice depends on pathway architecture
We assessed the power of KDCA in our simulation study using various kernel choices. We varied the proportion of genes that were differentially co-expressed in a pathway in the categorical and continuous risk factor settings. We considered a “Positive” scenario where the effect size direction of the latent co-expression was the same and a “Positive/Negative” scenario where the direction was randomly positive or negative. We applied KDCA using a linear kernel, a Gaussian kernel, a projection kernel, and an implementation that aggregates the statistics using Fisher’s method. Importantly, the first three KDCA implementations are different measures to capture the signal while the last attempts to maximize the power by leveraging information across the kernels.
Our simulation results suggest that the best kernel choice depends on the pathway size, sparsity of the signal, and correlation structure (i.e. “Positive” or “Positive/Negative”; Fig. 2). In particular, for a continuous risk factor, we find that the Gaussian and linear kernels outperform the projection kernel (Fig. 2a). However, for a categorical risk factor, the projection kernel outperforms the Gaussian and linear kernel in the “Positive/Negative” setting with a pathway size of 10 (Fig. 2b).
Since the best performing kernel is unclear, we implemented the KDCA combined approach that aggregates information across the three kernels to maximize statistical power. We find that KDCA combined provides the best performance for maximizing power across the various settings. We also found similar results when considering multiple risk factors (Fig. S8, available as supplementary data at Bioinformatics online).
3.3 KDCA improves power compared to the eigengene approach
We compared KDCA to the standard eigengene approach that tests the top eigenvector of the cross product matrix (adjusting for the mean and variance effects of the primary variable). To construct an empirical P-value, we apply the same permutation algorithm as KDCA. In general, the linear kernel outperforms the eigengene approach because it aggregates information from all pairwise correlations of genes within a pathway (Fig. 2 and Fig. S8, available as supplementary data at Bioinformatics online). Similarly, the Gaussian kernel tends to outperform the eigengene approach. Finally, while the projection kernel outperforms the eigengene method in the “Positive/Negative” setting, it underperforms in the “Positive” setting.
Our findings can be interpreted by considering the singular value decomposition of the cross product linear kernel matrix. In particular, the eigengene approach is equivalent to testing the top eigenvector of the linear kernel matrix. Therefore, when the eigengene approach performs comparably to KDCA with a linear kernel, it suggests that the signal of interest is captured by the top eigenvector. Alternatively, when the linear kernel outperforms the eigengene approach, it suggests that the signal is captured at the lower eigenvectors. For example, in the “Positive/Negative” setting where the signal is captured by lower eigenvectors, KDCA with a linear kernel can substantially outperform the eigengene approach. While the power of the eigengene approach improves in the “Positive” setting where the signal is expected to be captured by the top eigenvector, it is still lower than KDCA with a linear kernel. This suggests that the signal is also captured by the lower eigenvectors. Note that, when the signal is only captured by a small number of eigenvectors, the projection kernel can have the lowest power because it treats the eigenvectors in the linear kernel matrix equally.
3.4 KDCA reveals differentially co-expressed pathways in TCGA analysis
Thyroid cancer is a common cancer with an incidence rate that has been increasing over the past few decades (Jung et al. 2014). While the overall survival rate is very promising with early treatment, the cancer is heterogeneous and certain subtypes can be lethal (Agrawal et al. 2014). To help elucidate biological variation not captured by traditional differential expression analysis, we applied the KDCA framework to TCGA thyroid cancer data set. Our analysis focused on papillary thyroid carcinomas (PTCs) samples with two risk factors: BRAF mutation status (a categorical risk factor) and age of diagnosis (a continuous risk factor).
We find that KDCA combined is substantially more powerful than the eigengene approach, in line with our simulation study (Fig. 3; Table S1, available as supplementary data at Bioinformatics online). In particular, at an FDR threshold of 0.15, the eigengene approach does not detect any differentially co-expressed pathways while KDCA combined detects three and six pathways for BRAF mutation status and age of diagnosis, respectively. The individual kernel implementations also outperform the eigengene approach (Fig. S9, available as supplementary data at Bioinformatics online), demonstrating that KDCA benefits from aggregating all the information from the cross product matrix.
When focusing on the discoveries from KDCA combined at an FDR threshold of 0.15 (Table 1), we find that the differentially co-expressed pathways are involved in cellular proliferation, the immune microenvironment, and/or are known prognostic markers. In particular, when the risk factor is age of diagnosis, the IL2RB, RAS, and TFF pathways are differentially co-expressed. The IL2RB pathway is a set of genes involved in immune-related signaling for cellular activation and growth. The RAS signaling pathway is involved in the immune microenvironment and the RAS mutation is a prognostic marker (Xing 2016). Finally, the TFF pathway promotes cellular proliferation and can be a key component in the MAPK signaling pathway (Zhang et al. 2019). Importantly, immune-related pathways are expected as inflammation is a known hallmark of aging and can have tumor promoting effects (de Magalhães 2013).
When the risk factor is BRAF mutation status, we find EDG1, PRION, and ECM pathways are differentially co-expressed. These pathways are known to affect cell growth and survival in cancer (Liu et al. 2000, Popova and Jücker 2022, Abi Nahed et al. 2023). In addition, we find that MAPK1, PIK3CA, PIK3CG, and PIK3R1 are the top four most occurring genes when the risk factor is BRAF mutation status, while for age of diagnosis the top four are HRAS, PIK3CA, PIK3CG, and PIK3R1. The MAPK1, HRAS, and PIK3CA genes are associated with distinct thyroid cancer subtypes (Xing 2013), suggesting the intriguing hypothesis that the unobserved latent factors are germline or somatic changes in these genes themselves. Overall, we find distinct co-expression patterns among pathways with respect to BRAF mutation status and age of diagnosis, indicating that the cancer samples are functionally heterogeneous.
4 Discussion
Our proposed kernel-based framework, KDCA, harnesses correlation patterns in a pathway to test for differential co-expression. KDCA can be applied to one or many general risk factors while accounting for covariates, batch effects, and variance-specific effects. We found that the optimal kernel choice for KDCA depends on the pathway size, correlation structure, and sparsity level. This demonstrates the complexity of detecting differential co-expression and agrees with our previous work which found the optimal kernel can also vary based on other features, such as sample size, signal-to-noise ratio at each principal component, and baseline correlation between genes [see Bass et al. (2024) for a thorough investigation]. Since these factors are unknown a priori, we developed an implementation to maximize the power by aggregating information across kernels.
One important feature of KDCA is that it helps prevent false discoveries from variance-specific effects of the risk factor. In particular, our permutation algorithm models the mean and variance effects for general risk factors to control the type I error rate. While KDCA can handle continuous and discrete risk factors, it requires careful specification of how the factor influences the variance in the double GLM. Otherwise, it is possible that variance-specific effects from the risk factor can lead to false discoveries. More generally, we caution against rank-based transformations of expression data as these can induce variance and covariance-specific effects and lead to false discoveries.
There are a few of important observations to consider when applying the KDCA framework. First, pathways should be carefully selected in order to avoid a multiple testing problem. In this work, we focused on the BioCarta pathway database and a subset of the 5000 most varying genes. Second, unmeasured batch effects impact the covariance between genes in a pathway, leading to false discoveries when such effects are associated with the risk factor. While a method has been proposed to capture batch effects for specific types of pathway networks (Parsana et al. 2019), such an approach can also remove the biological signal of interest. In general, accounting for latent confounding variables is a very difficult problem for differential co-expression methods. As such, even though KDCA provides a framework to incorporate batch effects, false discoveries due to latent non-biological sources of variation cannot be ruled out. Third, the minimum number of permutations to control the type I error rate is , where R is the number of hypotheses and is the significance threshold. We recommend performing substantially more permutations than this minimum value to achieve adequate statistical power in KDCA. Finally, the computational time of KDCA increases as the pathway size, sample size, or the number of permutations increases (Fig. S7, available as supplementary data at Bioinformatics online). If running a large number of pathways with a sample sizes much larger than considered here ( ), distributing pathways across multiple cores will increase computational efficiency. In such cases, stopping the algorithm early for pathways with little evidence of differential co-expression (e.g. pathways with an empirical P-value larger than 0.1 using 1000 permutations) may also help to reduce the computational time. Notably, for biobank-scale data sets, the closed-form approximation of the null distribution can be used instead of the permutation algorithm to substantially decrease the computational time (Davies 1980, Wu et al. 2011, Broadaway et al. 2016).
DCA provides biological insights into expression dysregulation of a pathway. However, there are many methodological challenges that have limited the utility of such approaches, such as testing continuous or discrete risk factors, avoiding false discoveries from variance-specific effects, and modeling batch effects. We developed the KDCA framework to help address these limitations, thereby providing a powerful framework to identify differentially co-expressed pathways in expression studies.
Supplementary Material
btaf565_Supplementary_Data
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Abi Nahed R , Safwan-Zaiter H, Gemy K et al The multifaceted functions of prion protein (PRPC) in cancer. Cancers (Basel) 2023;15:4982.37894349 10.3390/cancers 15204982 PMC 10605613 · doi ↗ · pubmed ↗
- 2Agrawal N , Akbani R, Aksoy BA et al Integrated genomic characterization of papillary thyroid carcinoma. Cell 2014;159:676–90.25417114 10.1016/j.cell.2014.09.050PMC 4243044 · doi ↗ · pubmed ↗
- 3Aschard H , Vilhjálmsson BJ, Greliche N et al Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. Am J Hum Genet 2014;94:662–76.24746957 10.1016/j.ajhg.2014.03.016PMC 4067564 · doi ↗ · pubmed ↗
- 4Bass AJ , Bian S, Wingo AP et al Identifying latent genetic interactions in genome-wide association studies using multiple traits. Genome Med 2024;16:62.38664839 10.1186/s 13073-024-01329-0PMC 11044415 · doi ↗ · pubmed ↗
- 5Broadaway KA , Cutler DJ, Duncan R et al A statistical approach for testing cross-phenotype effects of rare variants. Am J Hum Genet 2016;98:525–40.26942286 10.1016/j.ajhg.2016.01.017PMC 4800053 · doi ↗ · pubmed ↗
- 6Chatsirisupachai K , Lesluyes T, Paraoan L et al An integrative analysis of the age-associated multi-omic landscape across cancers. Nat Commun 2021;12:2345.33879792 10.1038/s 41467-021-22560-y PMC 8058097 · doi ↗ · pubmed ↗
- 7Choi Y , Kendziorski C. Statistical methods for gene set co-expression analysis. Bioinformatics 2009;25:2780–6.19689953 10.1093/bioinformatics/btp 502PMC 2781749 · doi ↗ · pubmed ↗
- 8Colaprico A , Silva TC, Olsen C et al TCG Abiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res 2016;44:e 71.26704973 10.1093/nar/gkv 1507 PMC 4856967 · doi ↗ · pubmed ↗
