A parametric bootstrap approach for computing confidence intervals for genetic correlations with application to genetically determined protein-protein networks
Yi-Ting Tsai, Yana Hrytsenko, Michael Elgart, Usman A. Tahir, Zsu-Zsu Chen, James G. Wilson, Robert E. Gerszten, Tamar Sofer

TL;DR
The paper introduces a new method to compute confidence intervals for genetic correlations, especially useful for large-scale studies of protein-protein interactions.
Contribution
A parametric bootstrap procedure is proposed to improve confidence interval estimation for genetic correlations in large omics datasets.
Findings
The parametric bootstrap method outperforms asymptotic approaches in small samples or at parameter boundaries.
The method is demonstrated on a proteomics dataset with tens of thousands of protein pairs.
The procedure uses Haseman-Elston regression and kinship matrices to estimate genetic correlations.
Abstract
Genetic correlation refers to the correlation between genetic determinants of a pair of traits. When using individual-level data, it is typically estimated based on a bivariate model specification where the correlation between the two variables is identifiable and can be estimated from a covariance model that incorporates the genetic relationship between individuals, e.g., using a pre-specified kinship matrix. Inference relying on asymptotic normality of the genetic correlation parameter estimates may be inaccurate when the sample size is low, when the genetic correlation is close to the boundary of the parameter space, and when the heritability of at least one of the traits is low. We address this problem by developing a parametric bootstrap procedure to construct confidence intervals for genetic correlation estimates. The procedure simulates paired traits under a range of heritability…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenetic Mapping and Diversity in Plants and Animals · Genetic Associations and Epidemiology · Bioinformatics and Genomic Networks
Introduction
Genetic correlation measures the relationship between a pair of traits through their shared genetic variability.1 It is a related concept to heritability, which measures the overall genetic contribution to a trait.2 Specifically, genetic correlation is defined as the correlation between the genetic effects of two traits. It can be estimated using individual-level data or using summary statistics from genome-wide association studies (GWASs).3 Scientific papers studying the genetic architecture of health and behavioral phenotypes now routinely report genetic correlation estimates between phenotypes, sometimes as a step preceding follow-up analysis, e.g., with polygenic risk scores or Mendelian randomization analyses.4^,^5^,^6^,^7 Genetic correlations are further being studied at the local genomic region level (local genetic correlations), or stratified by genetic annotations, to localize sources of shared genetic underpinnings of phenotypes.8^,^9^,^10^,^11^,^12
Methods for estimating heritability and genetic correlations based on summary statistics from GWASs3^,^13^,^14 became popular in recent years due to their computational tractability and the access to many phenotypes that were interrogated in GWASs by the research community. However, in diverse populations and in small datasets, it is still preferable to estimate heritabilities and genetic correlations using individual-level data, rather than based on GWAS summary statistics.15 Methods using individual-level data typically rely on an underlying linear mixed model (LMM) formulation, where a genetic relationship matrix is used to model the relationship, or degree of similarity, between the phenotype levels of different individuals.16^,^17 When estimating the genetic correlation between two phenotypes, a bivariate normal model is usually used. Common algorithms for estimating heritability and genetic correlation include restricted maximum likelihood (REML)-based normal likelihood models,18 and method of moment estimators such as the Haseman-Elston approach.15 Estimation of standard errors (SEs), confidence intervals (CIs), and p values, often relies on asymptotic normal approximation. However, both heritability and genetic correlations have a limited support: heritability is bounded on the [0,1] interval, and genetic correlation on the [−1,1] interval. This means that asymptotic normal approximation may not be appropriate when estimates are close to the boundary of the parameter space, and the problem is more severe with smaller datasets. Previous publications addressed the problem of CI estimation in the context of heritability,19^,^20 but, although the distribution of genetic correlations has been studied,21^,^22^,^23 methods for CI computation in the era of large-scale genomic studies have not been as developed. Notably, we previously proposed a Fisher’s transformation-based approach and a blocked bootstrap, relying on resampling from the data, by blocks of related individuals.15 The blocked bootstrap worked better than the Fisher’s transformation approach, but was computationally more intensive; we, therefore, only allowed for a small number of resamples, limiting the potential coverage of the CIs as well as application at scale (i.e., for millions of traits). Here, we build on a prior work by Schweiger et al.19 in the context of heritability. We expand their parametric bootstrap test-inversion method that eliminates the dependency on asymptotic approximation.
In this paper, we develop a parametric bootstrap approach to construct CIs for genetic correlations to better model the unknown distribution of genetic correlations. The procedure requires simulating pairs of phenotypes using the existing correlation structure between individuals in a given dataset, based on sets of values of heritabilities and genetic correlation between the phenotypes. The results from the simulation study are used to construct CIs for the genetic correlation parameter based on triplets of estimated heritabilities and the genetic correlation of a pair of phenotypes, using the conditional empirical probability mass function (PMF) of the genetic correlation parameter. We demonstrate and compare, through simulations, the performance of two variations of the parametric bootstrap procedure and further compare them with construction of CIs based on the Fisher’s transformation of the estimated genetic correlation, and estimated SEs of the correlation parameter from asymptotic normal assumption on REML estimates. Despite being a resampling method, typically requiring many computations and thus computationally costly, our approach is very useful when estimating genetic correlations between thousands of traits measured on the same dataset, because the simulation study used to construct PMFs is performed once and may be used many times. Thus, we demonstrate the application of the parametric bootstrap approach to study genetic correlations between a high-dimensional set of proteins and to develop protein-protein networks based on the genetic correlations estimated in the Jackson Heart Study (JHS) dataset.
Subjects, material, and methods
LMM formulation
Let be an phenotype outcome vector and be an matrix containing values of covariates measured on participants. Let be an vector of residuals, or errors, which we assume are potentially correlated across participants due to shared genetic effects. Suppose that the matrix models the genetic relationship between individuals, such that its entry is, for example, (twice) the kinship coefficient between individual and , and (i.e., this is a symmetric matrix). Note that genetic relationship could be estimated by various quantities,24 without loss of generality; we here assume that we use a kinship matrix using identity by descent estimates. Following standard LMM formulation of heritability, we model the outcome as
where are the regression coefficients of the covariates, here treated as nuisance parameters, is a matrix of genetic variants, and is the vector with variant effects on the outcome. In the Supplementary Note, we demonstrate how genetic effects contributing to the phenotypic variance, so that, given covariate effects, the total variance is decomposed to genetic variance and remaining residual variance. Let be the genetic variance component, and be the residual variance component, so that
The narrow-sense heritability, defined as the proportion of total variance explained by additive genetic factors is:
Here, we assume, without loss of generality, that the total variance . This is a reasonable assumption because the heritability does not depend on the value of the total variance and the outcome can be rescaled to have total variance of 1 and heritability estimates are not affected (as both and are divided by the same value). Therefore, we have , meaning that the genetic variance is equal to the heritability. Under this assumption, the variance of the phenotype can be written as
Given two vectors , their covariance can be modeled as
where is the genetic variance for phenotype , is the residual variance for phenotype , is the genetic correlation between the two phenotypes, and is the residual correlation between the two phenotypes.15 More details are provided in the Supplementary Note. Alternatively:
If we further plug in , then, for a single and for two phenotypes, we can write the variance model as:
which is the form that we will use to simulate outcomes in the following parametric bootstrap section.
Parametric bootstrap
We use a parametric bootstrap approach to compute CIs. In brief, we simulate data for each set of potential values of heritabilities and genetic and residual correlation between two phenotypes based on the existing genetic relationship between individuals in the dataset of interest. Next, we compute CIs by inferring ranges of potential values of (integrated over potential values of , as the true values are not known) that resulted in realized (estimated) values In practice, to limit computational burden, we fix (and we assess the use of also as fixed values in simulations). Figure 1 visualizes the parametric bootstrap approach for computing CIs: the first step is the parametric bootstrap itself, where simulations are performed based on set heritabilities and genetic correlation values and results from all simulations are collected and stored. The second step is the computation of CIs for a given triplet of estimated heritabilities and genetic correlation values, based on a PMF computed from the stored parametric bootstrap results.Figure 1. Parametric bootstrap approach for generating CIs for genetic correlationsThe figure visualizes the parametric bootstrap approach for computing CIs for genetic correlations. The parametric bootstrap itself simulates data from the correlation structure (modeled via the genetic relationship, or kinship, matrix), applies the analytic procedure for computing heritabilities and genetic correlations, and stores the results. The CI computation step relies on the results from the parametric bootstrap. It estimates CIs based on parameter values that may result in the observed associations.
Step 1: Random sampling of genetically correlated outcomes
For every given combination of the potential heritability of phenotype 1 ( ), potential heritability of phenotype 2 ( ), and potential genetic correlation ( ), we draw (e.g., 10,000) pairs of phenotype vectors ( ) from the multivariate normal distribution
where
where and . We note here that may take potential value in the interval , but we choose just one value, as mentioned earlier. We used 10 settings for , 10 settings for , and 20 settings for as follows:
Under this setup, there are 2,000 distinct combinations of triplets ( , ) in total. We note that while developing this procedure we compared using finer grids of values, with sequences with differences of 0.01 between each two consecutive values, but the results remained essentially the same while the computational burden was substantially higher. Because the grid size determines the accuracy level of potential CI coverage, we later offer a solution to increase coverage without simulating a finer grid of values.
Step 2: Genetic correlation and heritability estimation
Next, based on each sampled pair of phenotype vectors ( ), we estimate , . While the procedure is in principle naive to the specific formula used, we are using the closed-form Hasemen-Elson formulas we previously derived15^,^20:
where is either the kinship matrix with all diagonal values set to zero or a weighted sum of the kinship matrix and the matrix modeling the random error (here, an identity matrix) with weights related to the relationship between the kinship matrix and the identity matrix. See Elgart et al.15 for more details, including the potential use of multiple matrices modeling correlations between individuals. In practice, it is appropriate to use the kinship matrix with the diagonal value set to zero when only the kinship matrix is used to model the relationship between individuals. Using these formulas rather than likelihood-based procedures is computationally quicker, as no iterations are required.
Step 3: PMF estimation for Pr(ρk|hˆ12,hˆ22,ρˆk2)
We now derive the expression for the conditional probability of given the estimated parameters. Because the support of is continuous, where and , while the results from simulations are discrete values, we divide these ranges into bins, e.g., of size 0.1, i.e., forcing them into a discrete distribution:
When estimating CIs for genetic correlations, we are given the estimates , and we want to identify a region such that (e.g., = 95%). Therefore, we want to estimate the probabilities for to create an empirical PMF and use it to generate CIs, which can be derived using Bayes theorem. The derivation below uses the probabilities , which are the probabilities of being in given regions conditional on the fixed values of (note that the probabilities of the estimated heritabilities do not depend on the heritabilities of other traits on or the genetic correlation between them). Moving forward, we drop the notations showing that values refer to bins (regions) for brevity, with the understanding that all probabilities refer to parameters being in bins. Therefore, we will denote instead of , etc.
We first note that , and therefore, we need to estimate and
Estimating Pr(ρˆk,hˆ12,hˆ22)
We estimate based on the following:
where, for bins of length 0.1, = 20, = 10.
Estimating Pr(ρ˜k,ρˆk,hˆ12,hˆ22)
We estimate based on the following:
Putting these together:
(computed for each pre-defined region) is then the empirical PMF of obtained by parametric bootstrap.
Computing CIs from the PMF
After obtaining the empirical PMF from parametric bootstrap, we can now derive the CIs for any given genetic correlation estimate with a coverage probability of (e.g., 95%). Because the parameters are bounded, constructed CIs may be asymmetric in both the distance between the estimated to the low and high values of the CI, and in the cumulative probability between provided by the two “sides” (around ) of the CI. We address this by considering the following three cases depending on the cumulative probability:
Here, denotes cumulative probability of potential values lower or equal to the estimated Denote the low and the high values of the CI for by and . Then a 1- CI news to include all potential values such that:
Case 1: If
Here, corresponds with the first potential value (i.e., a point in the first considered bin, where bins are considered by order ), where . corresponds with the smallest potential value satisfying equation.6
Case 2: If
In this case, we first identify as the highest (i.e., a point in first considered bin, where bins are considered by order ) with . corresponds to the highest potential value satisfying equation.6
Case 3: Both and
Here we require to be the largest value and to be the lowest such that
When the upper bound or lower bound of CIs obtained from the above procedure lies somewhere inside the bins defined by the grid of considered values, which is often the case, we use linear interpolation to get a position for upper and lower bound as point within the bins.
Empirical beta approximation to the PMF for CI estimation
Because the PMF is discrete, it limits the potential coverage of constructed CIs and the potential computation of accurate p values. Thus, we study a continuous beta approximation to the empirical PMF from parametric bootstrap. The goal is to use this approach when CIs with high coverage are sought, and when low p values need to be inferred (see below). Since the range of genetic correlation is [−1, 1], and the range of beta distribution is [0, 1], we first map the [−1, 1] range of genetic correlations to the [0, 1] range of beta distribution through a location-scale transformation. After finding the 100∗( )% CIs of on the beta scale using a similar approach to that reported based on the discreate PMF, we apply the inverse location-scale transformation from [0, 1] to [−1, 1] to retrieve the CIs of genetic correlations.
p Value estimation
We evaluate the use of the CI inversion method to obtain p values for hypothesis testing. Here, our null hypothesis H_0_ is , and our alternative hypothesis H_1_ is . Given any realization ( ), we can estimate the CIs for using the parametric bootstrap procedure, focusing on the continuous beta approximation to the empirical PMF because smaller accurate p values can be obtained if the underlying distribution is continuous. To determine the p values of genetic correlation estimates, we use the CI inversion methods. Suppose that we construct a 100 CI. Then, we can determine that the p value is smaller than if the constructed CI does not cover 0. For computational efficiency, we implemented a method that constructs CIs using a binary search approach to the value, stopping when a pre-defined sensitivity level is reached. In more detail, the binary search relies on the idea of choosing and value, computing a CI, and checking whether 0 is within the interval, i.e., it relies on a function that gets an value and return “yes” or “no” (1 or 0) if the CI does or does not cover zero. The search is to find the smallest value where the CI does not cover zero.
The JHS
The JHS is a longitudinal study following 5,306 individuals of African American background from Jackson, Mississippi.25^,^26 The study population includes 2,050 related and unrelated JHS participants who had whole genome sequencing (WGS) through the Trans-Omics for Precision Medicine (TOPMed) program, proteomics data, and available body mass index (BMI). The TOPMed Data Coordinating Center used TOPMed WGS data from the TOPMed freeze 8 release and computed kinship matrix, tabulating the genetic relationship between TOPMed participants. We subsetted this matrix into JHS participants. Concentration levels of 1,317 proteins in plasma were measured using the SomaScan platform.27 The JHS study followed procedures in accordance with the ethical standards of the responsible committee. It was approved by Jackson State University, Tougaloo College, and the University of Mississippi Medical Center Institutional Review Boards, and all participants provided written informed consent.
We excluded five proteins with more than 80% missing values. The remaining dataset had no missing protein measurements. Protein measurements were adjusted for batch effect by rank-normalizing each protein separately in each batch and then aggregating the data across batches. Next, the protein measurements were regressed over (a) only the intercept (i.e., the measurements were centered, and the analysis not adjusted for covariates), and (b) over age, sex, and BMI. The residuals from each of these regressions were extracted and were used for estimating heritabilities and genetic correlations between all protein pairs using Haseman-Elston estimators provided in Equations 4 and 5. In addition, we compared the estimates of genetic correlations with estimated Pearson correlations calculated using stats R package (version 3.6.2).
Simulation studies
We used the kinship matrix from the JHS data to perform simulations. It is important to clarify that there are two separate types of simulations: first, for a given dataset (including a synthetic dataset), there are simulations that are done as part of the parametric bootstrap step (as in the first, top, step in Figure 1). Second, we perform simulation studies to evaluate the statistical methodology. For these, we generate data and use it for the second step in Figure 1: we use the parameters estimated in each simulation repetition and the parametric bootstrap to compute CIs. Thus, we used the kinship matrix from the JHS data to both perform parametric bootstrap (which includes simulations) and next to generate data that next used the parametric bootstrap results.
To study methods’ performance in larger sample sizes, we also created synthetic datasets mimicking the JHS in which we used block matrices, with blocks being the original JHS kinship matrix using n = 2,050 individuals (setting A). We used two and three times the original sample size to form block diagonal kinship matrices with n = 4,100 and n = 6,150 (settings B and C). Thus, we used these kinship matrices to (a) perform simulations for the parametric bootstrap procedure, where in the primary simulations we fix = 0.2 as a conservative potential high value of . We also performed simulations comparing the choice of . Next, (b) we generate new simulated data that used the results of the parametric bootstrap simulations (a) to construct CIs. We performed 10,000 simulations for each combination of potential , with and . We constructed CIs for the estimated in each simulation.
To further assess computational performance in larger sample sizes, we also performed limited analyses where we used the full JHS dataset available in TOPMed freeze 8 (n = 3,418), and created a block diagonal kinship matrix with 3 and 10 times this matrix, with n = 10,254 and n = 34,180 (settings D and E). In these simulations, we had = 0, and assessed CI computation performance using a small number of 1,000 simulations for each combination of potential .
Comparison: Four approaches for constructing CIs
We estimated the coverage and the width of the CIs constructed using the two proposed approaches and two existing approaches: (a) percentiles of the empirical PMF constructed using the parametric bootstrap approach (proposed); (b) beta approximation to the empirical PMF (proposed); and the existing methods: (c) Fisher’s transformation; and (d) normal approximation to the distribution of the estimated genetic correlation implemented in the genome-wide complex trait analysis (GCTA) package.28
The Fisher’s transformation method assumes that genetic correlations follow the same distributions as Pearson correlations.29 More specifically, they are normally distributed after Fisher’s transformation. For genetic correlation ,
With being the “effective sample size,” previously proposed to be , with being the kinship matrix with diagonal values set to zero.15 We can then construct the CIs of z by the standard approach, assuming asymptotic normal distributions. For example, the 95% CI of z would be [ , ]. After finding the 100∗( )% CIs of on the Fisher’s transformed ( scale, we apply the inverse Fisher’s transform to retrieve the CIs of the genetic correlation .
To compute CIs based on existing approach that rely on a normal approximation, we estimate both the genetic correlation and its SE using the bivariate REML procedure implemented in the GCTA package. We apply the [ ] formula to construct 95% CIs. Due to the high computational resources required by GCTA, we focus only on the four scenarios when true equals {0.05, 0.15, 0.45, 0.95} with the original-size kinship matrix.
Performance evaluation metrics
We used coverage probabilities and CI widths as the metrics to evaluate and compare the performance of the four approaches for CI construction. In primary results, for a given true value of genetic correlation we calculated both the coverage probability and the average width of 95% CIs using the constructed CIs for the estimated over all the 100 true heritability scenarios (10 for and 10 for ). Ideally, the coverage probabilities would be at or above 95% across different , and also having small CI widths. The coverage probability was estimated as the proportion of simulations in which the true was contained in its CI.
Type 1 error when testing genetic correlations
For each combination of potential heritability values ( ), we simulated 10,000 pairs of phenotype vectors ( ) under the null, i.e., , estimated their genetic correlations, and calculated their p values as described above based on the beta approximation to the PMF. After obtaining the p values for all the 10,000 simulated data, we estimated the type 1 error rate, also called the size of the test, by checking the percentage of these simulation rejecting the null given an value.
Computational optimizations and assessments
The computational burden for performing parametric bootstrap increases with the size of the kinship matrix. We implemented a few approaches to optimize computations, all of which rely on the realistic assumption that the kinship matrix can be ordered by blocks, where individuals in different blocks are unrelated to each other. In other words, the kinship coefficient describing relatedness between individuals in different blocks is set to zero, to create a block matrix. While this may lead to some loss of information (as genetic correlation estimation using individual-level data relies on genetic relatedness between individuals), this approach is standard in the analysis of large genetic datasets,30 because it balances the low contribution of small kinship coefficients with the fact that computations are made possible by this sparsity.
Thus, to enable analysis of medium-to-large genetic datasets, we implemented all computations by blocks of the kinship matrix. Because the JHS is relatively small, we artificially increased it for this analysis by (a) using a kinship matrix from all available JHS individuals in the dataset , regardless of having protein measurements (n = 3,418), and (b) generating larger, blocked, kinship matrices where the blocks are . We generated and , with the and matrix being block matrices with 3 and 10 blocks, corresponding with datasets with sample sizes of 10,254 and 34,180, respectively (settings D and E).
Specific optimization steps speeding up the parametric bootstrap included the following.
- (1)Simulating outcomes from a multivariate normal distribution: The parametric bootstrap requires generating random outcomes from a multivariate normal distribution corresponding to the kinship matrix. We sampled these outcomes according to the blocks of the sparse block kinship matrix, and used a function from the mvnfast R package, using multiple cores. The package provides computational optimization through the Rcpp\RcppArmadillo packages and parallelization through the OpenMP API. We aggregated the resulting simulated outcomes at the end of the computations.
- (2)Estimation of genetic correlation and heritabilities between each pair of simulated outcomes: We implemented two custom functions using C++ combined with parallelized computations via multithreading. Thus, each of the two functions performs computations on the dataset in multiple blocks in parallel using C++ multithreading and the Matrix C++ class for performing mathematical operations on matrices efficiently. Results computed by each thread for separate blocks are then aggregated into the final result. Additionally, we further reduced the memory requirements (RAM) by performing computations of genetic correlations and heritabilities in blocks of 1,000 simulations at a time (this is faster than computations of one simulation at a time).
All code is available on the open access GitHub repository https://github.com/YanaHrytsenko/Genetic_Correlations_CIs. All the computations were performed on the Amazon Web Services platform using r5a.8xlarge instances with 32 vCPUs and 256G of RAM available on each instance.
To assess the computational needs of the parametric bootstrap approach, we estimated the peak RAM use and computational time of a single parametric bootstrap computing “job” (averaged over 10 jobs). In our implementation, the parametric bootstrap (Figure 1, top) is performed in 2,000 computational jobs, according to the number of combinations of the heritability and genetic correlation parameters, each one performing 10,000 simulations. We provide these computing measures when using , and
Results
Simulation studies
In primary results, for a given true value of genetic correlation we calculated both the coverage probability and the average width of 95% CIs by averaging the corresponding estimates over all the 100 true heritability scenarios (10 for and 10 for ). Table S1 provides estimated coverage by all combinations of (true) . Figure 2 provides the estimated coverage probabilities for the compared methods in simulations, and Figure 3 provides the averaged CI widths. The PMF approach provides appropriate coverage across the three settings defined by the kinship matrices. The beta approximation to the PMF results in under-coverage across the simulated values setting A, but improved substantially in settings B and C when the simulated sample size increased. Still the average width of the CIs was lower when using the empirical PMF. In setting A, GCTA had an appropriate coverage only when was set to 0.05. The Fisher’s transformation tended to result in under-coverage.Figure 2. Estimated coverage of 95% CIs of genetic correlations in the primary simulationsThe columns represent different kinship matrix sizes: Setting A denotes the use of original-size kinship matrix (n = 2,050), setting B denotes the use of the double-size kinship matrix (n = 4,100), and setting C denotes the use of the triple-size kinship matrix (n = 6,150). The rows represent the four approaches for constructing CIs, including parametric bootstrap PMF, beta approximation for parametric bootstrap PMF, Fisher’s transformation, and GCTA package use of normal distribution approximation. Only parts of the analyses were carried out on the GCTA package due to the high computational resources required.Figure 3. Average widths of the CIs of the genetic correlations in the primary simulationsThe columns represent different kinship matrix sizes: Setting A denotes the use of original-size kinship matrix (n = 2,050), setting B denotes the use of the double-size kinship matrix (n = 4,100), and setting C denotes the use of the triple-size kinship matrix (n = 6,150). The rows represent the four approaches for constructing CIs, including parametric bootstrap PMF, beta approximation for parametric bootstrap PMF, Fisher’s transformation, and GCTA package. Only parts of the analyses were carried out on the GCTA package due to the high computational resources required.
Figure 4 compares coverage and CI widths when using the empirical PMF approach to compute CIs and settings or in the simulations generating data. It demonstrates that there is essentially no difference in the results.Figure 4. Setting environmental correlation between error terms has no effect on genetic correlation estimatesComparison of coverage and CI widths when using the empirical PMF approach to compute CIs and settings or in the simulations generating data.
We also used the simulations to estimate type 1 error when using the CI inversion methods with the beta approximation to the PMF to compute association p values. The results are visualized in Figure 5. Here, we also estimated type 1 error by combinations of specific values. With , the type 1 error was controlled across heritability combinations in settings B and C, but not in setting A. While it is unsurprising that the type 1 error is not controlled when heritability values of either one of the two traits are very small, the test was also somewhat inflated in setting A when the two heritabilities were fairly high. Over all, the beta approximation method to the PMF is promising for computing high coverage CIs and p values when the sample size is sufficiently large.Figure 5. Type 1 error estimates when using the CI inversion method and the beta approximation to perform association testingVisualization of type 1 error ( ) when using the CI inversion approach coupled with the beta approximation to the PMF to generate CIs. The results are provided for each simulation settings and by values of .
Assessment of computing time and RAM needs
To thoroughly assess performance and computational needs in larger sample sizes, we also performed analyses of settings leveraging the kinship matrix of all JHS individuals available in the data , and larger, blocked matrices, 3X , and . The latter two settings (3X and ) are referred to as settings D and E, and for these we also report coverage and CI width results in Figures S1 and S2. These results demonstrate consistent patterns with settings A–C, with appropriate coverage when using both the PMF and the beta approximation to the PMF approaches, and under-coverage using the Fisher’s approximation near the boundaries of the parameter space. The figures also demonstrate that the average lengths of the CIs using the beta approximation are higher than the corresponding lengths when using the PMF.
We also conducted computational performance testing on Windows 11 Pro machine, version 22H2, 64-bit, with 13th Gen Intel Core i9-13900K, 3.00 GHz, 3.72 TB of storage, 128Gb of RAM, and 32 vCPUs using peakRAM R package. Recall that we apply the parametric bootstrap procedures as 2,000 computational “jobs” (using computing cluster language) that perform simulations (top part in Figure 1), each with a different combination of parameters. Here, we report peak RAM and time elapsed for each of the major computational tasks in a computational job, and the overall time elapsed for the complete job. These estimates are averaged across 10 runs of computational job. The results are provided in Table 1. One can see that peak RAM depends on the smallest block in the blocked kinship matrix, due to the implementation of computations by blocks. The computing time of genetic correlation and heritability estimation as a part of the parametric bootstrap increases with higher overall sample sizes. Overall computing times of the parametric bootstrap computing job were 1,516, 2,428, and 11,140 s for the parametric bootstrap simulations based on 3X , and , respectively.Table 1. Peak RAM and timing analysis of parametric bootstrap computational tasks Load kinship matrixRAM: 89.2time elapsed: 0.15RAM: 802.4time elapsed: 0.92RAM: 8915time elapsed: 11.76Sample 10,000 datasets from multivariate normal distribution (rmvn function)RAM: 7,119.12time elapsed: 1,317.48RAM: 7,275.57time elapsed: 1,432.53RAM: 7,823.31time elapsed: 1,514.61Estimate genetic correlation over all 10,000 simulated outcome pairsRAM: 14,820time elapsed: 21.57RAM: 14,820time elapsed: 304.60RAM: 3,4698.24time elapsed: 3,068.57Estimate heritability over all 10,000 simulated outcomesRAM: 14,820time elapsed: 10.10RAM: 14,820time elapsed: 149.57RAM: 18,892.8Time elapsed: 1,621.14Overall timetime elapsed: 1,515.67time elapsed: 2,427.97Time elapsed: 11,139.55The table reports computational time (seconds) and peak random access memory (RAM; MiB) for the parametric bootstrap main computational tasks in a single computing job and overall (the complete computational job). These analyses were performed using the kinship matrix of all available JHS individuals in the dataset (regardless of protein data) with , and over block matrices mimicking larger kinship matrix with n = 10,254, and with n = 34,180. Load kinship matrix refers to reading a full-sized kinship matrix saved as an R RDS file. Note that the heritability estimation step is performed twice: once for each simulated outcome out of the pair of outcomes.
Application to genetically determined protein-protein networks in the JHS
We estimated heritabilities and genetic correlations for every pair of proteins among the 1,317 proteins available in JHS, in an analysis adjusted to age, sex, and BMI (in which protein measures were first regressed over these covariates prior to estimation of genetic correlations based on the resulting residuals), and in an unadjusted analysis. Characteristics of the JHS dataset are provided in Table 2. Of the study participants, 61% were women. Individuals were 55 years of age (males) and 56 years of age (females) on average, and were mostly overweight. Some individuals were close relatives. For example, there were 341 pairs of individuals with estimated coefficient of relationship , and 1,113 pairs of individuals with coefficient of relationship (considering the total number of unique pairs of individuals, this corresponds with 0.05% of all pairs of participants).Table 2JHS dataset characteristics stratified by sexCharacteristicFemaleMaleN1,252798Age, yearsa57 (46–65)55 (45–65)BMIa32 (27–37)29 (26–32)aMedian (interquartile range).
Tables S2 and S3 provide the estimated heritabilities of all proteins in the dataset from analysis unadjusted and adjusted to covariates (age, sex, and BMI), respectively. Based on the simulations using this specific dataset, we removed from consideration proteins with estimated heritabilities , as genetic correlations and p values using the beta approximation method are less reliable compared with higher values of (real, not estimated) heritabilities. We also excluded proteins with estimated , because such a high may suggest a problem with the measurement and/or genetic characterization of a protein (e.g., technical issue with the platform, genetic variants segregated to a few families, etc.). After the above filtering, there were 403 and 431 proteins, or 81,406 and 93,096 protein-protein pairs, available for genetic correlation analysis in the covariate-adjusted and unadjusted analyses, respectively. For each set of the proteins (adjusted and unadjusted), it took approximately 2.5 h to estimate the genetic correlations and approximately 12 min to construct the CIs, based on the previously constructed parametric bootstrap reference results, for all the protein-protein pairs on a MacBook Pro laptop with an M1 chip. Full results from genetic correlation estimates for these sets of proteins are provided in Tables S4 and S5. Figure 6 visualizes the comparison between estimated phenotypic (Pearson) and genetic correlations across these phenotype pairs. The figure suggests that, for this set of highly heritable proteins, genetic correlations tend to be higher than Pearson correlations (to see this, one needs to focus in Figure 5 on the bright hexbins because they represent many more protein pairs compared to dark hexbins).Figure 6. Estimated Pearson versus genetic correlations between heritable proteinsThe figure compares the sample Pearson correlation with the estimated genetic correlation for all protein pairs for which the estimated heritability for each of the proteins. The color of each hexbin represents the number (count) of protein pairs with x- and y axis values falling under the hexbin.
Protein-protein network
We visualize the results in a protein-protein network. Due to the large number of protein pairs, we focused the network resulting from protein-protein genetic correlations passing a p value threshold. We computed p values for the genetic correlations between the limited set of heritable and “valid” proteins (with heritability estimates that are not egregiously high) using the beta approximation to the PMF, and applied a false discovery rate (FDR) correction using the Benjamini-Hochberg procedure.31 The considered pairs of proteins are those with FDR-adjusted genetic correlation p < 0.01. This corresponds with 253 and 294 pairs of genetically correlated proteins in adjusted and unadjusted analysis, respectively. Figure 7 visualizes these results. The size of each node represents its degree, with larger ones being “hub nodes/proteins,” (genetically) associated with a large number of proteins. See Tables S6 and S7 for estimated genetic correlations between pairs of proteins selected based on the criteria described above. Table S8 contains a list of the top 10 hub nodes/proteins and their connections, i.e., the list of proteins connected to each of these hub nodes, both in covariate-adjusted and unadjusted analyses. Visually, the network seems to be less connected (and we also know that the number of connections decreased) in analysis that adjusted for age, sex, and BMI. It is likely that genetic correlations decreased because BMI has strong effects on proteins, and the genetic effects on BMI are also strong, so when BMI was adjusted for, genetic effects inducing correlations between proteins were reduced.Figure 7. Networks constructed from top pairs of genetically correlated proteins(A) Protein-protein genetic correlation network using the age, sex, BMI-adjusted proteins. (B) The corresponding network based on unadjusted analysis. The blue edges represent positive genetic correlations, and the gray edges represent negative genetic correlations. Larger nodes are hub proteins where multiple proteins have strong genetic correlations with each other both in covariate adjusted and unadjusted analyses. Some of the hub proteins include APO_D, PREKALLIKREIN, NOTCH_3, HPV_E7_TYPE18, CARBONIC_ANHYDRASE_IV, CDK5_P35, DKK_4, PAK3, TRKC, MIS, C5A, OMD, JAG1, HEPARIN_COFACTOR_II, BFGF_R, and MMP_2, and GDF_11_8.
Discussion
We developed a parametric bootstrap procedure to estimate CIs for the genetic correlation estimator, studied it in simulations, and applied it to learn a protein-protein network using a set of heritable proteins measured in the JHS. The proposed procedure outperforms existing methods in that the resulting CI are not under-covered across the range of potential genetic correlation parameters.
Our bootstrap procedure was inspired by a similar approach developed for heritability CIs.19 Compared with the previous publication focusing on heritability, our approach is complicated by the need to simulate pairs of traits, including their heritabilities and genetic correlation between them, i.e., a grid of three parameters rather than one. Indeed, CIs for genetic correlation depend on trait heritability and are wider when at least one of the traits has low heritability. Thus, the computation burden of our procedure is higher. It is especially important to recognize that this procedure, like that of Schweiger et al.,19 is dataset dependent, because it uses the kinship matrix of the specific dataset. However, our procedure is realistic and useful when many genetic correlations are estimated for the same dataset, as in this work. In this case, the parametric bootstrap simulation step is performed once, but is applied many times. A limitation posed by the high dimensional number of parameters required by the bootstrap procedure (many genetic correlation parameters) is the limited level of coverage due to the discreteness of the bootstrap procedure: we cannot use the estimated conditional PMF of (conditional on the estimated genetic correlation and heritabilities), as it is to obtain CIs at the 1- level when is very small (e.g., ). To address this, we proposed the beta approximation, after local-scale transformation, to the PMF. The beta distribution has two parameters that can be fit to approximate many distribution functions that are on a bounded interval. Based on our simulations, CIs based on the beta approximation tend to be wider than those using the PMF directly, and they can still under-cover the desired distribution in low sample sizes. However, for larger sample sizes their performance improves. Overall, we think that for larger sample sizes, e.g., 6,000 individuals, the beta approximation to the PMF will be very useful in providing reliable CIs and, using the inversion method, p values. It is important to point out that while we performed simulations with a “triple size” JHS kinship matrix, i.e., of n = 6,150 individuals, the effective sample size corresponding with it is much lower than that of real potential datasets with 6,150 individuals. That is, because we simulated a block diagonal matrix. Realistic kinship matrices will have non-zero off diagonal values throughout (unless forced to be zero for computational efficiency purposes30).
We implemented a binary search procedure to compute p values, via the computation of CIs of varying degrees of coverage, i.e., the binary search is on the level of the coverage parameter—where the binary search stops when the maximum coverage that does not include 0 is attained (and once we see that the next potential level of coverage results in a CI that includes 0). Binary search is useful when (a) the function is monotone (as the value decreases eventually we will “yes” cover zero, for all smaller ), and (b) the function cannot be transformed: we cannot create a function that takes as an input “yes” or “no” and output an value. The binary search uses a sensitivity parameter, which may be determined by the desired level. For example, if a researcher prespecifies a criterion for “significance” of genetic correlation as p < 1 × 10^−5^, it is important to be able to determine whether a CI value when = 1 × 10^−5^ covers zero, and so the sensitivity level should be no higher than 1 × 10^−5^; otherwise, associations with p values close to this value may not unambiguously be identified as statistically significant. Thus, the sensitivity level affects power and type 1 error only in the sense that one could determine that a p value is lower than a pre-specified level, but no lower (e.g., one could determine that a p < or that but not that p = for a ). Here, we note that another limitation to the attain sensitivity is that the PMF computation is based on a limited number of simulations, again resulting in a discreate function and thus limited coverage/p value sensitivity. The beta approximation to the PMF addresses this limitation, because the beta function is continuous and, therefore, does not have limited sensitivity.
Existing methods that compute CIs for genetic correlations typically utilize an asymptotic normal distribution argument, at either the untransformed or Fisher’s transformation level. This is appropriate depending on the combination of four factors: sample size, underlying (true) heritabilities of each of the pair of traits, and the underlying genetic correlation. For any given pair of traits and a dataset, any one of these factors may be suboptimal, potentially leading to poor performance of CIs that rely on asymptotic normality. The bootstrap procedure addresses this shortcoming. However, this procedure too does not produce perfect CIs: for low values of heritability of either one of the two traits, the coverage may still be lower than desired in low sample sizes. Note that in reality we do not know the true heritability, we only have estimated heritability. Therefore, we cannot tell whether a CI may not be reliable according to the values of the estimated heritabilities. That is why our main results are provided at an aggregate level, across simulated values of potential heritabilities.
We demonstrated the use of genetic correlations to infer genetically determined protein-protein networks. However, we acknowledge that our analysis is limited by the relatively low sample size, which led to posing a stringent filter requiring at least 0.3 estimated protein heritability for inclusion in the downstream analysis. While we chose to include only edges with estimated FDR-adjusted p < 0.01 (with p values estimated using the beta approximation), other statistical network approaches may generate sparsity using penalized multivariant regression techniques.32^,^33 It would be interesting to extend such approaches to genetic (rather than phenotypic) correlation-based networks. In future work, we will apply the existing framework on larger datasets and develop approximation methods to further speed up the simulations required for the parametric bootstrap and the estimation of heritabilities and genetic correlations, for example, following Wu et al.34
Data and code availability
Code to implement the parametric bootstrap approach is provided in the public GitHub repository https://github.com/YanaHrytsenko/Genetic_Correlations_CIs. Genetic data from the JHS are available via a data use agreement with the database of genotypes and phenotypes (dbGaP), study accession "dbGaP: phs000286". WGS data are available via a data use agreement with dbGaP project “NHLBI TOPMed: The Jackson Heart Study (JHS),” study accession "dbGaP: phs000964". The JHS proteomics dataset used in this analysis is available via a data use agreement with the JHS Data Coordinating Center (DCC), see website https://www.jacksonheartstudy.org/. Alternatively, JHS DCC can be contacted via the email [email protected].
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1van Rheenen W.Peyrot W.J.Schork A.J.Lee S.H.Wray N.R.Genetic correlations of polygenic disease traits: from theory to practice Nat. Rev. Genet.2020195675813117186510.1038/s 41576-019-0137-z · doi ↗ · pubmed ↗
- 2Visscher P.M.Hill W.G.Wray N.R.Heritability in the genomics era--concepts and misconceptions Nat. Rev. Genet.920082552661831974310.1038/nrg 2322 · doi ↗ · pubmed ↗
- 3Bulik-Sullivan B.Finucane H.K.Anttila V.Gusev A.Day F.R.Loh P.-R.Repro Gen Consortium Psychiatric Genomics Consortium Genetic Consortium for Anorexia Nervosa of the Wellcome Trust Case Control Consortium 3Duncan L.An atlas of genetic correlations across human diseases and traits Nat. Genet.472015123612412641467610.1038/ng.3406 PMC 4797329 · doi ↗ · pubmed ↗
- 4O’Connor L.J.Price A.L.Distinguishing genetic correlation from causation across 52 diseases and complex traits Nat. Genet.502018172817343037407410.1038/s 41588-018-0255-0PMC 6684375 · doi ↗ · pubmed ↗
- 5Zhang Y.Elgart M.Kurniansyah N.Spitzer B.W.Wang H.Kim D.Shah N.Daviglus M.Zee P.C.Cai J.Genetic determinants of cardiometabolic and pulmonary phenotypes and obstructive sleep apnoea in HCHS/SOLE Bio Medicine 84202210428810.1016/j.ebiom.2022.104288 PMC 951543736174398 · doi ↗ · pubmed ↗
- 6Ikeda M.Tanaka S.Saito T.Ozaki N.Kamatani Y.Iwata N.Re-evaluating classical body type theories: genetic correlation between psychiatric disorders and body mass index Psychol. Med.482018174517482965197510.1017/S 0033291718000685 PMC 6088781 · doi ↗ · pubmed ↗
- 7Kappelmann N.Arloth J.Georgakis M.K.Czamara D.Rost N.Ligthart S.Khandaker G.M.Binder E.B.Dissecting the Association Between Inflammation, Metabolic Dysregulation, and Specific Depressive Symptoms: A Genetic Correlation and 2-Sample Mendelian Randomization Study JAMA Psychiatr.78202116117010.1001/jamapsychiatry.2020.3436 PMC 757720033079133 · doi ↗ · pubmed ↗
- 8Shi H.Mancuso N.Spendlove S.Pasaniuc B.Local Genetic Correlation Gives Insights into the Shared Genetic Architecture of Complex Traits Am. J. Hum. Genet.10120177377512910008710.1016/j.ajhg.2017.09.022PMC 5673668 · doi ↗ · pubmed ↗
